Run evaluations with the Arcade AI CLI
The Arcade AI Evaluation Framework lets you run evaluations of your tool-enabled language models directly from the command-line interface (CLI). This makes it easy to execute your evaluation suites, gather results, and analyze model performance in a streamlined workflow.
Using the `arcade evals` Command
To run evaluations, use the `arcade evals` command provided by the Arcade CLI. This command searches for evaluation files in the specified directory, executes any functions decorated with `@tool_eval`, and displays the results.
Basic Usage
```
arcade evals <directory>
```
- `<directory>`: The directory containing your evaluation files. By default, the current directory (`.`) is searched.
For example, to run evaluations in the current directory:
```
arcade evals .
```
The Arcade AI Evaluation Framework also supports running a single evaluation file:
```
arcade evals <eval_your_file.py>
```
Evaluation File Naming Convention
The `arcade evals` command looks for Python files that start with `eval_` and end with `.py` (e.g., `eval_math_tools.py`, `eval_slack_messaging.py`). These files should contain your evaluation suites.
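For reference, a minimal evaluation file might look like the sketch below. The import path (`arcade_evals`), the class names (`EvalSuite`, `EvalRubric`, `ExpectedToolCall`, `BinaryCritic`), and the `Math_Add` tool are illustrative assumptions; adapt them to the evaluation SDK version and toolkits you actually have installed.

```python
# eval_math_tools.py -- illustrative sketch; the import path, class names, and
# the "Math_Add" tool are assumptions. Adapt them to your installed Arcade
# evaluation SDK and toolkits.
from arcade_evals import (  # assumed import path; may differ by SDK version
    BinaryCritic,
    EvalRubric,
    EvalSuite,
    ExpectedToolCall,
    tool_eval,
)

# Hypothetical thresholds: fail below 0.8, warn below 0.9.
rubric = EvalRubric(fail_threshold=0.8, warn_threshold=0.9)


@tool_eval()
def math_eval_suite() -> EvalSuite:
    """Evaluation suite for a hypothetical math toolkit."""
    suite = EvalSuite(
        name="Math Tools Evaluation",
        system_message="You are a helpful assistant with access to math tools.",
        rubric=rubric,
    )

    suite.add_case(
        name="Add two large numbers",
        user_message="What is 12345 plus 987654321?",
        expected_tool_calls=[
            ExpectedToolCall(name="Math_Add", args={"a": 12345, "b": 987654321}),
        ],
        critics=[
            BinaryCritic(critic_field="a", weight=0.5),
            BinaryCritic(critic_field="b", weight=0.5),
        ],
    )

    return suite
```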
Command Options
The `arcade evals` command supports several options to customize the evaluation process:
- `--details`, `-d`: Show detailed results for each evaluation case, including critic feedback.
  Example: `arcade evals . --details`
- `--models`, `-m`: Specify the models to use for evaluation. Provide a comma-separated list of model names.
  Example: `arcade evals . --models gpt-4o,gpt-3.5`
- `--max-concurrent`, `-c`: Set the maximum number of concurrent evaluations to run in parallel.
  Example: `arcade evals . --max-concurrent 4`
- `--host`, `-h`: Specify the Arcade Engine address to send evaluation requests to.
- `--port`, `-p`: Specify the port of the Arcade Engine.
- `--tls`: Force TLS for the connection to the Arcade Engine.
- `--no-tls`: Disable TLS for the connection to the Arcade Engine.
Example Command
Running evaluations in the `toolkits/math/evals` directory, showing detailed results, using the `gpt-4o` model:

```
arcade evals toolkits/math/evals --details --models gpt-4o
```
Execution Process
When you run the `arcade evals` command, the following steps occur:
- Preparation: The CLI loads the evaluation suites from the specified directory, looking for files that match the naming convention.
- Execution: The evaluation suites are executed asynchronously. Each suite's evaluation function, decorated with `@tool_eval`, is called with the appropriate configuration, including the model and concurrency settings.
- Concurrency: Evaluations can run concurrently based on the `--max-concurrent` setting, improving efficiency (see the sketch after this list).
- Result Aggregation: Results from all evaluation cases and models are collected and aggregated.
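Conceptually, the concurrency limit acts like a semaphore around individual evaluation runs. The sketch below is not the CLI's actual implementation; it only illustrates, with a stand-in `run_case` coroutine, how a limit of 4 bounds the number of cases in flight at once.

```python
import asyncio
import random

# Illustrative sketch only -- not the CLI's actual implementation.
# `run_case` is a stand-in for evaluating a single case against a model.
async def run_case(case: str) -> str:
    await asyncio.sleep(random.uniform(0.1, 0.3))  # simulate a model call
    return f"{case}: done"

async def run_all(cases: list[str], max_concurrent: int = 4) -> list[str]:
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_with_limit(case: str) -> str:
        async with semaphore:  # at most `max_concurrent` cases run at once
            return await run_case(case)

    return await asyncio.gather(*(run_with_limit(c) for c in cases))

if __name__ == "__main__":
    print(asyncio.run(run_all([f"case-{i}" for i in range(10)])))
```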
Displaying Results
After the evaluations are complete, the results are displayed in a concise and informative format, similar to testing frameworks like `pytest`. The output includes:
- Summary: Shows the total number of cases and how many passed, failed, or issued warnings.

  Example:

  ```
  Summary -- Total: 5 -- Passed: 4 -- Failed: 1
  ```

- Detailed Case Results: For each evaluation case, the status (PASSED, FAILED, WARNED), the case name, and the score are displayed.

  Example:

  ```
  PASSED  Add two large numbers -- Score: 1.00
  FAILED  Send DM with ambiguous username -- Score: 0.75
  ```

- Critic Feedback: If the `--details` flag is used, detailed feedback from each critic is provided, highlighting matches, mismatches, and scores for each evaluated field.

  Example:

  ```
  Details:
  user_name: Match: False, Score: 0.00/0.50
      Expected: johndoe
      Actual: john_doe
  message: Match: True, Score: 0.50/0.50
  ```
Interpreting the Results
- Passed: The evaluation case met or exceeded the fail threshold specified in the rubric.
- Failed: The evaluation case did not meet the fail threshold.
- Warnings: If the score is between the fail threshold and the warn threshold, a warning is issued.
Use the detailed feedback to understand where the model's performance can be improved, particularly focusing on mismatches identified by critics.
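To make the thresholds concrete, the snippet below shows one way a case score could map to a status. The helper and the 0.8/0.9 threshold values are illustrative, not the framework's actual code.

```python
# Illustrative only: mapping a case score to a status for a hypothetical
# rubric with fail_threshold=0.8 and warn_threshold=0.9.
def case_status(score: float, fail_threshold: float = 0.8, warn_threshold: float = 0.9) -> str:
    if score < fail_threshold:
        return "FAILED"
    if score < warn_threshold:
        return "WARNED"  # a warning is issued for scores in this band
    return "PASSED"

print(case_status(1.00))  # PASSED
print(case_status(0.85))  # WARNED
print(case_status(0.75))  # FAILED
```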
Customizing Evaluations
You can customize the evaluation process by adjusting:
- Rubrics: Modify fail and warn thresholds, and adjust weights to emphasize different aspects of evaluation.
- Critics: Add or modify critics in your evaluation cases to target specific arguments or behaviors (a sketch follows this list).
- Concurrency: Adjust the `--max-concurrent` option to optimize performance based on your environment.
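As a hedged illustration of the first two knobs, the snippet below tightens a rubric and weights two critics differently. The class names (`EvalRubric`, `BinaryCritic`, `SimilarityCritic`) and their parameters are assumptions based on the framework described above; match them to your installed evaluation SDK.

```python
# Illustrative sketch -- class names and parameters are assumptions; adapt
# them to the evaluation framework version you have installed.
from arcade_evals import BinaryCritic, EvalRubric, SimilarityCritic  # assumed import path

# Tighter rubric: fail below 0.85, warn below 0.95.
strict_rubric = EvalRubric(fail_threshold=0.85, warn_threshold=0.95)

# Weight the exact-match field more heavily than the free-text field.
critics = [
    BinaryCritic(critic_field="user_name", weight=0.6),    # exact match required
    SimilarityCritic(critic_field="message", weight=0.4),  # fuzzy text comparison
]
```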
Handling Multiple Models
You can evaluate multiple models in a single run by specifying them in the `--models` option as a comma-separated list. This allows you to compare the performance of different models across the same evaluation suites.
Example:

```
arcade evals . --models gpt-4o,gpt-3.5
```
Considerations
- Engine Availability: Ensure the Arcade Engine is running and accessible. You can specify the host and port if you are running the engine locally or on a different server.
- Authentication: Make sure you are logged in and have the necessary API keys configured.
- Evaluation Files: Ensure your evaluation files are correctly named and contain evaluation suites decorated with `@tool_eval`.
Conclusion
Running evaluations using the Arcade CLI provides a powerful and convenient way to assess the tool-calling capabilities of your language models. By leveraging the `arcade evals` command, you can efficiently execute your evaluation suites, analyze results, and iterate on your models and tool integrations.
Integrating this evaluation process into your development workflow helps ensure that your models interact with tools as expected, enhances reliability, and builds confidence in deploying actionable LLMs in production environments.