Run evaluations with the Arcade AI CLI
The Arcade AI Evaluation Framework lets you run evaluations of your tool-enabled language models directly from the command-line interface (CLI). This makes it easy to execute your evaluation suites, gather results, and analyze model performance in a streamlined workflow.
Using the `arcade evals` Command
To run evaluations, use the `arcade evals` command provided by the Arcade CLI. This command searches for evaluation files in the specified directory, executes any functions decorated with `@tool_eval`, and displays the results.
Basic Usage
```
arcade evals <directory>
```
- `<directory>`: The directory containing your evaluation files. By default, the current directory (`.`) is searched.
For example, to run evaluations in the current directory:
```
arcade evals .
```
The Arcade AI Evaluation Framework also supports running a single evaluation file:
```
arcade evals <eval_your_file.py>
```
Evaluation File Naming Convention
The `arcade evals` command looks for Python files that start with `eval_` and end with `.py` (e.g., `eval_math_tools.py`, `eval_slack_messaging.py`). These files should contain your evaluation suites.
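For reference, a minimal evaluation file might look like the sketch below. The import path (`arcade_evals`), the class names (`EvalSuite`, `EvalRubric`, `ExpectedToolCall`, `BinaryCritic`), and the `Math_Add` tool are illustrative assumptions; adapt them to the evaluation SDK version and toolkits you actually have installed.

```python
# eval_math_tools.py -- illustrative sketch; the import path, class names, and
# the "Math_Add" tool are assumptions. Adapt them to your installed Arcade
# evaluation SDK and toolkits.
from arcade_evals import (  # assumed import path; may differ by SDK version
    BinaryCritic,
    EvalRubric,
    EvalSuite,
    ExpectedToolCall,
    tool_eval,
)

# Hypothetical thresholds: fail below 0.8, warn below 0.9.
rubric = EvalRubric(fail_threshold=0.8, warn_threshold=0.9)


@tool_eval()
def math_eval_suite() -> EvalSuite:
    """Evaluation suite for a hypothetical math toolkit."""
    suite = EvalSuite(
        name="Math Tools Evaluation",
        system_message="You are a helpful assistant with access to math tools.",
        rubric=rubric,
    )

    suite.add_case(
        name="Add two large numbers",
        user_message="What is 12345 plus 987654321?",
        expected_tool_calls=[
            ExpectedToolCall(name="Math_Add", args={"a": 12345, "b": 987654321}),
        ],
        critics=[
            BinaryCritic(critic_field="a", weight=0.5),
            BinaryCritic(critic_field="b", weight=0.5),
        ],
    )

    return suite
```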
Command Options
The `arcade evals` command supports several options to customize the evaluation process:
- `--details`, `-d`: Show detailed results for each evaluation case, including critic feedback.
  Example: `arcade evals . --details`
- `--models`, `-m`: Specify the models to use for evaluation. Provide a comma-separated list of model names.
  Example: `arcade evals . --models gpt-4o,gpt-3.5`
- `--max-concurrent`, `-c`: Set the maximum number of concurrent evaluations to run in parallel.
  Example: `arcade evals . --max-concurrent 4`
- `--host`, `-h`: Specify the Arcade Engine address to send evaluation requests to.
- `--port`, `-p`: Specify the port of the Arcade Engine.
- `--tls`: Force TLS for the connection to the Arcade Engine.
- `--no-tls`: Disable TLS for the connection to the Arcade Engine.
Example Command
Running evaluations in the `toolkits/math/evals` directory, showing detailed results, using the `gpt-4o` model:

```
arcade evals toolkits/math/evals --details --models gpt-4o
```
Execution Process
When you run the `arcade evals` command, the following steps occur:
- Preparation: The CLI loads the evaluation suites from the specified directory, looking for files that match the naming convention.
- Execution: The evaluation suites are executed asynchronously. Each suite's evaluation function, decorated with `@tool_eval`, is called with the appropriate configuration, including the model and concurrency settings.
- Concurrency: Evaluations can run concurrently based on the `--max-concurrent` setting, improving efficiency (see the sketch after this list).
- Result Aggregation: Results from all evaluation cases and models are collected and aggregated.
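Conceptually, the concurrency limit acts like a semaphore around individual evaluation runs. The sketch below is not the CLI's actual implementation; it only illustrates, with a stand-in `run_case` coroutine, how a limit of 4 bounds the number of cases in flight at once.

```python
import asyncio
import random

# Illustrative sketch only -- not the CLI's actual implementation.
# `run_case` is a stand-in for evaluating a single case against a model.
async def run_case(case: str) -> str:
    await asyncio.sleep(random.uniform(0.1, 0.3))  # simulate a model call
    return f"{case}: done"

async def run_all(cases: list[str], max_concurrent: int = 4) -> list[str]:
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_with_limit(case: str) -> str:
        async with semaphore:  # at most `max_concurrent` cases run at once
            return await run_case(case)

    return await asyncio.gather(*(run_with_limit(c) for c in cases))

if __name__ == "__main__":
    print(asyncio.run(run_all([f"case-{i}" for i in range(10)])))
```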
Displaying Results
After the evaluations are complete, the results are displayed in a concise and informative format, similar to testing frameworks like `pytest`. The output includes:
- Summary: Shows the total number of cases and how many passed, failed, or issued warnings.

  Example:

  ```
  Summary -- Total: 5 -- Passed: 4 -- Failed: 1
  ```

- Detailed Case Results: For each evaluation case, the status (PASSED, FAILED, WARNED), the case name, and the score are displayed.

  Example:

  ```
  PASSED  Add two large numbers -- Score: 1.00
  FAILED  Send DM with ambiguous username -- Score: 0.75
  ```

- Critic Feedback: If the `--details` flag is used, detailed feedback from each critic is provided, highlighting matches, mismatches, and scores for each evaluated field.

  Example:

  ```
  Details:
  user_name: Match: False, Score: 0.00/0.50
      Expected: johndoe
      Actual: john_doe
  message: Match: True, Score: 0.50/0.50
  ```
Interpreting the Results
- Passed: The evaluation case met or exceeded the fail threshold specified in the rubric.
- Failed: The evaluation case did not meet the fail threshold.
- Warnings: If the score is between the fail threshold and the warn threshold, a warning is issued.
Use the detailed feedback to understand where the model's performance can be improved, particularly focusing on mismatches identified by critics.
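To make the thresholds concrete, the snippet below shows one way a case score could map to a status. The helper and the 0.8/0.9 threshold values are illustrative, not the framework's actual code.

```python
# Illustrative only: mapping a case score to a status for a hypothetical
# rubric with fail_threshold=0.8 and warn_threshold=0.9.
def case_status(score: float, fail_threshold: float = 0.8, warn_threshold: float = 0.9) -> str:
    if score < fail_threshold:
        return "FAILED"
    if score < warn_threshold:
        return "WARNED"  # a warning is issued for scores in this band
    return "PASSED"

print(case_status(1.00))  # PASSED
print(case_status(0.85))  # WARNED
print(case_status(0.75))  # FAILED
```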
Customizing Evaluations
You can customize the evaluation process by adjusting:
- Rubrics: Modify fail and warn thresholds, and adjust weights to emphasize different aspects of evaluation.
- Critics: Add or modify critics in your evaluation cases to target specific arguments or behaviors (a sketch follows this list).
- Concurrency: Adjust the `--max-concurrent` option to optimize performance based on your environment.
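As a hedged illustration of the first two knobs, the snippet below tightens a rubric and weights two critics differently. The class names (`EvalRubric`, `BinaryCritic`, `SimilarityCritic`) and their parameters are assumptions based on the framework described above; match them to your installed evaluation SDK.

```python
# Illustrative sketch -- class names and parameters are assumptions; adapt
# them to the evaluation framework version you have installed.
from arcade_evals import BinaryCritic, EvalRubric, SimilarityCritic  # assumed import path

# Tighter rubric: fail below 0.85, warn below 0.95.
strict_rubric = EvalRubric(fail_threshold=0.85, warn_threshold=0.95)

# Weight the exact-match field more heavily than the free-text field.
critics = [
    BinaryCritic(critic_field="user_name", weight=0.6),    # exact match required
    SimilarityCritic(critic_field="message", weight=0.4),  # fuzzy text comparison
]
```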
Handling Multiple Models
You can evaluate multiple models in a single run by specifying them in the `--models` option as a comma-separated list. This allows you to compare the performance of different models across the same evaluation suites.
Example:

```
arcade evals . --models gpt-4o,gpt-3.5
```
Considerations
- Engine Availability: Ensure the Arcade Engine is running and accessible. You can specify the host and port if you are running the engine locally or on a different server.
- Authentication: Make sure you are logged in and have the necessary API keys configured.
- Evaluation Files: Ensure your evaluation files are correctly named and contain evaluation suites decorated with `@tool_eval`.
Conclusion
Running evaluations using the Arcade CLI provides a powerful and convenient way to assess the tool-calling capabilities of your language models. By leveraging the `arcade evals` command, you can efficiently execute your evaluation suites, analyze results, and iterate on your models and tool integrations.
Integrating this evaluation process into your development workflow helps ensure that your models interact with tools as expected, enhances reliability, and builds confidence in deploying actionable LLMs in production environments.