Why evaluate tools?
When deploying large language models (LLMs) with tool-calling capabilities in production environments, it's essential to ensure their effectiveness and reliability. This calls for an evaluation process that goes beyond traditional testing and focuses on two key aspects:
- Tool Utilization: Assessing how efficiently the LLM uses the available tools.
- Intent Understanding: Evaluating the LLM's ability to comprehend user intents and select the appropriate tools to fulfill those intents.
Arcade AI's Evaluation Framework provides a comprehensive approach to assess and validate the tool-calling capabilities of LLMs, ensuring they meet the high standards required for real-world applications.
Why Evaluate Tool Calling by Task?
LLMs augmented with tool-use capabilities can perform complex tasks by invoking external tools or APIs. However, without proper evaluation, these models might:
- Misinterpret user intents, leading to incorrect tool selection.
- Provide incorrect arguments to tools, causing failures or undesired outcomes.
- Fail to execute the necessary sequence of tool calls, especially in tasks requiring multiple steps.
Evaluating tool calling by task ensures that the LLM can handle specific scenarios reliably, providing confidence in its performance in production settings.
Evaluation Scoring
Scoring in the evaluation framework is based on comparing the model's actual tool calls with the expected ones for each evaluation case. The total score for a case depends on:
- Tool Selection: Whether the model selected the correct tools for the task.
- Tool Call Arguments: The correctness of the arguments provided to the tools, evaluated by critics.
- Evaluation Rubric: Each aspect of the evaluation is weighted according to the rubric, affecting its impact on the final score.
The evaluation result includes:
- Score: A normalized value between 0.0 and 1.0.
- Result:
  - Passed: The score is at or above the warn threshold.
  - Warned: The score is at or above the fail threshold but below the warn threshold (the case passes, with a warning).
  - Failed: The score is below the fail threshold.
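To make the arithmetic concrete, here is an illustrative sketch (not the framework's actual code) of how a weighted score might be computed for one case and mapped to a result using the thresholds above. The weights and match values are made up for the example; the framework performs this bookkeeping for you.

```python
# Illustrative only: the framework computes this internally.
# Assumed weights: tool selection 1.0; two argument critics at 0.5 each.
tool_selection_weight = 1.0
tool_selected_correctly = True      # the model picked the expected tool
critic_results = [                  # (weight, match score in [0, 1])
    (0.5, 1.0),                     # argument "a" matched exactly
    (0.5, 0.0),                     # argument "b" did not match
]

earned = tool_selection_weight * (1.0 if tool_selected_correctly else 0.0)
earned += sum(weight * score for weight, score in critic_results)
total = tool_selection_weight + sum(weight for weight, _ in critic_results)
score = earned / total              # normalized to 0.0-1.0 -> 0.75 here

fail_threshold, warn_threshold = 0.85, 0.95
if score < fail_threshold:
    result = "Failed"
elif score < warn_threshold:
    result = "Warned"
else:
    result = "Passed"
print(f"{score:.2f} -> {result}")   # 0.75 -> Failed
```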
Critics: Types and Usage
Critics are essential for evaluating the correctness of tool call arguments. Different types of critics serve various evaluation needs:
BinaryCritic
A `BinaryCritic` checks for an exact match between the expected and actual values after type casting.
- Use Case: When exact values are required (e.g., specific numeric parameters).
- Example: Ensuring the model provides the exact user ID in a function call.
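As a minimal sketch (assuming the critic classes are exposed by your Arcade evals package; the import path below may differ by SDK version), a `BinaryCritic` for a user-ID argument could look like this:

```python
from arcade_evals import BinaryCritic  # assumed import path; adjust to your SDK version

# Require the `user_id` argument to match the expected value exactly.
user_id_critic = BinaryCritic(critic_field="user_id", weight=0.5)
```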
NumericCritic
A `NumericCritic` evaluates numeric values within a specified range, allowing for acceptable deviations.
- Use Case: When values can be approximate but should be within a certain threshold.
- Example: Accepting approximate results in mathematical computations due to floating-point precision.
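A sketch under the same import assumption; the field name and range are hypothetical, and `value_range` is a guess at how the acceptable range is configured, so check your SDK for the exact signature:

```python
from arcade_evals import NumericCritic  # assumed import path; adjust to your SDK version

# Accept values for the hypothetical `value` argument that fall within the expected range.
# `value_range` is an assumed parameter name for the acceptable range.
value_critic = NumericCritic(critic_field="value", weight=0.5, value_range=(0, 100))
```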
SimilarityCritic
A `SimilarityCritic` measures the similarity between expected and actual string values using metrics such as cosine similarity.
- Use Case: When the exact wording isn't critical, but the content should be similar.
- Example: Evaluating if the message content in a communication tool is similar to the expected message.
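A sketch under the same import assumption; `similarity_threshold` is the type-specific parameter mentioned later on this page, and the value here is illustrative:

```python
from arcade_evals import SimilarityCritic  # assumed import path; adjust to your SDK version

# Score the `message` argument by textual similarity instead of exact equality.
message_critic = SimilarityCritic(critic_field="message", weight=0.6, similarity_threshold=0.8)
```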
DatetimeCritic
A `DatetimeCritic` evaluates the closeness of datetime values within a specified tolerance.
- Use Case: When datetime values should be within a certain range of the expected time.
- Example: Verifying if a scheduled event time is close enough to the intended time.
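A sketch under the same import assumption; the `tolerance` parameter name is a guess at how the acceptable time difference is configured, so check your SDK for the exact field:

```python
from datetime import timedelta

from arcade_evals import DatetimeCritic  # assumed import path; adjust to your SDK version

# Accept a start time within five minutes of the expected datetime.
# `tolerance` is an assumed parameter name for the allowed difference.
start_time_critic = DatetimeCritic(critic_field="start_time", weight=0.5, tolerance=timedelta(minutes=5))
```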
Example Usage
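A sketch of how critics are typically combined for a single expected tool call, using the channel-posting scenario described later on this page; the import path and the 0.4/0.6 weight split are illustrative assumptions:

```python
from arcade_evals import BinaryCritic, SimilarityCritic  # assumed import path

# One critic per argument to check: the channel must match exactly,
# while the message only needs to be sufficiently similar.
critics = [
    BinaryCritic(critic_field="channel_name", weight=0.4),
    SimilarityCritic(critic_field="message", weight=0.6, similarity_threshold=0.8),
]
```

The weights assigned across critics (and tool selection) determine each part's share of the final score.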
Supported metrics
- Cosine Similarity: Measures the angle between two vectors in a multi-dimensional space; this is the metric `SimilarityCritic` uses when comparing the expected and actual strings.
Choosing the Right Critic
- Exact Matches Needed: Use BinaryCritic for strict equality.
- Numeric Ranges: Use NumericCritic when a tolerance is acceptable.
- Textual Similarity: Use SimilarityCritic for comparing messages or descriptions.
- Datetime Tolerance: Use DatetimeCritic when a tolerance is acceptable for datetime comparisons.
Critics are defined with fields such as `critic_field`, `weight`, and parameters specific to their types (e.g., `similarity_threshold` for `SimilarityCritic`).
Rubrics and Setting Thresholds
An EvalRubric defines the evaluation criteria and thresholds for determining pass/fail outcomes. Key components include:
- Fail Threshold: The minimum score required to pass the evaluation.
- Warn Threshold: The score threshold for issuing a warning.
- Weights: Assigns importance to different aspects of the evaluation (e.g., tool selection, argument correctness).
Setting Up a Rubric
- Define Fail and Warn Thresholds: Choose values between 0.0 and 1.0 to represent acceptable performance levels.
- Assign Weights: Allocate weights to tool selection and critics to reflect their importance in the overall evaluation.
- Configure Failure Conditions: Set flags like `fail_on_tool_selection` to enforce strict criteria.
Example Rubric Configuration:
A rubric that requires a score of at least 0.85 to pass and issues a warning if the score is between 0.85 and 0.95:
- Fail Threshold: 0.85
- Warn Threshold: 0.95
- Fail on Tool Selection: True
- Tool Selection Weight: 1.0
```python
rubric = EvalRubric(
    fail_threshold=0.85,
    warn_threshold=0.95,
    fail_on_tool_selection=True,
    tool_selection_weight=1.0,
)
```
Building an Evaluation Suite
An EvalSuite orchestrates the running of multiple evaluation cases. Here's how to build one:
- Initialize EvalSuite: Provide a name, system message, tool catalog, and rubric.
- Add Evaluation Cases: Use `add_case` or `extend_case` to include various scenarios.
- Specify Expected Tool Calls: Define the tools and arguments expected for each case.
- Assign Critics: Attach critics relevant to each case to evaluate specific arguments.
- Run the Suite: Execute the suite using the Arcade CLI to collect results.
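Putting the steps above together, a suite definition might look like the sketch below. The import path, the `@tool_eval` decorator, and the exact `EvalSuite` and `add_case` keyword names are assumptions based on the examples on this page; `catalog` stands in for a tool catalog you have already populated with the tools under test.

```python
from arcade_evals import EvalRubric, EvalSuite, tool_eval  # assumed import path

catalog = ...  # assumed: a ToolCatalog populated with the tools under test


@tool_eval()
def my_tools_eval_suite() -> EvalSuite:
    suite = EvalSuite(
        name="My Tools Evaluation",
        system_message="You are a helpful assistant with access to the registered tools.",
        catalog=catalog,
        rubric=EvalRubric(fail_threshold=0.85, warn_threshold=0.95),
    )

    # suite.add_case(...) calls go here; see the math and Slack examples below.

    return suite
```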
Example: Math Tools Evaluation Suite
An evaluation suite for math tools might include cases such as:
- Adding Two Large Numbers:
  - User Message: "Add 12345 and 987654321"
  - Expected Tool Call: `add(a=12345, b=987654321)`
  - Critics: `BinaryCritic` for arguments `a` and `b`
- Calculating Square Roots:
  - User Message: "What is the square root of 3224990521?"
  - Expected Tool Call: `sqrt(a=3224990521)`
  - Critics: `BinaryCritic` for argument `a`
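Sketched as code under the same assumptions as the suite skeleton above (the keyword names, `ExpectedToolCall`, and the import path are unverified guesses based on this page), these two cases might be added as follows, where `suite` is the suite being built and `add` and `sqrt` are the math tool functions:

```python
from arcade_evals import BinaryCritic, ExpectedToolCall  # assumed import path

# Case 1: both arguments of `add` must match exactly.
suite.add_case(
    name="Adding Two Large Numbers",
    user_message="Add 12345 and 987654321",
    expected_tool_calls=[ExpectedToolCall(func=add, args={"a": 12345, "b": 987654321})],
    critics=[
        BinaryCritic(critic_field="a", weight=0.5),
        BinaryCritic(critic_field="b", weight=0.5),
    ],
)

# Case 2: the single argument of `sqrt` must match exactly.
suite.add_case(
    name="Calculating Square Roots",
    user_message="What is the square root of 3224990521?",
    expected_tool_calls=[ExpectedToolCall(func=sqrt, args={"a": 3224990521})],
    critics=[BinaryCritic(critic_field="a", weight=1.0)],
)
```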
Example: Slack Messaging Tools Evaluation Suite
An evaluation suite for Slack messaging tools might include cases such as:
- Sending a Direct Message:
  - User Message: "Send a direct message to johndoe saying 'Hello, can we meet at 3 PM?'"
  - Expected Tool Call: `send_dm_to_user(user_name='johndoe', message='Hello, can we meet at 3 PM?')`
  - Critics: `BinaryCritic` for `user_name`, `SimilarityCritic` for `message`
- Posting a Message to a Channel:
  - User Message: "Post 'The new feature is now live!' in the #announcements channel"
  - Expected Tool Call: `send_message_to_channel(channel_name='announcements', message='The new feature is now live!')`
  - Critics: `BinaryCritic` for `channel_name`, `SimilarityCritic` for `message`
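Under the same assumptions as the earlier sketches, the direct-message case might be expressed as below; the channel-posting case follows the same pattern with `send_message_to_channel`, a `BinaryCritic` on `channel_name`, and a `SimilarityCritic` on `message`. The weight split is illustrative.

```python
from arcade_evals import BinaryCritic, ExpectedToolCall, SimilarityCritic  # assumed import path

# The recipient must match exactly; the message only needs to be similar to the expected text.
suite.add_case(
    name="Sending a Direct Message",
    user_message="Send a direct message to johndoe saying 'Hello, can we meet at 3 PM?'",
    expected_tool_calls=[
        ExpectedToolCall(
            func=send_dm_to_user,
            args={"user_name": "johndoe", "message": "Hello, can we meet at 3 PM?"},
        )
    ],
    critics=[
        BinaryCritic(critic_field="user_name", weight=0.4),
        SimilarityCritic(critic_field="message", weight=0.6),
    ],
)
```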