Multi-Agent Evaluation

Overview

Multi-agent evaluation enables systematic testing of agents through simulated user interactions and automated scoring. This pattern combines simulated user agents, evaluator agents, and metrics aggregation to provide objective quality assessment.

Architecture

The evaluation graph connects three roles: the agent under test, a simulated user that drives the conversation from a configured persona, and an evaluator that periodically scores the exchange. Scores accumulate in shared state and are aggregated into final metrics when the session completes.

When to Use

Use multi-agent evaluation when:

  • Systematic testing needed: Manual testing doesn't scale
  • Quality metrics required: Need objective, consistent scores
  • Multiple scenarios: Test agent with various user behaviors
  • Regression testing: Ensure changes don't degrade quality
  • A/B testing: Compare different agent implementations

Key Components

1. State Schema

python
from typing import Annotated
from typing_extensions import TypedDict
from langgraph.graph.message import add_messages
import operator

class EvaluationState(TypedDict):
    messages: Annotated[list, add_messages]              # Conversation history
    conversation: str                                     # Formatted for evaluator
    evaluator_scores: Annotated[list[dict], operator.add]  # Accumulated scores
    turn_count: int                                       # Current turn
    max_turns: int                                        # Maximum turns
    session_complete: bool                                # Done flag
    final_metrics: dict[str, float]                       # Aggregated metrics
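
Because evaluator_scores is annotated with operator.add, each node returns only its new scores and LangGraph concatenates them onto the accumulated list instead of overwriting it. A minimal sketch (the score values are illustrative):

python
def example_evaluator(state: EvaluationState):
    new_score = {
        "helpfulness": 4, "accuracy": 5, "empathy": 3,
        "efficiency": 4, "goal_completion": 1,
    }
    # The one-element list is merged into evaluator_scores by the
    # operator.add reducer, preserving earlier entries.
    return {"evaluator_scores": [new_score]}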

2. Simulated User Configuration

Define realistic user behavior:

python
from pydantic import BaseModel, Field
from typing import Literal

class SimulatedUser(BaseModel):
    persona: str = Field(
        description="User's background and situation"
    )
    goals: list[str] = Field(
        description="What the user wants to accomplish"
    )
    behavior: Literal["friendly", "impatient", "confused", "technical", "casual"] = "friendly"
    initial_message: str | None = None

# Example
user = SimulatedUser(
    persona="Frustrated customer with damaged product",
    goals=["Get refund", "Express dissatisfaction"],
    behavior="impatient",
    initial_message="My order arrived damaged and I want a refund!"
)

3. Evaluation Criteria

Structured scoring with Pydantic:

python
class EvaluationCriteria(BaseModel):
    helpfulness: int = Field(ge=1, le=5, description="How helpful? (1-5)")
    accuracy: int = Field(ge=1, le=5, description="How accurate? (1-5)")
    empathy: int = Field(ge=1, le=5, description="How empathetic? (1-5)")
    efficiency: int = Field(ge=1, le=5, description="How efficient? (1-5)")
    goal_completion: int = Field(ge=0, le=1, description="Goals met? (0/1)")
    reasoning: str = Field(description="Explanation of scores")
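
Because the fields carry ge/le bounds, Pydantic rejects out-of-range scores at parse time. For example:

python
# A valid instance; helpfulness=7 or goal_completion=2 would raise
# a pydantic ValidationError.
scores = EvaluationCriteria(
    helpfulness=4, accuracy=5, empathy=3, efficiency=4,
    goal_completion=1, reasoning="Resolved the issue, but tone was curt.",
)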

4. Simulated User Node

Creates user messages based on persona:

python
from langgraph_ollama_local.patterns.evaluation import create_simulated_user_node

user_config = SimulatedUser(
    persona="Tech-savvy customer",
    goals=["Troubleshoot issue"],
    behavior="technical"
)

user_node = create_simulated_user_node(llm, user_config)
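
Internally, such a node conditions the LLM on the persona and replies in character. A hypothetical sketch of what the factory might return (assuming llm is the chat client and the state carries the formatted conversation string):

python
from langchain_core.messages import HumanMessage

def simulated_user(state):
    # Prompt the LLM to act as the configured user, then record the
    # reply as the next human turn in the conversation.
    reply = llm.invoke(
        f"You are: {user_config.persona}. Goals: {user_config.goals}. "
        f"Behavior: {user_config.behavior}. Continue the conversation:\n"
        f"{state['conversation']}"
    )
    return {"messages": [HumanMessage(content=reply.content)]}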

5. Evaluator Node

Scores conversations objectively:

python
from langgraph_ollama_local.patterns.evaluation import create_evaluator_node

evaluator = create_evaluator_node(llm)
# Uses structured output to ensure consistent scoring

6. Graph Construction

Orchestrate agent, user, and evaluator:

python
from langgraph_ollama_local.patterns.evaluation import create_evaluation_graph

graph = create_evaluation_graph(
    llm,
    agent_node,              # Agent being tested
    user_config,             # Simulated user config
    evaluate_every_n_turns=2 # Score every 2 turns
)

Usage

Basic Evaluation

python
from langchain_core.messages import AIMessage

from langgraph_ollama_local import LocalAgentConfig
from langgraph_ollama_local.patterns.evaluation import (
    SimulatedUser,
    create_evaluation_graph,
    run_evaluation_session,
)

config = LocalAgentConfig()
llm = config.create_chat_client()

# Define agent to test
def my_agent(state):
    # Minimal stub for illustration; a real agent would call the LLM here
    return {"messages": [AIMessage(content="How can I help?")]}

# Configure simulated user
user = SimulatedUser(
    persona="Customer with billing question",
    goals=["Understand charge", "Get refund if incorrect"],
    behavior="friendly"
)

# Create and run evaluation
graph = create_evaluation_graph(llm, my_agent, user)
result = run_evaluation_session(graph, max_turns=10)

print(result["final_metrics"])
# {
#   'helpfulness_avg': 4.2,
#   'accuracy_avg': 4.0,
#   'empathy_avg': 4.5,
#   'efficiency_avg': 3.8,
#   'goal_completion_rate': 1.0,
#   'num_scores': 5
# }

Multiple Evaluation Sessions

Test with multiple runs for robustness:

python
from langgraph_ollama_local.patterns.evaluation import run_multiple_evaluations

results = run_multiple_evaluations(
    graph,
    num_sessions=5,    # Run 5 evaluation sessions
    max_turns=10
)

# Access aggregate metrics
print(results["aggregate_metrics"])
# {
#   'helpfulness_avg': 4.1,
#   'accuracy_avg': 4.2,
#   'empathy_avg': 4.3,
#   'efficiency_avg': 3.9,
#   'goal_completion_rate': 0.8,
#   'num_sessions': 5
# }

# Access individual session results
for i, session in enumerate(results["sessions"]):
    print(f"Session {i+1}: {session['final_metrics']}")

Testing Different Scenarios

python
# Scenario 1: Impatient customer
impatient_user = SimulatedUser(
    persona="Frustrated customer waiting 2 weeks for refund",
    goals=["Get immediate refund", "Express anger"],
    behavior="impatient"
)

# Scenario 2: Confused user
confused_user = SimulatedUser(
    persona="Elderly user not tech-savvy",
    goals=["Track order", "Understand website"],
    behavior="confused"
)

# Scenario 3: Technical user
tech_user = SimulatedUser(
    persona="Developer troubleshooting API",
    goals=["Debug API error", "Get technical details"],
    behavior="technical"
)

# Test all scenarios
for user_config in [impatient_user, confused_user, tech_user]:
    graph = create_evaluation_graph(llm, my_agent, user_config)
    result = run_evaluation_session(graph)
    print(f"Scenario: {user_config.persona}")
    print(f"Metrics: {result['final_metrics']}\n")

Custom Metrics Aggregation

python
from langgraph_ollama_local.patterns.evaluation import aggregate_scores

# Manually aggregate scores
scores = [
    {"helpfulness": 4, "accuracy": 5, "empathy": 3, "efficiency": 4, "goal_completion": 1},
    {"helpfulness": 5, "accuracy": 4, "empathy": 4, "efficiency": 5, "goal_completion": 1},
]

metrics = aggregate_scores(scores)
print(metrics)
# {
#   'helpfulness_avg': 4.5,
#   'accuracy_avg': 4.5,
#   'empathy_avg': 3.5,
#   'efficiency_avg': 4.5,
#   'goal_completion_rate': 1.0,
#   'num_scores': 2
# }

Evaluation Metrics

Score Dimensions

  • Helpfulness (1-5): How helpful were the agent's responses?
  • Accuracy (1-5): How factually correct was the information?
  • Empathy (1-5): How empathetic was the agent?
  • Efficiency (1-5): How concise and direct were the responses?
  • Goal Completion (0/1): Were the user's goals achieved?

Aggregated Metrics

From aggregate_scores() (a sketch of the computation appears after this list):

  • helpfulness_avg - Average helpfulness score
  • accuracy_avg - Average accuracy score
  • empathy_avg - Average empathy score
  • efficiency_avg - Average efficiency score
  • goal_completion_rate - Fraction of evaluations in which the user's goals were met
  • num_scores - Number of evaluations aggregated
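
The library computes these for you; as a mental model, a hypothetical reimplementation that reproduces the outputs shown on this page would be:

python
def aggregate_scores_sketch(scores: list[dict]) -> dict:
    """Hypothetical sketch: average the 1-5 dimensions, rate for goal_completion."""
    n = len(scores)
    if n == 0:
        return {}
    metrics = {
        f"{dim}_avg": sum(s[dim] for s in scores) / n
        for dim in ("helpfulness", "accuracy", "empathy", "efficiency")
    }
    metrics["goal_completion_rate"] = sum(s["goal_completion"] for s in scores) / n
    metrics["num_scores"] = n
    return metrics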

Best Practices

1. Diverse Test Scenarios

Test multiple user types:

python
scenarios = [
    ("friendly", "Happy customer with simple question"),
    ("impatient", "Frustrated customer demanding refund"),
    ("confused", "User struggling with technical interface"),
    ("technical", "Developer needing API documentation"),
]

for behavior, persona in scenarios:
    user = SimulatedUser(persona=persona, goals=[...], behavior=behavior)
    # Run evaluation

2. Clear User Goals

Define specific, measurable objectives:

python
# Good: Specific and measurable
goals = ["Get refund processed", "Receive confirmation number", "Understand timeline"]

# Bad: Vague
goals = ["Be satisfied", "Resolve issue"]

3. Multiple Evaluation Runs

Run 3-5 sessions per scenario for reliability:

python
results = run_multiple_evaluations(graph, num_sessions=5)
# More reliable than single run

4. Periodic Evaluation

Score conversations as they progress:

python
# Evaluate every 2 turns to track quality over time
graph = create_evaluation_graph(
    llm, agent, user,
    evaluate_every_n_turns=2  # Not just at the end
)
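
The periodic scores accumulate in the evaluator_scores state key, so you can inspect quality turn by turn. A sketch, assuming the session result surfaces that key alongside final_metrics:

python
result = run_evaluation_session(graph, max_turns=10)
# One entry per evaluation checkpoint; a downward trend suggests the
# agent degrades as the conversation gets longer.
for i, score in enumerate(result["evaluator_scores"]):
    print(f"Checkpoint {i + 1}: helpfulness={score['helpfulness']}")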

5. Track Improvements

Compare metrics before/after changes:

python
# Before changes
baseline_metrics = run_evaluation_session(graph_v1)["final_metrics"]

# After changes
improved_metrics = run_evaluation_session(graph_v2)["final_metrics"]

# Compare
for metric in ["helpfulness_avg", "empathy_avg"]:
    delta = improved_metrics[metric] - baseline_metrics[metric]
    print(f"{metric}: {delta:+.2f}")

Common Patterns

A/B Testing Agents

python
def compare_agents(agent_a, agent_b, user_config, num_sessions=5):
    """Compare two agent implementations."""

    graph_a = create_evaluation_graph(llm, agent_a, user_config)
    graph_b = create_evaluation_graph(llm, agent_b, user_config)

    results_a = run_multiple_evaluations(graph_a, num_sessions)
    results_b = run_multiple_evaluations(graph_b, num_sessions)

    print("Agent A vs Agent B:")
    for metric in ["helpfulness_avg", "accuracy_avg", "empathy_avg"]:
        a_score = results_a["aggregate_metrics"][metric]
        b_score = results_b["aggregate_metrics"][metric]
        print(f"{metric}: {a_score:.2f} vs {b_score:.2f}")

Regression Testing

python
def regression_test(agent, test_suite, threshold=4.0):
    """Ensure agent meets quality threshold."""

    all_passing = True

    for scenario_name, user_config in test_suite.items():
        graph = create_evaluation_graph(llm, agent, user_config)
        result = run_evaluation_session(graph)

        avg_score = (
            result["final_metrics"]["helpfulness_avg"] +
            result["final_metrics"]["accuracy_avg"]
        ) / 2

        passing = avg_score >= threshold
        all_passing = all_passing and passing

        print(f"{scenario_name}: {'PASS' if passing else 'FAIL'} ({avg_score:.2f})")

    return all_passing
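
A test suite is just a mapping from scenario name to user config; for example, reusing personas from earlier:

python
test_suite = {
    "impatient_refund": SimulatedUser(
        persona="Frustrated customer waiting 2 weeks for refund",
        goals=["Get immediate refund"],
        behavior="impatient",
    ),
    "confused_tracking": SimulatedUser(
        persona="Elderly user not tech-savvy",
        goals=["Track order"],
        behavior="confused",
    ),
}

assert regression_test(my_agent, test_suite, threshold=4.0)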

Custom Evaluator

Create domain-specific evaluators:

python
class TechnicalSupportCriteria(BaseModel):
    """Custom criteria for technical support."""
    troubleshooting_quality: int = Field(ge=1, le=5)
    technical_accuracy: int = Field(ge=1, le=5)
    documentation_provided: int = Field(ge=0, le=1)
    escalation_appropriate: int = Field(ge=0, le=1)
    reasoning: str

def create_tech_support_evaluator(llm):
    structured_llm = llm.with_structured_output(TechnicalSupportCriteria)

    def evaluator(state):
        # Custom evaluation logic
        ...

    return evaluator
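
The elided evaluator body depends on your graph wiring; assuming the state carries the formatted conversation string from EvaluationState above, a minimal sketch:

python
def evaluator(state):
    # Hypothetical sketch: score the formatted transcript, then return
    # the new score for the operator.add reducer to merge into state.
    result = structured_llm.invoke(
        f"Score this technical support conversation:\n{state['conversation']}"
    )
    return {"evaluator_scores": [result.model_dump()]}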

Common Pitfalls

  • Single test scenario: test multiple user behaviors and personas.
  • Vague user goals: define specific, measurable objectives.
  • Single evaluation run: run 3-5 sessions for statistical reliability.
  • Only end-of-conversation scoring: use periodic evaluation to track quality over time.
  • Ignoring edge cases: test impatient, confused, and difficult users.
  • No baseline comparison: track metrics over time to measure improvement.
  • max_turns set too low: allow enough turns for a realistic conversation.

Quiz

Test your understanding of multi-agent evaluation:

Knowledge Check

What is the primary purpose of the simulated user agent in multi-agent evaluation?

A. To completely replace human testers in all scenarios
B. To generate realistic user interactions for systematic, repeatable agent testing
C. To make agents run faster during evaluation
D. To reduce API costs by caching responses

Knowledge Check

Why should you run multiple evaluation sessions instead of just one?

A. To use more API credits for better results
B. To make the evaluation code more complex
C. To get statistically reliable results and account for LLM response variability
D. To test more features in a single run

Knowledge Check

What makes a good user goal for evaluation purposes?

A. Vague and open-ended goals like "Be satisfied"
B. Specific and measurable goals like "Get refund processed"
C. Technical and complex goals requiring domain expertise
D. Long and detailed goals covering many scenarios

Knowledge Check

What are the five evaluation metrics measured in the EvaluationCriteria schema?

A. Speed, accuracy, cost, latency, throughput
B. Helpfulness, accuracy, empathy, efficiency, goal_completion
C. Grammar, spelling, tone, length, formatting
D. Response time, token count, model version, temperature, context length

Knowledge Check

What is the recommended approach for testing agent quality across different user types?

A. Test only with friendly, cooperative users
B. Test with a single challenging user persona
C. Test with diverse scenarios including friendly, impatient, confused, and technical users
D. Skip user persona testing and focus only on task completion

End of Multi-Agent Patterns