# Evaluation Framework
Simplified evaluation framework for testing voice agent scenarios.
## Architecture

```
┌───────────────────────────────────────────────────────────────┐
│                      CLI (run / submit)                       │
└─────────────────────────────┬─────────────────────────────────┘
                              │
          ┌───────────────────┴───────────────────┐
          ▼                                       ▼
┌───────────────────────┐             ┌───────────────────────┐
│    ScenarioRunner     │◀────────────│   ComparisonRunner    │
│    (single YAML)      │             │      (A/B tests)      │
└───────────┬───────────┘             └───────────────────────┘
            │                          Creates ScenarioRunner
            │                          per variant
            ▼
┌───────────────────────┐      ┌───────────────────────┐
│ EvalOrchestratorWrap  │─────▶│     EventRecorder     │
│   (event capture)     │      │    (JSONL writer)     │
└───────────┬───────────┘      └───────────────────────┘
            │
            ▼
┌───────────────────────┐      ┌───────────────────────┐
│     MetricsScorer     │─────▶│    FoundryExporter    │
│  (precision/recall)   │      │      (cloud eval)     │
└───────────────────────┘      └───────────────────────┘
```
## Core Components

| Component | File | Purpose |
|---|---|---|
| `ScenarioRunner` | `scenario_runner.py` | Loads YAML, runs turns, generates summary |
| `ComparisonRunner` | `scenario_runner.py` | Runs variants, generates comparison report |
| `EvaluationOrchestratorWrapper` | `wrappers.py` | Captures events during orchestration |
| `EventRecorder` | `recorder.py` | Writes turn events to JSONL |
| `MetricsScorer` | `scorer.py` | Computes tool precision/recall, latency |
| `ExpectationValidator` | `validator.py` | Validates turn results against expectations |
| `FoundryExporter` | `foundry_exporter.py` | Exports to Azure AI Foundry format |
## Quick Start

```shell
# Interactive evaluation menu (recommended for exploration)
make eval

# Run a single scenario with streaming output
make eval-run SCENARIO=tests/evaluation/scenarios/smoke/basic_identity_verification.yaml

# Or use Python directly
python tests/evaluation/run-eval-stream.py run --input tests/evaluation/scenarios/smoke/basic_identity_verification.yaml

# Run all session-based scenarios
make eval-session

# Run smoke tests (quick validation)
make eval-smoke

# Run A/B comparisons
make eval-ab
```
## CLI Reference

### Interactive CLI

Launch the interactive evaluation menu (`make eval`) to browse and run scenarios.

Features:

- Browse scenarios by category (smoke, session-based, A/B tests)
- View scenario details before running
- Quick-run previously executed scenarios
- View recent evaluation results
### `run` - Execute Scenarios

Runs a scenario or A/B comparison with streaming per-turn output.

```shell
python tests/evaluation/run-eval-stream.py run --input <yaml_file>

# Or via Makefile:
make eval-run SCENARIO=<yaml_file>
```

| Option | Description |
|---|---|
| `--input`, `-i` | Path to scenario YAML file (required) |
Examples:

```shell
# Run smoke test
make eval-run SCENARIO=tests/evaluation/scenarios/smoke/basic_identity_verification.yaml

# Run session-based scenario
python tests/evaluation/run-eval-stream.py run -i tests/evaluation/scenarios/session_based/banking_multi_agent.yaml

# Run A/B comparison
make eval-run SCENARIO=tests/evaluation/scenarios/ab_tests/fraud_detection_comparison.yaml
```
### `submit` - Upload to Azure AI Foundry

Submits evaluation results to Azure AI Foundry for cloud-based evaluation.

| Option | Description |
|---|---|
| `--data`, `-d` | Path to `foundry_eval.jsonl` or directory containing it (required) |
| `--config`, `-c` | Path to evaluator config (auto-detected if next to data file) |
| `--endpoint`, `-e` | Azure AI Foundry project endpoint |
| `--dataset-name` | Custom name for uploaded dataset |
| `--evaluation-name` | Custom name for evaluation run |
| `--model-deployment`, `-m` | Model for AI evaluators (default: `gpt-4o`) |

Example:

```shell
python tests/evaluation/foundry_exporter.py \
    --data runs/smoke_basic_identity_1737849600/foundry_eval.jsonl \
    --endpoint "https://your-project.api.azureml.ms"
```
## Scenario YAML Format

### Single Scenario

Test a single agent configuration with multiple conversation turns.

```yaml
scenario_name: smoke_basic_identity
description: Verify basic agent functionality

# Reference a pre-defined scenario from scenariostore (optional)
scenario_template: banking

# Define session configuration inline
session_config:
  agents:
    - BankingConcierge
  start_agent: BankingConcierge
  handoffs: []
  generic_handoff:
    enabled: false

# Conversation turns to execute
turns:
  - turn_id: turn_1
    user_input: "Hello, I need help with my account."
    expectations:
      tools_called: []
      response_constraints:
        must_include_any: ["help", "assist"]
  - turn_id: turn_2
    user_input: "My name is John Smith and my last four SSN is 1234."
    expectations:
      tools_called:
        - verify_client_identity

# Pass/fail thresholds
thresholds:
  min_tool_precision: 0.5
  min_tool_recall: 0.5
  max_latency_p95_ms: 15000
```
### A/B Comparison (ComparisonRunner)

Compare multiple model configurations using the same conversation turns. The ComparisonRunner executes each variant sequentially and generates a comparison report.

```yaml
comparison_name: gpt4o_vs_o3_banking
description: Compare GPT-4o vs GPT-5.1 for banking scenarios

# Reference scenario template for agent discovery
scenario_template: banking

# Define variants - each runs the same turns with different models
variants:
  - variant_id: gpt4o_baseline
    agent_overrides:
      - agent: BankingConcierge
        model_override:
          deployment_id: gpt-4o
          temperature: 0.6
          max_tokens: 200
      - agent: CardRecommendation
        model_override:
          deployment_id: gpt-4o
          temperature: 0.6
  - variant_id: gpt51_challenger
    agent_overrides:
      - agent: BankingConcierge
        model_override:
          deployment_id: gpt-5.1
          max_completion_tokens: 2000
          reasoning_effort: low
      - agent: CardRecommendation
        model_override:
          deployment_id: gpt-5.1
          reasoning_effort: medium

# Shared turns executed for each variant
turns:
  - turn_id: turn_1
    user_input: "I need to check a charge. My name is Alice Brown, SSN 1234."
    expectations:
      tools_called:
        - verify_client_identity
  - turn_id: turn_2
    user_input: "Can you suggest a better rewards card?"
    expectations:
      tools_called:
        - handoff_to_agent

# Metrics to compare across variants
comparison_metrics:
  - latency_p95_ms
  - tool_precision
  - tool_recall
  - grounded_span_ratio
  - cost_per_turn

# Thresholds apply to all variants
thresholds:
  min_tool_precision: 0.05
  min_tool_recall: 0.10
  max_latency_p95_ms: 20000
```
### Model Profiles (DRY Configuration)

Use `model_profiles` to define reusable configurations:

```yaml
model_profiles:
  gpt4o_fast:
    deployment_id: gpt-4o
    temperature: 0.6
    max_tokens: 200
  o3_reasoning:
    deployment_id: o3-mini
    reasoning_effort: low
    max_completion_tokens: 2000

variants:
  - variant_id: baseline
    model_profile: gpt4o_fast       # Applied to ALL agents
  - variant_id: challenger
    model_profile: o3_reasoning
    # Per-agent overrides merge on top of profile
    agent_overrides:
      - agent: InvestmentAdvisor
        model_override:
          reasoning_effort: medium  # Override just this field
```
### How ComparisonRunner Works

The ComparisonRunner orchestrates A/B tests:

1. **Load comparison YAML** - Parses variants, turns, and thresholds
2. **Resolve model profiles** - Expands `model_profile` references to `agent_overrides`
3. **For each variant:**
   - Creates a temporary scenario file
   - Instantiates a `ScenarioRunner` with variant-specific model overrides
   - Runs all turns and records events
   - Generates per-variant summary
4. **Compare results** - Aggregates metrics and determines winners per metric
5. **Output comparison report** - Saves `comparison.json` with detailed breakdown
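The per-metric winner selection in the comparison step can be sketched as below. This is an illustrative sketch, not the actual implementation: the metric names come from `comparison_metrics`, and the assumption that only latency and cost are lower-is-better is mine.

```python
# Assumption: these metrics win when smaller; all others are higher-is-better.
LOWER_IS_BETTER = {"latency_p95_ms", "cost_per_turn"}


def pick_winners(results: dict[str, dict[str, float]]) -> dict[str, str]:
    """Map each metric to the variant_id with the best value."""
    winners = {}
    metrics = next(iter(results.values())).keys()
    for metric in metrics:
        best = min if metric in LOWER_IS_BETTER else max
        winners[f"winner_{metric}"] = best(results, key=lambda v: results[v][metric])
    return winners


results = {
    "gpt4o_baseline":   {"latency_p95_ms": 1234, "tool_precision": 1.0},
    "gpt51_challenger": {"latency_p95_ms": 4100, "tool_precision": 0.9},
}
print(pick_winners(results))
```

With these sample numbers, `gpt4o_baseline` wins both metrics, matching the `winner_*` keys shown in the console output below.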
Output structure:

```
runs/<comparison_name>/
├── comparison.json                # Summary with winners per metric
├── gpt4o_baseline/
│   ├── <run_id>_events.jsonl
│   └── <run_id>/
│       ├── summary.json
│       └── session.json
└── gpt51_challenger/
    ├── <run_id>_events.jsonl
    └── <run_id>/
        ├── summary.json
        └── session.json
```
Console output:

```
======================================================================
📊 COMPARISON: gpt4o_vs_o3_banking
======================================================================

▶ gpt4o_baseline:
  Primary Model: gpt-4o
  Per-turn metrics:
    turn_1: BankingConcierge | 1234ms | expected=[verify_client_identity] actual=[verify_client_identity] ✓
    turn_2: BankingConcierge | 890ms | expected=[handoff_to_agent] actual=[handoff_to_agent] ✓
  Aggregated:
    Turns: 2
    Precision: 100.00%
    Recall: 100.00%
    Latency P50/P95: 1062ms / 1234ms
    Cost/turn: $0.0023

▶ gpt51_challenger:
  ...

🏆 Winners:
  winner_latency_p95_ms: gpt4o_baseline
  winner_tool_precision: gpt4o_baseline
  winner_cost_per_turn: gpt4o_baseline

📁 Results: runs/gpt4o_vs_o3_banking/comparison.json
======================================================================
```
## Expectations

Define expected behavior for each turn:

| Field | Description |
|---|---|
| `tools_called` | List of tools that MUST be called |
| `tools_optional` | Tools that MAY be called (no penalty if missing) |
| `response_constraints.must_include_any` | Response must contain at least one of these strings |
| `response_constraints.must_not_include` | Response must NOT contain any of these strings |
| `response_constraints.latency_threshold_ms` | Maximum allowed latency in milliseconds |
Example:

```yaml
turns:
  - turn_id: turn_1
    user_input: "Check my balance"
    expectations:
      tools_called:
        - verify_client_identity
        - get_account_balance
      tools_optional:
        - get_user_profile
      response_constraints:
        must_include_any: ["balance", "account"]
        must_not_include: ["error", "sorry"]
        latency_threshold_ms: 5000
```
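The constraint checks can be sketched as a small pure function. This is a hypothetical helper for illustration; the real validation logic lives in `validator.py`.

```python
def check_constraints(response: str, latency_ms: float, constraints: dict) -> list[str]:
    """Return a list of human-readable failures (empty means the turn passes)."""
    failures = []
    text = response.lower()

    # At least one of must_include_any must appear in the response.
    include_any = constraints.get("must_include_any", [])
    if include_any and not any(s.lower() in text for s in include_any):
        failures.append(f"response missing all of {include_any}")

    # None of must_not_include may appear.
    for s in constraints.get("must_not_include", []):
        if s.lower() in text:
            failures.append(f"response contains forbidden string {s!r}")

    # Latency must stay under the threshold, if one is set.
    limit = constraints.get("latency_threshold_ms")
    if limit is not None and latency_ms > limit:
        failures.append(f"latency {latency_ms}ms exceeds {limit}ms")
    return failures


constraints = {
    "must_include_any": ["balance", "account"],
    "must_not_include": ["error", "sorry"],
    "latency_threshold_ms": 5000,
}
print(check_constraints("Your account balance is $120.", 1800, constraints))
```

A passing turn yields an empty list; a response like "Sorry, error occurred" at 6000ms would fail all three checks.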
## Thresholds

Set pass/fail criteria for scenarios:

```yaml
thresholds:
  min_tool_precision: 0.8     # Called tools must be expected
  min_tool_recall: 0.8        # Expected tools must be called
  min_grounded_ratio: 0.5     # Response grounded in tool results
  max_latency_p95_ms: 10000   # 95th percentile latency limit
```
## Output Files

After running a scenario:

```
runs/<scenario_name>/
├── <run_id>_events.jsonl      # Raw turn events (JSONL format)
└── <run_id>/
    ├── summary.json           # Aggregated metrics
    ├── session.json           # Session manifest
    └── foundry_eval.jsonl     # Foundry-format data (if configured)
```
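Because the events file is plain JSONL (one JSON object per line), it is easy to inspect ad hoc. The field names used below (`turn_id`, `latency_ms`) are assumptions about the recorder's schema, and the inline string stands in for a real `runs/<scenario>/<run_id>_events.jsonl` file.

```python
import io
import json
import math

# Hypothetical events content standing in for a real <run_id>_events.jsonl file.
events_jsonl = io.StringIO(
    '{"turn_id": "turn_1", "latency_ms": 1234}\n'
    '{"turn_id": "turn_2", "latency_ms": 890}\n'
)

latencies = sorted(json.loads(line)["latency_ms"] for line in events_jsonl)

# Nearest-rank p95 over the recorded turn latencies.
p95 = latencies[max(0, math.ceil(0.95 * len(latencies)) - 1)]
print(p95)  # 1234
```

With only two turns, p95 is simply the slowest turn; over a longer run it approximates the `max_latency_p95_ms` value checked against thresholds.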
## Azure AI Foundry Integration

Enable cloud-based evaluation with AI evaluators:

```yaml
foundry_export:
  enabled: true
  output_filename: foundry_eval.jsonl
  context_source: evidence   # Use tool results as context
  evaluators:
    - id: builtin.relevance
      init_params:
        deployment_name: gpt-4o
      data_mapping:
        query: "${data.query}"
        response: "${data.response}"
        context: "${data.context}"
    - id: builtin.coherence
      init_params:
        deployment_name: gpt-4o
```
Then submit:

```shell
python tests/evaluation/foundry_exporter.py \
    --data runs/my_scenario/ \
    --endpoint "https://your-project.api.azureml.ms"
```
## Running with pytest (End-to-End)

The pytest-based runner provides end-to-end evaluation with built-in Foundry submission support.

### pytest CLI Options

| Option | Description |
|---|---|
| `--submit-to-foundry` | Submit results to Azure AI Foundry after running |
| `--foundry-endpoint` | Azure AI Foundry project endpoint (overrides env var) |
| `--eval-output-dir` | Output directory for results (default: `runs/`) |
| `--eval-model` | Model deployment for AI-based Foundry evaluators (default: `gpt-4o`) |
### Basic Usage

```shell
# Run all evaluation-marked tests
pytest tests/evaluation/test_scenarios.py -v -m evaluation

# Run all session-based scenario tests
pytest tests/evaluation/test_scenarios.py::test_session_scenario_e2e -v

# Run specific scenario by name
pytest tests/evaluation/test_scenarios.py -k "fraud_detection" -v

# Skip slow tests (use existing data only)
pytest tests/evaluation/test_scenarios.py -m "not slow"
```
### Running with Foundry Submission

```shell
# Run A/B tests and submit results to Foundry
pytest tests/evaluation/test_scenarios.py::test_ab_comparison_e2e \
    --submit-to-foundry \
    --foundry-endpoint "https://your-project.api.azureml.ms" \
    -v

# Or set endpoint via environment variable
export AZURE_AI_FOUNDRY_PROJECT_ENDPOINT="https://your-project.api.azureml.ms"
pytest tests/evaluation/test_scenarios.py --submit-to-foundry -v

# Run session scenarios with Foundry submission
pytest tests/evaluation/test_scenarios.py::test_session_scenario_e2e \
    --submit-to-foundry \
    --foundry-endpoint "https://your-project.api.azureml.ms" \
    --eval-model gpt-4o \
    -v
```
### Test Functions

| Test | Description | Markers |
|---|---|---|
| `test_ab_comparison_e2e` | Full A/B comparison with validation and thresholds | `evaluation`, `slow` |
| `test_session_scenario_e2e` | Session-based multi-agent scenarios | `evaluation`, `slow` |
| `test_expectations_from_existing_data` | Fast validation on existing A/B data | `evaluation` |
| `test_session_expectations_from_existing_data` | Fast validation on existing session data | `evaluation` |
| `TestEvaluationMetrics` | Threshold checks on existing A/B comparison data | `evaluation` |
| `TestSessionMetrics` | Threshold checks on existing session data | `evaluation` |

Currently discovered scenarios:

- A/B Tests: `fraud_detection_comparison`
- Session-based: `all_agents_discovery`, `banking_multi_agent`
### E2E Test Workflow

Each E2E test (`test_ab_comparison_e2e`, `test_session_scenario_e2e`) follows this workflow:

1. **Run scenario/comparison** - Executes all turns against live agents
2. **Validate expectations** - Checks tools called, handoffs, response constraints
3. **Submit to Foundry** (if `--submit-to-foundry`) - Uploads data BEFORE assertions
4. **Assert expectations pass** - Fails the test if any turn expectations fail
5. **Assert metric thresholds** - Validates precision, recall, latency, groundedness
### Environment Variable Overrides

Override default thresholds via environment variables:

```shell
# Set custom thresholds
export EVAL_MIN_PRECISION=0.8
export EVAL_MIN_RECALL=0.7
export EVAL_MAX_LATENCY_MS=5000
export EVAL_MIN_GROUNDED=0.5

pytest tests/evaluation/test_scenarios.py -v
```
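The override pattern is a plain environment lookup with a fallback default. A minimal sketch, assuming the env_float helper and the default values shown here (the real defaults live in the test module):

```python
import os


def env_float(name: str, default: float) -> float:
    """Read a threshold override from the environment, with a fallback default."""
    return float(os.environ.get(name, default))


# Simulate `export EVAL_MIN_PRECISION=0.8`; ensure the other var is unset.
os.environ["EVAL_MIN_PRECISION"] = "0.8"
os.environ.pop("EVAL_MAX_LATENCY_MS", None)

min_precision = env_float("EVAL_MIN_PRECISION", 0.5)      # overridden -> 0.8
max_latency_ms = env_float("EVAL_MAX_LATENCY_MS", 15000)  # not set -> 15000.0
print(min_precision, max_latency_ms)  # 0.8 15000.0
```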
### Fast Iteration Mode

Use expectation-only tests to iterate quickly without re-running scenarios:

```shell
# First, run full E2E to generate data
pytest tests/evaluation/test_scenarios.py::test_ab_comparison_e2e -k "fraud_detection" -v

# Then iterate on expectations using existing data (fast)
pytest tests/evaluation/test_scenarios.py::test_expectations_from_existing_data -k "fraud_detection" -v
```
### Sample Output with Foundry Submission

```
$ pytest tests/evaluation/test_scenarios.py::test_ab_comparison_e2e \
    --submit-to-foundry \
    --foundry-endpoint "https://my-project.api.azureml.ms" \
    -v

============================= test session starts ==============================
collected 1 item

test_scenarios.py::test_ab_comparison_e2e[fraud_detection_comparison]

🚀 Running E2E A/B comparison: fraud_detection_comparison.yaml

▶ gpt4o_baseline:
  Per-turn metrics:
    turn_1: FraudDetection | 1234ms | expected=[verify_client_identity] actual=[verify_client_identity] ✓
    turn_2: FraudDetection | 890ms | expected=[check_fraud_alert] actual=[check_fraud_alert] ✓

📤 Submitting gpt4o_baseline to Foundry: runs/fraud_detection_comparison/.../foundry_eval.jsonl
✅ Foundry submission complete for gpt4o_baseline
🔗 View in portal: https://ai.azure.com/project/.../evaluation/...

▶ gpt51_challenger:
  ...

✅ All variants pass thresholds

PASSED
```
## Troubleshooting

**No agents discovered:**

- Ensure `scenario_template` references a valid scenario in `scenariostore/`
- Or define the `session_config.agents` list explicitly

**Model override not applied:**

- Check that `agent_overrides` uses the correct agent names (case-sensitive)
- Verify that the `deployment_id` exists in your Azure OpenAI resource
**Foundry submission fails:**

The test reads the Foundry endpoint from these sources (in order):

1. `--foundry-endpoint` CLI option
2. `AZURE_AI_FOUNDRY_PROJECT_ENDPOINT` environment variable
3. App Configuration (via the `azure/ai-foundry/project-endpoint` key)

To configure:

```shell
# Option A: Add to .env.local (recommended - uses existing App Config pattern)
echo 'AZURE_AI_FOUNDRY_PROJECT_ENDPOINT=https://your-project.api.azureml.ms' >> .env.local

# Option B: Set environment variable directly
export AZURE_AI_FOUNDRY_PROJECT_ENDPOINT="https://your-project.api.azureml.ms"

# Option C: Pass via CLI
pytest tests/evaluation/test_scenarios.py --submit-to-foundry --foundry-endpoint "https://your-project.api.azureml.ms"
```

Find your endpoint in the Azure Portal under Azure AI Foundry > Your Project > Settings > Properties.
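The three-step resolution order can be sketched as a simple fallback chain. This is a hypothetical helper, with the App Configuration lookup represented by a stub callable:

```python
import os


def resolve_endpoint(cli_endpoint=None, app_config_lookup=lambda key: None):
    """Resolve the Foundry endpoint: CLI flag, then env var, then App Configuration."""
    return (
        cli_endpoint
        or os.environ.get("AZURE_AI_FOUNDRY_PROJECT_ENDPOINT")
        or app_config_lookup("azure/ai-foundry/project-endpoint")
    )


os.environ.pop("AZURE_AI_FOUNDRY_PROJECT_ENDPOINT", None)

# CLI option wins over everything else.
print(resolve_endpoint("https://cli.api.azureml.ms"))

# With no CLI flag and no env var, the App Configuration value is used.
print(resolve_endpoint(None, lambda key: "https://appcfg.api.azureml.ms"))
```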
**`pytest --submit-to-foundry` fails with a missing endpoint:**

- The test fails loudly if `--submit-to-foundry` is used without a valid endpoint
- Add the endpoint to `.env.local` or pass the `--foundry-endpoint` argument
- Endpoint format: `https://<project>.api.azureml.ms` or `https://<region>.api.azureml.ms`

**No `studio_url` in the Foundry result:**

- The storage account must be linked to the Azure AI Foundry project
- Check that project permissions include the "Azure AI Developer" role