How to evaluate an AI agent: build a golden-task eval harness from scratch
Agents without evals break silently on every prompt tweak. Build a small golden-task harness in plain Python — tasks, checks, scoring, and a regression gate you can run in CI.
Change a prompt, upgrade a model, add a tool — did the agent get better or worse? Without evals the honest answer is “nobody knows,” and teams ship on vibes until something visible breaks. The fix doesn’t start with an eval platform; it starts with golden tasks and a for-loop, which is what we’ll build. Platforms come later, when you outgrow this file.
Step 1 — What’s actually hard about evaluating agents
Classic ML evaluation compares outputs to labels. Agents produce trajectories — sequences of tool calls ending in an outcome — so you need two kinds of checks:
- Outcome checks: did the final answer contain/equal/achieve the right thing?
- Trajectory checks: did it get there acceptably — right tools, no forbidden calls, within the turn budget?
Both reduce to functions over a result object. No magic required.
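As a minimal sketch of that idea, the checks below are ordinary functions that take a result and return a list of failure strings. The RunResult dataclass and both function names are hypothetical, introduced only for illustration; the harness later in this guide passes a plain (answer, trajectory) tuple instead. Counting one trajectory entry as one turn is also an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """Hypothetical container for one agent run (illustration only)."""
    answer: str
    trajectory: list = field(default_factory=list)  # one dict per step


def check_outcome(result, expect_contains):
    """Outcome check: each expected substring must appear in the final answer."""
    return [
        f"answer missing {needle!r}"
        for needle in expect_contains
        if needle.lower() not in result.answer.lower()
    ]


def check_trajectory(result, expect_tools=(), forbid_tools=(), max_turns=None):
    """Trajectory check: required tools called, forbidden ones avoided, budget kept."""
    tools_used = [step["tool"] for step in result.trajectory if step.get("tool")]
    failures = [f"never called {t!r}" for t in expect_tools if t not in tools_used]
    failures += [f"called forbidden tool {t!r}" for t in forbid_tools if t in tools_used]
    # Assumption: each trajectory entry counts as one turn.
    if max_turns is not None and len(result.trajectory) > max_turns:
        failures.append(f"used {len(result.trajectory)} turns, budget is {max_turns}")
    return failures


result = RunResult(
    answer="Zurich right now: 18°C and clear.",
    trajectory=[{"tool": "get_weather", "result": "18°C and clear in Zurich"}],
)
print(check_outcome(result, ["Zurich", "18°C"]))
print(check_trajectory(result, expect_tools=["get_weather"], max_turns=4))
```

Returning failure lists rather than booleans keeps the reason for a failure attached to the result, which is what makes a red CI run readable.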
Step 2 — Golden tasks as data
Create evals.py. Tasks are data, not code — you’ll add dozens over time:
GOLDEN_TASKS = [
    {
        "id": "weather-simple",
        "prompt": "What's it like in Zurich right now?",
        "expect_contains": ["Zurich", "18°C"],
        "expect_tools": ["get_weather"],
        "max_turns": 4,
    },
    {
        "id": "no-tool-needed",
        "prompt": "Say hello.",
        "expect_contains": ["hello"],
        "forbid_tools": ["get_weather", "get_time"],
        "max_turns": 2,
    },
]
The second task matters more than the first: agents fail by doing too much as often as too little. Always include “should NOT act” cases.
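Because tasks are plain data, a typo in a key fails silently: a misspelled forbid_tools is simply never checked. One possible guard is a small validator run before any agent is called. The sketch below assumes the six keys used above are the whole schema; validate_tasks, ALLOWED_KEYS, and REQUIRED_KEYS are hypothetical names.

```python
ALLOWED_KEYS = {"id", "prompt", "expect_contains", "expect_tools", "forbid_tools", "max_turns"}
REQUIRED_KEYS = {"id", "prompt", "max_turns"}


def validate_tasks(tasks):
    """Return a list of problems found in the task data; empty means well-formed."""
    problems = []
    seen_ids = set()
    for index, task in enumerate(tasks):
        label = task.get("id", f"<task #{index}>")
        for key in sorted(REQUIRED_KEYS - set(task)):
            problems.append(f"{label}: missing required key {key!r}")
        for key in sorted(set(task) - ALLOWED_KEYS):
            problems.append(f"{label}: unknown key {key!r}")
        if label in seen_ids:
            problems.append(f"{label}: duplicate id")
        seen_ids.add(label)
    return problems


# A task with a misspelled key: the forbidden-tool check would never run.
typo_task = {"id": "no-tool-needed", "prompt": "Say hello.", "forbid_tool": ["get_weather"], "max_turns": 2}
print(validate_tasks([typo_task]))
```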
Step 3 — The harness
The harness runs each task against your agent and applies the checks. It needs the agent to return both the answer and the trajectory; the history list from our agent-loop tutorial is exactly that:
import sys


def run_task(agent_fn, task):
    """Run one golden task; return a list of failure messages (empty means pass)."""
    # Pass the task's turn budget through to the agent.
    answer, trajectory = agent_fn(task["prompt"], max_turns=task["max_turns"])
    # Names of the tools the agent called, in order.
    tools_used = [step["tool"] for step in trajectory if step.get("tool")]
    failures = []
    # Outcome check: case-insensitive substring match on the final answer.
    for needle in task.get("expect_contains", []):
        if needle.lower() not in answer.lower():
            failures.append(f"answer missing {needle!r}")
    # Trajectory checks: required tools were called, forbidden ones were not.
    for tool in task.get("expect_tools", []):
        if tool not in tools_used:
            failures.append(f"never called {tool!r}")
    for tool in task.get("forbid_tools", []):
        if tool in tools_used:
            failures.append(f"called forbidden tool {tool!r}")
    return failures


def main(agent_fn):
    """Run every golden task, print a report, and exit non-zero on any failure."""
    results = {}
    for task in GOLDEN_TASKS:
        failures = run_task(agent_fn, task)
        results[task["id"]] = failures
        mark = "PASS" if not failures else "FAIL: " + "; ".join(failures)
        print(f"{task['id']:<20} {mark}")
    passed = sum(1 for f in results.values() if not f)
    print(f"\n{passed}/{len(results)} tasks passed")
    # A non-zero exit code is what lets CI block the build.
    sys.exit(0 if passed == len(results) else 1)
That sys.exit is the whole point: an eval suite that can’t fail a CI
build is a dashboard, not a gate.
Step 4 — Wire in an agent and run it
For the tutorial, adapt the mock agent from the agent-loop guide to return
(answer, trajectory):
def demo_agent(prompt, max_turns=10):
    # Deterministic stand-in with the same interface your real agent needs.
    if "zurich" in prompt.lower():
        trajectory = [{"tool": "get_weather", "result": "18°C and clear in Zurich"}]
        return "Zurich right now: 18°C and clear.", trajectory
    return "Well hello there!", []


if __name__ == "__main__":
    main(demo_agent)
python evals.py
Both tasks pass, exit code 0. Now break it on purpose — make demo_agent
call get_weather for the hello task — and watch the forbidden-tool check
catch it. An eval you’ve never seen fail is an eval you can’t trust.
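One way to make that experiment repeatable is a tiny self-test that feeds the check a deliberately broken agent. The sketch below is standalone, so it carries a trimmed copy of the forbidden-tool check rather than importing evals.py; broken_agent and forbidden_tool_failures are hypothetical names.

```python
def broken_agent(prompt, max_turns=10):
    """Deliberately wrong: calls get_weather even when the prompt needs no tool."""
    trajectory = [{"tool": "get_weather", "result": "18°C and clear in Zurich"}]
    return "Well hello there!", trajectory


def forbidden_tool_failures(agent_fn, task):
    """Trimmed copy of the harness's forbidden-tool check, for a standalone demo."""
    _answer, trajectory = agent_fn(task["prompt"], max_turns=task["max_turns"])
    tools_used = [step["tool"] for step in trajectory if step.get("tool")]
    return [
        f"called forbidden tool {tool!r}"
        for tool in task.get("forbid_tools", [])
        if tool in tools_used
    ]


hello_task = {
    "id": "no-tool-needed",
    "prompt": "Say hello.",
    "forbid_tools": ["get_weather", "get_time"],
    "max_turns": 2,
}

failures = forbidden_tool_failures(broken_agent, hello_task)
print(failures)
```

If the printed list is ever empty for this agent, the check itself has regressed, which is worth knowing before you trust a green run.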
Step 5 — Where this grows next
- Nondeterminism: run each task N times with a real model; report pass rates, and set thresholds per task (“must pass 4/5 runs”).
- Fuzzy outcomes: when substring checks get too brittle, add an LLM-as-judge check, but keep deterministic checks for everything they can cover; judges drift, while a plain substring check with Python’s in operator doesn’t.
- Coverage: every production incident becomes a golden task. Your suite should read like a history of everything that ever went wrong.
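The "must pass 4/5 runs" idea from the nondeterminism item can be sketched as a small helper. This is an illustration under assumptions, not part of the harness above: meets_threshold is a hypothetical name, and run_once stands for any zero-argument callable that runs one task once and returns True on a pass.

```python
import random


def meets_threshold(run_once, n_runs=5, min_passes=4):
    """Run a check n_runs times; return (gate_ok, number_of_passes)."""
    passes = sum(1 for _ in range(n_runs) if run_once())
    return passes >= min_passes, passes


# Stand-in for a flaky task: a seeded RNG that passes roughly 90% of the time.
rng = random.Random(0)


def flaky_task():
    return rng.random() < 0.9


ok, passes = meets_threshold(flaky_task, n_runs=5, min_passes=4)
print(f"{passes}/5 runs passed, gate {'open' if ok else 'closed'}")
```

With a real model, run_once would wrap a call such as run_task and return whether the failure list came back empty; keeping the threshold per task lets strict tasks demand 5/5 while known-flaky ones tolerate a miss.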
Troubleshooting
Tasks pass locally, fail in CI (or vice versa)
Hunt down nondeterminism: model temperature, wall-clock-dependent tools, or shared state between tasks. Evals must construct a fresh agent per task — reused sessions leak context and make results order-dependent.
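A minimal sketch of the fresh-agent rule, assuming your agent is an object you can construct cheaply: pass the harness a factory rather than an instance. StatefulAgent and run_all are hypothetical names; the agent's answer deliberately exposes how much earlier context it has seen so the leak is visible.

```python
class StatefulAgent:
    """Hypothetical agent that keeps conversation history between calls."""

    def __init__(self):
        self.history = []

    def __call__(self, prompt, max_turns=10):
        self.history.append(prompt)
        # The answer reveals how many prompts this instance has seen so far.
        return f"seen {len(self.history)} prompt(s)", []


def run_all(make_agent, prompts):
    """Build a new agent for every prompt so no state carries over."""
    answers = []
    for prompt in prompts:
        agent = make_agent()  # called once per task
        answer, _trajectory = agent(prompt)
        answers.append(answer)
    return answers


prompts = ["Say hello.", "What's it like in Zurich right now?"]

shared = StatefulAgent()
reused_answers = run_all(lambda: shared, prompts)  # same session every time
fresh_answers = run_all(StatefulAgent, prompts)    # new agent per task
print(reused_answers)
print(fresh_answers)
```

In the reused case the second answer depends on the first task having run, which is exactly the order dependence that makes local and CI results disagree.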
Every prompt change breaks half the expect_contains checks
Your checks are testing phrasing, not outcomes. Assert on facts (“18°C”), IDs, or tool effects — never on sentence structure. If a check can be broken by a synonym, it’s too tight.
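When even a fact like "18°C" can be written several ways, one option is a check that matches the value rather than the exact string. The sketch below is an assumption about which spellings you want to accept, not a general temperature parser; mentions_temperature is a hypothetical name.

```python
import re


def mentions_temperature(answer, celsius):
    """True if the answer states the given Celsius value, whatever the phrasing."""
    # Accepts "18°C", "18 °C", "18 C", "18 degrees Celsius".
    # The lookbehind rejects longer numbers such as "118°C".
    pattern = rf"(?<!\d){celsius}\s*(?:°\s*C|degrees\s+Celsius|C)\b"
    return re.search(pattern, answer, flags=re.IGNORECASE) is not None


print(mentions_temperature("Zurich right now: 18°C and clear.", 18))
print(mentions_temperature("It's currently 18 degrees Celsius in Zurich.", 18))
print(mentions_temperature("The oven is at 118°C.", 18))
```

This stays deterministic, so it belongs with the substring checks rather than with a judge.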