How to evaluate an AI agent: build a golden-task eval harness from scratch

Change a prompt, upgrade a model, add a tool — did the agent get better or worse? Without evals the honest answer is “nobody knows,” and teams ship on vibes until something visible breaks. The fix doesn’t start with an eval platform; it starts with golden tasks and a for-loop, which is what we’ll build. Platforms come later, when you outgrow this file.

Step 1 — What’s actually hard about evaluating agents

Classic ML evaluation compares outputs to labels. Agents produce trajectories — sequences of tool calls ending in an outcome — so you need two kinds of checks:

Outcome checks: did the final answer contain/equal/achieve the right thing?
Trajectory checks: did it get there acceptably — right tools, no forbidden calls, within the turn budget?

Both reduce to functions over a result object. No magic required.
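As a minimal sketch of that idea, the two kinds of checks can be written as plain functions over a result object. The RunResult dataclass and both function names here are illustrative assumptions, not part of any library, and the harness built later in this guide uses a simpler (answer, trajectory) tuple instead.

```python
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """Hypothetical container for one agent run (the name is an assumption)."""
    answer: str
    tools_used: list = field(default_factory=list)
    turns: int = 0


def outcome_ok(result, expect_contains):
    # Outcome check: every expected substring appears in the final answer.
    text = result.answer.lower()
    return all(needle.lower() in text for needle in expect_contains)


def trajectory_ok(result, expect_tools=(), forbid_tools=(), max_turns=None):
    # Trajectory check: required tools called, forbidden tools avoided,
    # and the run stayed within the turn budget (if one is given).
    used = set(result.tools_used)
    if any(tool not in used for tool in expect_tools):
        return False
    if any(tool in used for tool in forbid_tools):
        return False
    return max_turns is None or result.turns <= max_turns


result = RunResult(answer="Zurich right now: 18°C and clear.",
                   tools_used=["get_weather"], turns=2)
print(outcome_ok(result, ["Zurich", "18°C"]))
print(trajectory_ok(result, expect_tools=["get_weather"], max_turns=4))
```

Both calls print True for this result. Returning a bare boolean keeps the sketch short; returning failure messages, as the harness does, is more useful once a check starts failing.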

Step 2 — Golden tasks as data

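Create evals.py. Tasks are data, not code, because you’ll add dozens over time. Note that the “18°C” expectation below only holds if get_weather is stubbed to return a fixed reading; against a live weather API, assert on something stable instead.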

GOLDEN_TASKS = [
    {
        "id": "weather-simple",
        "prompt": "What's it like in Zurich right now?",
        "expect_contains": ["Zurich", "18°C"],
        "expect_tools": ["get_weather"],
        "max_turns": 4,
    },
    {
        "id": "no-tool-needed",
        "prompt": "Say hello.",
        "expect_contains": ["hello"],
        "forbid_tools": ["get_weather", "get_time"],
        "max_turns": 2,
    },
]

The second task matters more than the first: agents fail by doing too much as often as too little. Always include “should NOT act” cases.
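Because tasks are plain dicts, a typo in a key (say forbid_tool instead of forbid_tools) silently disables a check, and a “should NOT act” case is exactly the kind that would then pass for the wrong reason. One possible guard is a small validator run before the suite. This is a sketch that assumes the six keys used above are the only valid ones; ALLOWED_KEYS and validate_tasks are hypothetical names.

```python
# Keys the harness understands; anything else is treated as a probable typo.
ALLOWED_KEYS = {"id", "prompt", "expect_contains", "expect_tools",
                "forbid_tools", "max_turns"}


def validate_tasks(tasks):
    """Return a list of problems found in the task definitions."""
    problems = []
    seen_ids = set()
    for index, task in enumerate(tasks):
        task_id = task.get("id", f"<task #{index}>")
        if "id" not in task or "prompt" not in task:
            problems.append(f"{task_id}: missing 'id' or 'prompt'")
        if task_id in seen_ids:
            problems.append(f"{task_id}: duplicate id")
        seen_ids.add(task_id)
        for key in sorted(set(task) - ALLOWED_KEYS):
            problems.append(f"{task_id}: unknown key {key!r}")
    return problems


sample = [
    # Typo on purpose: "forbid_tool" instead of "forbid_tools".
    {"id": "no-tool-needed", "prompt": "Say hello.",
     "forbid_tool": ["get_weather"]},
]
print(validate_tasks(sample))
```

An empty list means the definitions are well-formed; anything else is worth failing the run over before any agent is called.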

Step 3 — The harness

The harness runs each task against your agent and applies the checks. It needs the agent to return both the answer and the trajectory — our agent-loop tutorial’s history list is exactly that:

import sys

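def run_task(agent_fn, task):
    """Run one golden task; return a list of failure messages (empty = pass)."""
    answer, trajectory = agent_fn(task["prompt"], max_turns=task["max_turns"])
    # Tool names in call order; steps without a "tool" entry are skipped.
    tools_used = [step["tool"] for step in trajectory if step.get("tool")]
    failures = []

    # Outcome check: case-insensitive substring match on the final answer.
    for needle in task.get("expect_contains", []):
        if needle.lower() not in answer.lower():
            failures.append(f"answer missing {needle!r}")
    # Trajectory checks: required tools were called, forbidden ones were not.
    for tool in task.get("expect_tools", []):
        if tool not in tools_used:
            failures.append(f"never called {tool!r}")
    for tool in task.get("forbid_tools", []):
        if tool in tools_used:
            failures.append(f"called forbidden tool {tool!r}")

    return failures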

def main(agent_fn):
    results = {}
    for task in GOLDEN_TASKS:
        failures = run_task(agent_fn, task)
        results[task["id"]] = failures
        mark = "PASS" if not failures else "FAIL: " + "; ".join(failures)
        print(f"{task['id']:<20} {mark}")

    passed = sum(1 for f in results.values() if not f)
    print(f"\n{passed}/{len(results)} tasks passed")
    sys.exit(0 if passed == len(results) else 1)

That sys.exit is the whole point: an eval suite that can’t fail a CI build is a dashboard, not a gate.
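One gap worth noting in the harness as written: run_task passes max_turns to the agent but never verifies that the budget was respected, so the gate trusts the agent to police itself. A possible extra check is sketched here. It assumes each trajectory entry corresponds to one turn, which is an assumption about your agent's trajectory format, and check_turn_budget is a hypothetical helper name.

```python
def check_turn_budget(trajectory, max_turns):
    """Return a failure message if the run exceeded its budget, else None.

    Assumes one trajectory entry per turn; adjust the count if your agent
    records several entries per turn.
    """
    turns = len(trajectory)
    if turns > max_turns:
        return f"used {turns} turns, budget was {max_turns}"
    return None


# A five-step trajectory against a budget of four turns.
trajectory = [{"tool": "get_weather"}] * 5
print(check_turn_budget(trajectory, max_turns=4))
```

Inside run_task, the returned message (when not None) would be appended to failures alongside the other checks.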

Step 4 — Wire in an agent and run it

For the tutorial, adapt the mock agent from the agent-loop guide to return (answer, trajectory):

def demo_agent(prompt, max_turns=10):
    # Deterministic stand-in with the same interface your real agent needs.
    if "zurich" in prompt.lower():
        trajectory = [{"tool": "get_weather", "result": "18°C and clear in Zurich"}]
        return "Zurich right now: 18°C and clear.", trajectory
    return "Well hello there!", []

if __name__ == "__main__":
    main(demo_agent)

python evals.py

Both tasks pass, exit code 0. Now break it on purpose — make demo_agent call get_weather for the hello task — and watch the forbidden-tool check catch it. An eval you’ve never seen fail is an eval you can’t trust.
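A deliberately broken agent might look like the sketch below. In evals.py you would simply pass broken_demo_agent to main; so that this snippet runs on its own, it repeats the forbidden-tool logic in a small helper. Both names are hypothetical.

```python
def forbidden_tool_failures(trajectory, forbid_tools):
    # Same logic as the forbidden-tool loop in run_task, isolated for the demo.
    tools_used = [step["tool"] for step in trajectory if step.get("tool")]
    return [f"called forbidden tool {tool!r}"
            for tool in forbid_tools if tool in tools_used]


def broken_demo_agent(prompt, max_turns=10):
    # Deliberately wrong: calls get_weather even for a plain greeting.
    trajectory = [{"tool": "get_weather", "result": "18°C and clear in Zurich"}]
    return "Well hello there!", trajectory


answer, trajectory = broken_demo_agent("Say hello.")
print(forbidden_tool_failures(trajectory, ["get_weather", "get_time"]))
```

The answer still contains “hello”, so an outcome-only suite would pass this agent. Only the trajectory check reports the problem.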

Step 5 — Where this grows next

Nondeterminism: run each task N times with a real model; report pass rates, and set thresholds per task (“must pass 4/5 runs”).
Fuzzy outcomes: when substring checks get too brittle, add an LLM-as-judge check, but keep deterministic checks for everything they can cover; judges drift, Python’s in operator doesn’t.
Coverage: every production incident becomes a golden task. Your suite should read like a history of everything that ever went wrong.
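The pass-rate idea from the nondeterminism item could be sketched as follows. The helper names are hypothetical, and run_once stands in for "run the task once and report whether it had zero failures"; here it is simulated with a seeded random number generator rather than a real model.

```python
import random


def pass_rate(run_once, n_runs):
    """Call a zero-argument check n_runs times; return the number of passes."""
    return sum(1 for _ in range(n_runs) if run_once())


def meets_threshold(run_once, n_runs=5, min_passes=4):
    """Return (ok, passes) for a 'must pass min_passes of n_runs' rule."""
    passes = pass_rate(run_once, n_runs)
    return passes >= min_passes, passes


# Simulated flaky task: passes roughly 90% of the time.
rng = random.Random(0)


def flaky_task():
    return rng.random() < 0.9


ok, passes = meets_threshold(flaky_task, n_runs=5, min_passes=4)
print(f"{passes}/5 runs passed -> {'PASS' if ok else 'FAIL'}")
```

Per-task thresholds could live in the task dict itself (for example as hypothetical runs and min_passes keys), so that strict tasks and tolerant tasks share one harness.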

Troubleshooting

Tasks pass locally, fail in CI (or vice versa)

Hunt down nondeterminism: model temperature, wall-clock-dependent tools, or shared state between tasks. Evals must construct a fresh agent per task — reused sessions leak context and make results order-dependent.
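To make the fresh-agent point concrete, here is a toy illustration of how shared state makes results order-dependent, and how passing a factory instead of an instance avoids it. CountingAgent and run_isolated are hypothetical stand-ins, not part of the harness above.

```python
class CountingAgent:
    """Toy agent with internal state, standing in for a real session."""

    def __init__(self):
        self.history = []

    def __call__(self, prompt, max_turns=10):
        self.history.append(prompt)
        # The answer leaks how many prompts this instance has seen.
        return f"seen {len(self.history)} prompt(s)", []


def run_isolated(make_agent, prompts):
    # Build a new agent for every prompt so no state carries over.
    return [make_agent()(prompt)[0] for prompt in prompts]


shared = CountingAgent()
print([shared(p)[0] for p in ["a", "b"]])       # state leaks across tasks
print(run_isolated(CountingAgent, ["a", "b"]))  # each task starts clean
```

With the shared instance the second answer depends on the first task having run; with the factory, both tasks see an identical starting state regardless of order.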

Every prompt change breaks half the expect_contains checks

Your checks are testing phrasing, not outcomes. Assert on facts (“18°C”), IDs, or tool effects — never on sentence structure. If a check can be broken by a synonym, it’s too tight.
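One way to assert on the fact rather than the phrasing is to normalize how the fact can be written. The sketch below is an assumption about what "the same fact" means for a temperature, and mentions_temperature is a hypothetical helper; it accepts a few common spellings of a Celsius reading and rejects near-misses such as 118°C.

```python
import re


def mentions_temperature(answer, degrees):
    """True if the answer states the given Celsius value in a common spelling."""
    # (?<!\d) stops "18" from matching inside "118".
    pattern = rf"(?<!\d){degrees}\s*(?:°\s*C|degrees(?:\s+Celsius)?)\b"
    return re.search(pattern, answer, flags=re.IGNORECASE) is not None


print(mentions_temperature("Zurich right now: 18°C and clear.", 18))
print(mentions_temperature("It's a clear 18 degrees Celsius in Zurich.", 18))
print(mentions_temperature("The oven is at 118°C.", 18))
```

The first two calls print True and the third prints False. A check like this survives rewording of the sentence around the number while still failing when the number itself is wrong.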
