How to choose an LLM for your agent: Claude, GPT, Gemini, or open-weights
General benchmarks predict agent performance poorly. The six criteria that matter for agentic workloads, a tiering strategy that beats single-model loyalty, and the eval-driven way to decide.
The wrong question is “which model is best?” — the leaderboard answer changes quarterly and mostly measures things agents don’t do. The right question is “which model is best at each job inside my system?” That one has a stable method, even as the names shift.
Why general benchmarks mislead for agents
An agent asks a model to do three unusual things: emit precisely formatted tool calls over long trajectories, notice and recover when a tool fails, and know when to stop. General chat benchmarks measure none of these. Two models with near-identical benchmark scores can differ wildly in multi-step tool reliability — and that difference, compounded over a 15-step trajectory, is the whole ballgame: 98% per-step reliability is ~74% per-trajectory; 90% is ~21%.
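The compounding arithmetic above can be checked in a few lines. This is a minimal sketch that assumes every step succeeds or fails independently with the same probability, which is a simplification; the function name `trajectory_success` is illustrative.

```python
def trajectory_success(per_step: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps
    with identical per-step reliability (a simplifying assumption)."""
    return per_step ** steps


for p in (0.98, 0.90):
    rate = trajectory_success(p, 15)
    print(f"{p:.0%} per step over 15 steps -> {rate:.0%} per trajectory")
```

Real trajectories are rarely independent step to step (one bad tool call often derails the next), so treat the result as a rough guide rather than a prediction.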
The six criteria that actually matter
- Tool-calling reliability — valid calls, correct arguments, sustained over many turns. The dominant factor.
- Error recovery — shown a failure, does it adapt or repeat itself?
- Instruction persistence — does the system prompt still bind on turn 30, or has it drifted?
- Long-context behavior — not the advertised window; the quality of attention across a full, tool-result-stuffed context.
- Latency & cost shape — agents multiply per-call latency by every step; a slow frontier model can lose to a fast good-enough one.
- Deployment constraints — data residency, privacy, offline needs. If data can’t leave, open-weights served locally is your bracket, and the comparison happens inside it.
The strategy: tier, don’t marry
Mature agentic systems are polyglot, running different models for different jobs, and yours should plan to be:
| Tier | Job | What to use |
|---|---|---|
| Planner | Decompose goals, recover from surprises, synthesize | A frontier model (Claude, GPT, Gemini class) |
| Worker | Routine bounded steps: extract, classify, summarize, draft | A mid-tier or small hosted model, or a strong open-weights model |
| Router/guard | Pick a category, validate a format, gate an input | A small fast model — increasingly, a local one |
Two mechanics make this cheap: a gateway, so routing is config rather than code, and per-tier golden tasks, so promoting or demoting a model between tiers is a measurement, not a debate.
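One way to picture "routing as config" is a plain lookup table that the gateway consults, plus a swap that only happens when a candidate clears a golden-task bar. The sketch below is hypothetical throughout: the model names are placeholders rather than real model IDs, and `resolve`, `with_candidate`, and the threshold are names invented for illustration, not any gateway product's API.

```python
from dataclasses import dataclass

# Hypothetical routing table. In practice this would live in a config file
# read by the gateway; the model names are placeholders, not real model IDs.
ROUTING = {
    "planner": "frontier-model-placeholder",
    "worker": "mid-tier-model-placeholder",
    "router": "small-local-model-placeholder",
}


@dataclass(frozen=True)
class Route:
    tier: str
    model: str


def resolve(tier: str, routing: dict[str, str] = ROUTING) -> Route:
    """Look up which model serves a tier; callers never hard-code a model."""
    if tier not in routing:
        raise KeyError(f"unknown tier: {tier!r}")
    return Route(tier=tier, model=routing[tier])


def with_candidate(routing: dict[str, str], tier: str, candidate: str,
                   pass_rate: float, threshold: float) -> dict[str, str]:
    """Return a new table with `candidate` serving `tier`, but only if its
    golden-task pass rate meets the threshold. The input is never mutated."""
    if tier not in routing:
        raise KeyError(f"unknown tier: {tier!r}")
    if pass_rate < threshold:
        return dict(routing)
    return {**routing, tier: candidate}


print(resolve("planner"))
```

Because agent code only ever asks for a tier, swapping the model behind it is an edit to the table, and the pass-rate check keeps that edit tied to a measurement.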
The eval-driven decision, in five steps
- Write ~20 golden tasks from your workload — including tool-failure and should-not-act cases.
- Run each candidate N times per task (nondeterminism is real); record pass rate, tokens, latency, and cost per completed trajectory — cost per token flatters verbose models.
- Kill candidates that fail your constraint bracket (residency, latency ceiling) before admiring their scores.
- Pick per tier, not overall. The best planner is rarely the best router.
- Re-run quarterly and on every major model release. The method is stable; the winners aren’t — that’s the point of owning evals instead of opinions.
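The measurement loop in the steps above might look roughly like the following. It is a sketch under stated assumptions: `run_task` stands in for whatever actually drives your agent against one golden task, `RunResult` and `Summary` are hypothetical record types, and "cost per completed trajectory" is interpreted here as total spend (failed runs included) divided by the number of passing runs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    passed: bool
    tokens: int
    latency_s: float
    cost_usd: float


@dataclass
class Summary:
    pass_rate: float
    mean_latency_s: float
    total_tokens: int
    cost_per_completed: float  # total spend divided by passing runs


def evaluate(run_task: Callable[[str], RunResult],
             tasks: list[str], n: int) -> Summary:
    """Run every golden task n times and aggregate the results."""
    results = [run_task(task) for task in tasks for _ in range(n)]
    if not results:
        raise ValueError("need at least one task and n >= 1")
    passed = sum(r.passed for r in results)
    total_cost = sum(r.cost_usd for r in results)
    return Summary(
        pass_rate=passed / len(results),
        mean_latency_s=sum(r.latency_s for r in results) / len(results),
        total_tokens=sum(r.tokens for r in results),
        # Failed runs still cost money, so they stay in the numerator.
        cost_per_completed=total_cost / passed if passed else float("inf"),
    )


def within_bracket(summary: Summary, max_latency_s: float) -> bool:
    """Constraint check to apply before comparing scores."""
    return summary.mean_latency_s <= max_latency_s
```

Running `evaluate` once per candidate and per tier yields comparable `Summary` records; candidates that fail `within_bracket` (or a residency check, not modeled here) can be dropped before their pass rates are compared.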
The traps
- Benchmark chasing: switching planners for a two-point leaderboard delta while your tools lack validation is optimizing the wrong layer.
- Cost-per-token myopia: a cheaper model that takes three extra recovery turns is a more expensive model.
- Single-vendor identity: “we’re a <vendor> shop” is a procurement statement, not an architecture. Keep the harness provider-agnostic and the identity costs you nothing when it changes.