How to choose an LLM for your agent: Claude, GPT, Gemini, or open-weights

The wrong question is “which model is best?” — the leaderboard answer changes quarterly and mostly measures things agents don’t do. The right question is “which model is best at each job inside my system?” That one has a stable method, even as the names shift.

Why general benchmarks mislead for agents

An agent asks a model to do three unusual things: emit precisely formatted tool calls over long trajectories, notice and recover when a tool fails, and know when to stop. General chat benchmarks measure none of these. Two models with near-identical benchmark scores can differ wildly in multi-step tool reliability, and that difference, compounded over a 15-step trajectory, is the whole ballgame. If each step succeeds independently, 98% per-step reliability gives 0.98^15 ≈ 74% per-trajectory success; 90% gives 0.90^15 ≈ 21%.
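The compounding is easy to check for your own trajectory lengths. The sketch below assumes every step succeeds or fails independently, which is a simplification (real failures tend to cluster), and the function name is only illustrative.

```python
def trajectory_success(per_step: float, steps: int) -> float:
    """Probability that every step in a trajectory succeeds.

    Assumption: steps are independent, so the probabilities multiply.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


if __name__ == "__main__":
    # Compare a few per-step reliabilities over a 15-step trajectory.
    for p in (0.99, 0.98, 0.95, 0.90):
        rate = trajectory_success(p, 15)
        print(f"{p:.0%} per step over 15 steps: {rate:.0%} per trajectory")
```

Plugging in your own step count is the useful part: the longer the trajectory, the more a one- or two-point gap in per-step reliability dominates everything else.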

The six criteria that actually matter

Tool-calling reliability — valid calls, correct arguments, sustained over many turns. The dominant factor.
Error recovery — shown a failure, does it adapt or repeat itself?
Instruction persistence — does the system prompt still bind on turn 30, or has it drifted?
Long-context behavior — not the advertised window; the quality of attention across a full, tool-result-stuffed context.
Latency & cost shape — agents multiply per-call latency by every step; a slow frontier model can lose to a fast good-enough one.
Deployment constraints — data residency, privacy, offline needs. If data can’t leave, open-weights served locally is your bracket, and the comparison happens inside it.

The strategy: tier, don’t marry

Mature agentic systems are polyglot, running different models for different jobs, and you should plan for yours to be too:

Tier	Job	What to use
Planner	Decompose goals, recover from surprises, synthesize	A frontier model (Claude, GPT, Gemini class)
Worker	Routine bounded steps: extract, classify, summarize, draft	A mid-tier or small hosted model, or a strong open-weights model
Router/guard	Pick a category, validate a format, gate an input	A small fast model — increasingly, a local one

Two mechanics make this cheap: a gateway, so that routing is config rather than code, and per-tier golden tasks, so that promoting or demoting a model between tiers is a measurement, not a debate.
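A minimal sketch of both mechanics, under stated assumptions: the routing table stands in for whatever config your gateway actually reads, the model names are placeholders rather than real model IDs, and the replacement rule (clear the tier's pass-rate bar, then win on cost) is one reasonable policy, not the only one.

```python
from dataclasses import dataclass
from typing import Dict, Mapping

# Hypothetical routing table. In practice this would live in gateway config,
# so changing a tier's model is a config edit rather than a code change.
ROUTES: Dict[str, str] = {
    "planner": "frontier-model-a",
    "worker": "mid-tier-model-b",
    "router": "small-local-model-c",
}


@dataclass(frozen=True)
class TierResult:
    """Golden-task outcome for one model on one tier."""
    model: str
    pass_rate: float            # fraction of golden-task runs that passed
    cost_per_completion: float  # spend per completed trajectory


def model_for(tier: str, routes: Mapping[str, str] = ROUTES) -> str:
    """Resolve a tier to a model so callers never hard-code a model name."""
    if tier not in routes:
        raise ValueError(f"unknown tier: {tier!r}")
    return routes[tier]


def should_replace(incumbent: TierResult, challenger: TierResult,
                   min_pass_rate: float) -> bool:
    """Decide a tier swap from measurements alone."""
    if challenger.pass_rate < min_pass_rate:
        return False  # challenger misses the tier's bar
    if incumbent.pass_rate < min_pass_rate:
        return True   # incumbent misses the bar and the challenger clears it
    # Both clear the bar: the cheaper one per completed trajectory wins.
    return challenger.cost_per_completion < incumbent.cost_per_completion


def with_route(routes: Mapping[str, str], tier: str, model: str) -> Dict[str, str]:
    """Return a new routing table with one tier reassigned."""
    if tier not in routes:
        raise ValueError(f"unknown tier: {tier!r}")
    return {**routes, tier: model}
```

Usage might look like this, with made-up measurements:

```python
incumbent = TierResult("mid-tier-model-b", pass_rate=0.95, cost_per_completion=0.040)
challenger = TierResult("small-local-model-c", pass_rate=0.96, cost_per_completion=0.012)
if should_replace(incumbent, challenger, min_pass_rate=0.95):
    routes = with_route(ROUTES, "worker", challenger.model)
```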

The eval-driven decision, in five steps

Write ~20 golden tasks from your workload — including tool-failure and should-not-act cases.
Run each candidate N times per task (nondeterminism is real); record pass rate, tokens, latency, and cost per completed trajectory — cost per token flatters verbose models.
Kill candidates that fail your constraint bracket (residency, latency ceiling) before admiring their scores.
Pick per tier, not overall. The best planner is rarely the best router.
Re-run quarterly and on every major model release. The method is stable; the winners aren’t — that’s the point of owning evals instead of opinions.
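The steps above can be sketched as a small harness. Everything named here is an assumption for illustration: `run` is a hypothetical callable that executes one golden task once against one model and grades it, `allowed` stands in for your constraint bracket, and the selection rule (cheapest per completed trajectory among candidates that clear the thresholds) is one possible policy.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class RunResult:
    passed: bool
    tokens: int
    latency_s: float
    cost: float  # spend for this single run


@dataclass
class Summary:
    model: str
    pass_rate: float
    mean_tokens: float
    mean_latency_s: float
    cost_per_completion: float  # total spend / passed runs; inf if none passed


# Hypothetical signature: (model, task) -> graded result of one run.
RunFn = Callable[[str, str], RunResult]


def evaluate(model: str, tasks: List[str], run: RunFn, n: int = 5) -> Summary:
    """Step 2: run every golden task n times and aggregate."""
    if not tasks or n < 1:
        raise ValueError("need at least one task and n >= 1")
    results = [run(model, task) for task in tasks for _ in range(n)]
    passed = sum(r.passed for r in results)
    total_cost = sum(r.cost for r in results)
    return Summary(
        model=model,
        pass_rate=passed / len(results),
        mean_tokens=sum(r.tokens for r in results) / len(results),
        mean_latency_s=sum(r.latency_s for r in results) / len(results),
        # Failed runs still cost money, so they are charged to the successes.
        cost_per_completion=total_cost / passed if passed else float("inf"),
    )


def choose_for_tier(candidates: List[str], allowed: Set[str], tasks: List[str],
                    run: RunFn, min_pass_rate: float, max_latency_s: float,
                    n: int = 5) -> Optional[Summary]:
    """Steps 3 and 4: filter by constraints, then pick for this tier only."""
    # Drop models outside the constraint bracket before scoring them at all.
    in_bracket = [m for m in candidates if m in allowed]
    summaries = [evaluate(m, tasks, run, n) for m in in_bracket]
    eligible = [s for s in summaries
                if s.pass_rate >= min_pass_rate
                and s.mean_latency_s <= max_latency_s]
    return min(eligible, key=lambda s: s.cost_per_completion, default=None)
```

Calling `choose_for_tier` once per tier, each with its own golden tasks and thresholds, keeps the planner and router decisions separate. A return value of `None` means no candidate survived, which is itself a useful result.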

The traps

Benchmark chasing: switching planners for a two-point leaderboard delta while your tools lack validation is optimizing the wrong layer.
Cost-per-token myopia: a model that is cheaper per token but needs three extra recovery turns can easily be the more expensive model per completed task.
Single-vendor identity: “we’re a <vendor> shop” is a procurement statement, not an architecture. Keep the harness provider-agnostic and the identity costs you nothing when it changes.
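The cost-per-token trap is worth working through once with numbers. The sketch below is a toy model under loud assumptions: every turn re-sends the whole context so far, each turn adds the same number of tokens, all tokens are billed at one flat rate, and there is no prompt caching. The prices and turn counts are invented for illustration.

```python
def trajectory_cost(turns: int, tokens_per_turn: int,
                    price_per_million: float) -> float:
    """Cost of one trajectory when each turn re-sends the context so far.

    Assumptions: context grows by tokens_per_turn every turn, one flat
    per-token price, no prompt caching.
    """
    if turns < 0 or tokens_per_turn < 0 or price_per_million < 0:
        raise ValueError("arguments must be non-negative")
    # Turn k bills k * tokens_per_turn, so the total is a triangular sum.
    billed_tokens = tokens_per_turn * turns * (turns + 1) // 2
    return billed_tokens * price_per_million / 1_000_000


if __name__ == "__main__":
    # Hypothetical: the pricier model finishes in 5 turns; the half-price
    # model needs three extra recovery turns.
    pricey = trajectory_cost(turns=5, tokens_per_turn=2000, price_per_million=3.00)
    cheap = trajectory_cost(turns=8, tokens_per_turn=2000, price_per_million=1.50)
    print(f"pricey model: ${pricey:.3f} per trajectory")
    print(f"cheap model:  ${cheap:.3f} per trajectory")
```

With these made-up figures the half-price model bills 72,000 tokens against 30,000, so it costs about $0.108 per trajectory versus $0.090. Under this toy model, extra turns cost more than linearly because each one re-sends a longer context.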