FinOps for AI agents: metering token spend before it meters you

Agent costs don’t grow like software costs. They grow like usage × verbosity × retries × context length — four multipliers, each of which an engineer can change with one innocent commit. If you’re serious about agents in production, cost telemetry is part of the architecture, not a finance afterthought. Here is the reference shape.

The three-layer architecture

flowchart LR
    A["Agents & apps"] --> G["LLM gateway<br/>(routing · budgets · keys)"]
    G --> P["Model providers<br/>(hosted APIs / local serving)"]
    A -. "OTel GenAI spans" .-> O["Observability store<br/>(traces · dashboards · alerts)"]
    G -. "usage & cost records" .-> O

Layer 1 — Instrumentation (standards, not vendors). Emit every model call as a span following the OpenTelemetry GenAI semantic conventions — model name, token counts in/out, and the operation it served. Standardizing on OTel keeps you portable across observability vendors and lets one dashboard cover ten teams’ agents.
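As a rough illustration of what such a span carries, the sketch below builds a span name and attribute set for one model call. The attribute keys are taken from the OpenTelemetry GenAI semantic conventions, which are still evolving, so check them against the version you pin. `ModelCall` and `genai_span` are hypothetical helpers, not part of any SDK; in a real service the name and attributes would be handed to your OpenTelemetry tracer when the span is started.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCall:
    """One model invocation as seen by the caller (hypothetical shape)."""
    operation: str       # what the call did, e.g. "chat"
    model: str           # model name that was requested
    input_tokens: int
    output_tokens: int


def genai_span(call: ModelCall) -> tuple[str, dict[str, object]]:
    """Return (span name, attributes) for one model call.

    Keys follow the OTel GenAI semantic conventions as of this writing;
    treat them as an assumption and verify against your pinned version.
    """
    name = f"{call.operation} {call.model}"
    attributes = {
        "gen_ai.operation.name": call.operation,
        "gen_ai.request.model": call.model,
        "gen_ai.usage.input_tokens": call.input_tokens,
        "gen_ai.usage.output_tokens": call.output_tokens,
    }
    return name, attributes


span_name, span_attrs = genai_span(ModelCall("chat", "example-model", 1200, 150))
```

Keeping the attribute mapping in one function means a convention rename touches one place rather than every agent.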

Layer 2 — Gateway (the enforcement point). Route all model traffic through one gateway (LiteLLM is the common open-source choice; several commercial equivalents exist). The gateway is where cost controls live, because it’s the only place that sees every call: per-team API keys, budgets with hard caps, model allowlists, and automatic fallback routing. Instrumentation without an enforcement point is a dashboard of regrets.
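To make the enforcement idea concrete, here is a minimal in-memory sketch of the decision a gateway makes before forwarding a call: check the hard cap, check the allowlist, then fall back to another allowed model if the requested one is unavailable. It does not use LiteLLM's or any other gateway's actual API; `TeamPolicy`, `route`, the key string, and the model names are all hypothetical.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """The team has used up its hard cap."""


class ModelNotAllowed(Exception):
    """The requested model is not on the team's allowlist."""


@dataclass
class TeamPolicy:
    budget_usd: float            # hard cap for the billing period
    allowed_models: list[str]    # allowlist, in fallback order
    spent_usd: float = 0.0


def route(policy: TeamPolicy, requested: str,
          unavailable: frozenset[str] = frozenset()) -> str:
    """Return the model to call, or raise if the call must be refused."""
    if policy.spent_usd >= policy.budget_usd:
        raise BudgetExceeded(f"spent {policy.spent_usd} of {policy.budget_usd}")
    if requested not in policy.allowed_models:
        raise ModelNotAllowed(requested)
    # Try the requested model first, then the rest of the allowlist in order.
    candidates = [requested] + [m for m in policy.allowed_models if m != requested]
    for model in candidates:
        if model not in unavailable:
            return model
    raise RuntimeError("no allowed model is currently available")


# Policies are looked up by the per-team API key presented with the request.
policies = {
    "key-team-search": TeamPolicy(budget_usd=100.0,
                                  allowed_models=["large-model", "small-model"]),
}
policy = policies["key-team-search"]
chosen = route(policy, "large-model", unavailable=frozenset({"large-model"}))
```

Because the check runs on every request in one place, a team cannot bypass it by changing client code.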

Layer 3 — Attribution (the FinOps part). Tag every call with the dimensions finance will ask for — team, application, environment, and task type. Cost-per-call is trivia; cost-per-outcome (per resolved ticket, per document processed, per PR merged) is the number that decides whether the agent survives budget season.
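The step from cost-per-call to cost-per-outcome is a join between tagged call records and an outcome count supplied by the business. A small sketch, assuming each call record carries the tags described above and that outcome counts arrive keyed by team and task type; the record shape and the figures are invented for illustration.

```python
from collections import defaultdict

# Tagged call records, as the gateway or tracing layer might export them.
calls = [
    {"team": "support", "app": "helpdesk-agent", "env": "prod",
     "task_type": "ticket_resolution", "cost_usd": 0.25},
    {"team": "support", "app": "helpdesk-agent", "env": "prod",
     "task_type": "ticket_resolution", "cost_usd": 0.5},
    {"team": "support", "app": "helpdesk-agent", "env": "prod",
     "task_type": "ticket_resolution", "cost_usd": 0.25},
    {"team": "docs", "app": "ingest-agent", "env": "prod",
     "task_type": "document_processing", "cost_usd": 0.75},
]

# Outcome counts come from the business system, not from the model traffic.
outcomes = {
    ("support", "ticket_resolution"): 2,     # tickets actually resolved
    ("docs", "document_processing"): 3,      # documents actually processed
}


def cost_per_outcome(calls, outcomes):
    """Total spend per (team, task type) divided by the outcomes achieved."""
    spend = defaultdict(float)
    for call in calls:
        spend[(call["team"], call["task_type"])] += call["cost_usd"]
    return {key: spend[key] / count
            for key, count in outcomes.items() if count > 0}


unit_costs = cost_per_outcome(calls, outcomes)
```

Here three calls costing 1.00 in total resolved two tickets, so the support figure is 0.50 per resolved ticket, even though no single call cost that much.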

The four agent-specific cost patterns

The context ratchet. Conversations and memories grow, so input tokens come to dominate spend. Watch the input:output token ratio per task type: a rising ratio means the context needs pruning, not a bigger budget.
The retry spiral. A failing tool or a strict validator can put an agent in a paid loop. Cap attempts per goal in the harness and alert on outlier trajectories (cost per trajectory above the 99th percentile).
The model-mismatch tax. Frontier models doing classification is the agent era’s “dev database on production hardware.” Route by task tier at the gateway; re-evaluate quarterly as prices move.
The silent multiplier. Multi-agent systems re-send overlapping context on every hop. Meter per-trajectory, not per-call, or orchestration overhead hides inside averages.
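Two of the checks above, per-trajectory metering with p99 outlier detection and the input:output ratio per task type, can be sketched over the same call records. This assumes every call carries a `trajectory_id` and `task_type` tag (hypothetical field names) and uses synthetic data in which one trajectory has fallen into a retry loop.

```python
from collections import defaultdict
from statistics import quantiles


def cost_by_trajectory(calls):
    """Sum cost over every call that belongs to the same trajectory."""
    totals = defaultdict(float)
    for call in calls:
        totals[call["trajectory_id"]] += call["cost_usd"]
    return dict(totals)


def p99_outliers(totals):
    """Trajectory ids whose total cost exceeds the 99th percentile."""
    # n=100 yields 99 cut points; index 98 is the 99th percentile.
    # "inclusive" interpolates within the observed range, so small
    # samples are not extrapolated past the maximum.
    cutoff = quantiles(totals.values(), n=100, method="inclusive")[98]
    return sorted(tid for tid, cost in totals.items() if cost > cutoff)


def input_output_ratio(calls):
    """Input tokens divided by output tokens, per task type."""
    tokens = defaultdict(lambda: [0, 0])
    for call in calls:
        tokens[call["task_type"]][0] += call["input_tokens"]
        tokens[call["task_type"]][1] += call["output_tokens"]
    return {task: i / o for task, (i, o) in tokens.items() if o > 0}


# Nine ordinary trajectories of two calls each, plus one retry spiral.
calls = [
    {"trajectory_id": f"t{n}", "task_type": "triage", "cost_usd": 0.5,
     "input_tokens": 900, "output_tokens": 100}
    for n in range(9) for _ in range(2)
]
calls += [
    {"trajectory_id": "t9", "task_type": "triage", "cost_usd": 1.0,
     "input_tokens": 4000, "output_tokens": 100}
    for _ in range(50)
]

totals = cost_by_trajectory(calls)
outliers = p99_outliers(totals)
ratios = input_output_ratio(calls)
```

Averaged per call, the spiral is barely visible: each of its calls costs only twice a normal one. Summed per trajectory, it costs fifty times as much as its neighbours, which is why the metering unit matters.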

Minimum viable setup, in order

1. Gateway in front of all model traffic (an afternoon), with per-team keys.
2. OTel GenAI spans from the gateway into the observability stack you already run.
3. One dashboard: spend by team × model × task type, plus trajectory-cost outliers.
4. Hard budget caps per environment, enforced at the gateway: alert at 80% of the cap, block at 100% in non-production.
5. Cost-per-outcome metrics, defined with the business owner, reported monthly.
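The budget-cap step reduces to a small policy function. A sketch under one stated assumption: the text only specifies blocking in non-production, so this version lets production keep serving at 100% and raises an alert instead. `Action`, `budget_action`, and the environment names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    ALERT = "alert"   # let the call through, but notify the owner
    BLOCK = "block"   # refuse the call


def budget_action(spent_usd: float, cap_usd: float, environment: str,
                  alert_fraction: float = 0.8) -> Action:
    """Decide what the gateway does with the next call for this budget."""
    if cap_usd <= 0:
        raise ValueError("cap_usd must be positive")
    used = spent_usd / cap_usd
    if used >= 1.0:
        # Assumption: production is never hard-blocked, only alerted.
        return Action.ALERT if environment == "production" else Action.BLOCK
    if used >= alert_fraction:
        return Action.ALERT
    return Action.ALLOW


decisions = [
    budget_action(50.0, 100.0, "staging"),
    budget_action(80.0, 100.0, "staging"),
    budget_action(100.0, 100.0, "staging"),
    budget_action(120.0, 100.0, "production"),
]
```

Whether production should also block at some higher multiple of the cap is a decision for the service owner; the function leaves room for that branch.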