FinOps for AI agents: metering token spend before it meters you
An architecture for agent cost governance: OpenTelemetry GenAI conventions for instrumentation, a gateway for enforcement, and the reporting dimensions finance will actually ask for.
Agent costs don’t grow like software costs. They grow like usage × verbosity × retries × context length — four multipliers, each of which an engineer can change with one innocent commit. If you’re serious about agents in production, cost telemetry is part of the architecture, not a finance afterthought. Here is the reference shape.
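To see how quickly the four multipliers compound, here is a minimal arithmetic sketch. It assumes the factors are independent and expressed relative to a baseline of 1.0; the specific percentages are hypothetical, chosen only for illustration.

```python
from math import prod

def relative_cost(usage=1.0, verbosity=1.0, retries=1.0, context_length=1.0):
    """Spend relative to a baseline of 1.0, treating the four factors as independent multipliers."""
    return prod([usage, verbosity, retries, context_length])

# Hypothetical effects of a few "innocent" commits:
# 20% more calls, 50% longer outputs, one extra retry per three calls, double the context.
after = relative_cost(usage=1.2, verbosity=1.5, retries=4 / 3, context_length=2.0)
print(f"{after:.1f}x baseline")  # prints "4.8x baseline"
```

None of the individual changes looks alarming on its own; the product is what shows up on the invoice.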
The three-layer architecture
flowchart LR
A["Agents & apps"] --> G["LLM gateway<br/>(routing · budgets · keys)"]
G --> P["Model providers<br/>(hosted APIs / local serving)"]
A -. "OTel GenAI spans" .-> O["Observability store<br/>(traces · dashboards · alerts)"]
G -. "usage & cost records" .-> O
Layer 1 — Instrumentation (standards, not vendors). Emit every model call as a span following the OpenTelemetry GenAI semantic conventions — model name, token counts in/out, and the operation it served. Standardizing on OTel keeps you portable across observability vendors and lets one dashboard cover ten teams’ agents.
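As a rough illustration of what one instrumented call carries, the sketch below builds the attribute set for a single model call as a plain dictionary. The keys follow the OpenTelemetry GenAI semantic conventions as commonly published (`gen_ai.request.model`, `gen_ai.usage.input_tokens`, and so on), but those conventions are still evolving, so check the current spec before relying on exact names. The helper `genai_span_attributes` and the model name are hypothetical.

```python
def genai_span_attributes(model, input_tokens, output_tokens, operation="chat", extra=None):
    """Build span attributes for one model call.

    Key names are assumed from the OTel GenAI semantic conventions;
    verify them against the version of the spec you adopt.
    """
    attrs = {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    if extra:
        attrs.update(extra)  # room for your own attribution tags
    return attrs

# "example-model" is a placeholder, not a real model identifier.
attrs = genai_span_attributes("example-model", input_tokens=1200, output_tokens=150)
for key, value in attrs.items():
    print(key, "=", value)
```

With an OpenTelemetry SDK in place, each key and value would be set on the span that wraps the model call, for example through the span's `set_attribute` method.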
Layer 2 — Gateway (the enforcement point). Route all model traffic through one gateway (LiteLLM is the common open-source choice; several commercial equivalents exist). The gateway is where cost controls live, because it’s the only place that sees every call: per-team API keys, budgets with hard caps, model allowlists, and automatic fallback routing. Instrumentation without an enforcement point is a dashboard of regrets.
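The controls named above can be sketched as a toy in-memory gateway. This is not LiteLLM's API or any vendor's; `Gateway`, `TeamPolicy`, the key names, and the model names are all hypothetical, and a real gateway would persist spend and handle concurrency.

```python
from dataclasses import dataclass, field

class BudgetExceeded(Exception):
    """Raised when a team's key has used up its budget."""

@dataclass
class TeamPolicy:
    budget_usd: float
    allowed_models: list  # ordered by preference; later entries act as fallbacks
    spent_usd: float = 0.0

@dataclass
class Gateway:
    policies: dict  # per-team API key -> TeamPolicy
    unavailable: set = field(default_factory=set)  # models currently failing

    def route(self, api_key, requested_model):
        policy = self.policies[api_key]
        if policy.spent_usd >= policy.budget_usd:
            raise BudgetExceeded(f"{api_key} has spent its budget")
        if requested_model not in policy.allowed_models:
            raise PermissionError(f"{requested_model} is not on the allowlist")
        # Try the requested model first, then the rest of the allowlist in order.
        candidates = [requested_model] + [
            m for m in policy.allowed_models if m != requested_model
        ]
        for model in candidates:
            if model not in self.unavailable:
                return model
        raise RuntimeError("no allowed model is available")

    def record(self, api_key, cost_usd):
        self.policies[api_key].spent_usd += cost_usd

gw = Gateway(policies={
    "key-support": TeamPolicy(budget_usd=100.0,
                              allowed_models=["small-model", "backup-model"]),
})
print(gw.route("key-support", "small-model"))  # small-model
gw.unavailable.add("small-model")
print(gw.route("key-support", "small-model"))  # backup-model
```

Because every call passes through `route`, the budget check cannot be skipped by a team that forgets to instrument its agent.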
Layer 3 — Attribution (the FinOps part). Tag every call with the dimensions finance will ask for — team, application, environment, and task type. Cost-per-call is trivia; cost-per-outcome (per resolved ticket, per document processed, per PR merged) is the number that decides whether the agent survives budget season.
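A minimal sketch of the cost-per-outcome calculation, assuming each call record already carries the four tags plus a `cost_usd` field, and that outcome counts (resolved tickets, merged PRs) come from a business system. The record shape, the function name, and the sample figures are hypothetical.

```python
from collections import defaultdict

def cost_per_outcome(calls, outcomes, environment="production"):
    """Divide tagged spend by completed outcomes.

    calls:    dicts with team, application, environment, task_type, cost_usd
    outcomes: {(team, application, task_type): number of completed outcomes}
    """
    spend = defaultdict(float)
    for call in calls:
        if call["environment"] == environment:
            key = (call["team"], call["application"], call["task_type"])
            spend[key] += call["cost_usd"]
    return {key: spend[key] / count for key, count in outcomes.items() if count > 0}

tags = {"team": "support", "application": "helpdesk-agent", "task_type": "ticket_resolution"}
calls = [
    {**tags, "environment": "production", "cost_usd": 0.30},
    {**tags, "environment": "production", "cost_usd": 0.50},
    {**tags, "environment": "staging", "cost_usd": 9.99},  # excluded from the production figure
]
outcomes = {("support", "helpdesk-agent", "ticket_resolution"): 4}

for key, value in cost_per_outcome(calls, outcomes).items():
    print(key, f"${value:.2f} per outcome")
```

The denominator is the hard part in practice: the tags come from engineering, but the outcome count has to be agreed with whoever owns the business metric.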
The four agent-specific cost patterns
- The context ratchet. Conversations and memories grow; input tokens dominate spend. Watch input:output ratio per task type — a rising ratio means context needs pruning, not a bigger budget.
- The retry spiral. A failing tool or a strict validator can put an agent in a paid loop. Cap attempts per goal in the harness and alert on outlier trajectories (cost per trajectory > p99).
- The model-mismatch tax. Frontier models doing classification is the agent era’s “dev database on production hardware.” Route by task tier at the gateway; re-evaluate quarterly as prices move.
- The silent multiplier. Multi-agent systems re-send overlapping context on every hop. Meter per-trajectory, not per-call, or orchestration overhead hides inside averages.
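Two of the checks above (per-trajectory metering with p99 outliers, and the input:output ratio per task type) can be sketched with the standard library alone. The record fields (`trajectory_id`, `task_type`, token counts, `cost_usd`) are assumptions about what your telemetry exports, and the sample data is synthetic.

```python
from collections import defaultdict
from statistics import quantiles

def rollup_by_trajectory(calls):
    """Sum cost and call count per trajectory so orchestration overhead is visible."""
    totals = defaultdict(lambda: {"cost_usd": 0.0, "calls": 0})
    for call in calls:
        entry = totals[call["trajectory_id"]]
        entry["cost_usd"] += call["cost_usd"]
        entry["calls"] += 1
    return dict(totals)

def cost_outliers(trajectories, percentile=99):
    """Return ids of trajectories whose total cost exceeds the given percentile."""
    costs = [t["cost_usd"] for t in trajectories.values()]
    # "inclusive" interpolates within the observed range, so small samples behave sensibly.
    threshold = quantiles(costs, n=100, method="inclusive")[percentile - 1]
    return {tid for tid, t in trajectories.items() if t["cost_usd"] > threshold}

def input_output_ratio_by_task(calls):
    """Input tokens divided by output tokens, grouped by task type."""
    tokens = defaultdict(lambda: [0, 0])
    for call in calls:
        tokens[call["task_type"]][0] += call["input_tokens"]
        tokens[call["task_type"]][1] += call["output_tokens"]
    return {task: i / max(o, 1) for task, (i, o) in tokens.items()}

# Synthetic data: 99 ordinary two-call trajectories and one 40-call retry spiral.
calls = []
for n in range(99):
    calls += [{"trajectory_id": f"t{n}", "task_type": "triage", "cost_usd": 0.5,
               "input_tokens": 1000, "output_tokens": 100}] * 2
calls += [{"trajectory_id": "runaway", "task_type": "triage", "cost_usd": 0.5,
           "input_tokens": 8000, "output_tokens": 100}] * 40

trajectories = rollup_by_trajectory(calls)
print(cost_outliers(trajectories))  # {'runaway'}
print(input_output_ratio_by_task(calls))
```

Every call in this data costs the same $0.50, so a per-call average would look flat; only the per-trajectory rollup exposes the runaway.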
Minimum viable setup, in order
- Gateway in front of all model traffic (roughly an afternoon of work), with per-team keys.
- OTel GenAI spans from the gateway into the observability stack you already run.
- One dashboard: spend by team × model × task type, plus trajectory-cost outliers.
- Hard budget caps per environment, enforced at the gateway: alert at 80% of budget, block at 100% in non-production environments.
- Cost-per-outcome metrics, defined with the business owner, reported monthly.
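The budget-cap step in the list above can be sketched as a single decision function. The 80% and 100% thresholds come from the list; the environment names are hypothetical, and the choice to keep alerting rather than block in production is an assumption, since the list only specifies blocking for non-production.

```python
def budget_action(spent_usd, budget_usd, environment):
    """Decide what the gateway does with the next call for this budget."""
    used = spent_usd / budget_usd
    if used >= 1.0:
        # Assumption: production traffic is never hard-blocked, only alerted on.
        return "alert" if environment == "production" else "block"
    if used >= 0.8:
        return "alert"
    return "allow"

for spent, env in [(50, "staging"), (85, "staging"), (100, "staging"), (120, "production")]:
    print(env, spent, budget_action(spent, 100, env))
# staging 50 allow
# staging 85 alert
# staging 100 block
# production 120 alert
```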