AI Strategy

Where AI Cost Actually Goes in 2026 (and What to Do About It)

12 May 2026 · 9 min read

Every few weeks I get the same call. An enterprise has moved an AI pilot into production and the bill is two or three times what they expected. The model vendor's pricing page suggested it should be cheap. The proof of concept ran on a few hundred dollars a month. Now they're looking at five figures and trying to work out where it went.

What's odd about this is that per-token prices have fallen over the last 18 months. The cost spike isn't coming from the headline numbers on the pricing page. It's coming from architecture choices that don't show up there.

The real cost categories

When I audit an AI workload, the bill usually breaks into six buckets. Most teams underestimate at least three of them.

1. Context — the silent multiplier

The biggest surprise for most teams: input tokens dominate. A typical enterprise agent sends along a system prompt, a knowledge base excerpt, a chat history, and sometimes tool definitions. That's often 8,000 to 20,000 tokens of input for an output that might be 200 tokens.

If your provider charges $3 per million input tokens and $15 per million output, the input accounts for roughly 89% to 95% of the per-call cost at those sizes, and it gets repeated on every turn of a conversation. Caching helps enormously, but most teams don't enable it.
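The input share of a single call can be checked directly. This is a rough sketch that uses only the prices and token counts quoted above; the function name is illustrative.

```python
# Prices from the text, in dollars per million tokens.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def input_share(input_tokens: int, output_tokens: int) -> float:
    """Return the fraction of one call's cost that comes from input tokens."""
    input_cost = input_tokens * INPUT_PRICE_PER_MTOK / 1_000_000
    output_cost = output_tokens * OUTPUT_PRICE_PER_MTOK / 1_000_000
    return input_cost / (input_cost + output_cost)


for tokens_in in (8_000, 20_000):
    share = input_share(tokens_in, 200)
    print(f"{tokens_in:>6} in / 200 out: input is {share:.0%} of the call")
```

At 8,000 input tokens the share is about 89%, and at 20,000 it is about 95%, which is why trimming or caching context tends to matter more than shaving output.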

2. Agentic loops

An agent that calls tools doesn't make one model call per user request. It often makes five, ten, or twenty. Every tool call sends the full context plus the new tool result back into the model. A single agentic interaction can quietly consume 100,000 tokens before you've noticed.

This is the category I see most underestimated. The pricing math people do in the design phase assumes one call per request. Production reality is rarely that clean.
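The gap between the one-call assumption and a tool-calling loop can be sketched with a few lines of arithmetic. The numbers here are assumptions chosen for illustration (a 10,000-token base context and 1,000 tokens per tool result), and the sketch ignores the model's own intermediate output, which would push the total higher.

```python
def loop_input_tokens(base_context: int, tool_result_tokens: int, model_calls: int) -> int:
    """Total input tokens billed across one agentic interaction.

    Assumes every call resends the base context plus all tool results so far.
    """
    total = 0
    context = base_context
    for _ in range(model_calls):
        total += context               # this call is billed for the whole context
        context += tool_result_tokens  # the next call also carries the new tool result
    return total


single_call = loop_input_tokens(10_000, 1_000, model_calls=1)
ten_calls = loop_input_tokens(10_000, 1_000, model_calls=10)
print(single_call, ten_calls)
```

Under these assumptions a ten-call loop bills 145,000 input tokens against 10,000 for the single call the design-phase estimate assumed.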

3. Retrieval infrastructure

This bucket covers vector databases, embedding APIs, and the storage and re-indexing costs of keeping a knowledge base fresh. It's not huge per query, but it adds up at scale, and it's a recurring infrastructure cost rather than a per-call one.

4. Observability and evaluation

Anything you run in production needs logging, tracing, and ongoing evaluation. Tools like LangSmith, Helicone, or homegrown equivalents have their own cost. The eval datasets you run on every model change cost real tokens — sometimes more than production traffic during active development.

5. Model redundancy

Most serious deployments now use more than one model. A cheap model handles routing or first-pass triage, a flagship handles the complex cases. Some teams add a third for verification. Each one adds latency and cost, but it usually still beats running everything through the flagship.

6. The human layer

Often missed in cost models: the people who curate prompts, review outputs, tune evaluations, and respond to escalations. For any AI system used by employees or customers, this is a real operating cost. It's labour, not cloud, but it's the difference between a system that improves and one that drifts.

The patterns that bring cost down

Prompt caching

If your provider supports it — Anthropic, OpenAI, and Google all do now — turn it on. The savings on repeated system prompts and knowledge base context are typically 60–90%. It's the easiest five-figure saving available to most teams in 2026 and most teams haven't taken it.
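To see where a figure in that range can come from, here is a back-of-envelope sketch. The 10% cached-read rate is an assumption for illustration, not a quoted price, and the sketch ignores any surcharge for writing to the cache and any cache expiry, so check your provider's pricing before relying on it.

```python
INPUT_PRICE_PER_MTOK = 3.00    # example input price used earlier in the article
CACHED_READ_MULTIPLIER = 0.10  # ASSUMPTION: cached tokens billed at 10% of the normal rate


def input_cost(prefix_tokens: int, fresh_tokens: int, cached: bool) -> float:
    """Dollar cost of one call's input, with or without a cache hit on the prefix."""
    prefix_rate = CACHED_READ_MULTIPLIER if cached else 1.0
    billable = prefix_tokens * prefix_rate + fresh_tokens
    return billable * INPUT_PRICE_PER_MTOK / 1_000_000


# 12,000 tokens of stable system prompt and knowledge base,
# plus 1,000 tokens that change on every call.
uncached = input_cost(12_000, 1_000, cached=False)
cached = input_cost(12_000, 1_000, cached=True)
print(f"saving on input: {1 - cached / uncached:.0%}")
```

With these assumed numbers the input bill drops by about 83%. The saving scales with how much of each request is a stable prefix, so it pays to put the unchanging material first.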

Model routing

Don't send every query to your most expensive model. Build a simple classifier — often a small language model itself — that decides whether a query needs the flagship or whether a smaller model will do. For a procurement assistant I worked on recently, routing cut total inference cost by about 70% with no measurable quality drop.
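A minimal sketch of the routing idea follows. The model names are hypothetical, and the classifier is a crude keyword heuristic standing in for the small model you would more likely use in practice; it is not the router from the procurement project.

```python
from typing import Callable

# Hypothetical model identifiers; substitute whatever your provider calls them.
SMALL_MODEL = "small-model"
FLAGSHIP_MODEL = "flagship-model"


def needs_flagship(query: str) -> bool:
    """Placeholder classifier, kept as a simple heuristic so the sketch runs on its own."""
    hard_markers = ("compare", "explain why", "contract", "exception")
    is_long = len(query.split()) > 60
    return is_long or any(marker in query.lower() for marker in hard_markers)


def route(query: str, classify: Callable[[str], bool] = needs_flagship) -> str:
    """Return the name of the model that should handle this query."""
    return FLAGSHIP_MODEL if classify(query) else SMALL_MODEL


print(route("What is the status of PO 4417?"))
print(route("Compare these two supplier contracts and flag unusual terms."))
```

Passing the classifier in as a function keeps the routing decision swappable, so the heuristic can be replaced by a model call later without touching the callers.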

Output budgeting

Set max_tokens aggressively. If your use case never needs more than 500 tokens of output, capping it prevents runaway generations from costing more than the value they provide. Obvious in theory, rare in production.
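One way to make the cap hard to forget is to keep it in a per-feature table and look it up whenever a request is built. The feature names and limits below are hypothetical, and the exact name of the output-limit parameter varies by provider.

```python
OUTPUT_PRICE_PER_MTOK = 15.00  # example output price used earlier in the article

# Hypothetical per-feature output caps; tune these to what each feature really needs.
OUTPUT_CAPS = {"triage": 50, "chat_reply": 500, "report_summary": 1_500}
DEFAULT_CAP = 500


def max_tokens_for(feature: str) -> int:
    """Look up the output cap to pass as max_tokens for a given feature."""
    return OUTPUT_CAPS.get(feature, DEFAULT_CAP)


def worst_case_output_cost(feature: str) -> float:
    """Upper bound on the output cost of one call, in dollars."""
    return max_tokens_for(feature) * OUTPUT_PRICE_PER_MTOK / 1_000_000


for feature in OUTPUT_CAPS:
    print(feature, max_tokens_for(feature), f"${worst_case_output_cost(feature):.4f}")
```

The cap also gives you a known worst case per call, which makes budget forecasts easier to defend.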

Batch where you can

For offline workloads — summarisation, classification, data enrichment — most providers offer batch APIs at half price or better. Anything that doesn't need a real-time response should go through the batch path.

Measure tokens like you measure latency

Every production AI system needs token usage as a first-class metric, broken out by feature, user segment, and model. Without this, cost optimisation is guesswork. With it, the high-cost paths become obvious in days.
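As a sketch of what a first-class token metric might look like, the snippet below tags each call with feature, user segment, and model, then ranks spend by any one of those tags. The record shape, model names, and prices are all assumptions for illustration; in production the records would come from your logging or tracing layer.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices as (input, output) dollars per million tokens.
PRICES = {"small-model": (0.50, 2.00), "flagship-model": (3.00, 15.00)}


@dataclass(frozen=True)
class UsageRecord:
    feature: str
    user_segment: str
    model: str
    tokens_in: int
    tokens_out: int

    @property
    def cost(self) -> float:
        price_in, price_out = PRICES[self.model]
        return (self.tokens_in * price_in + self.tokens_out * price_out) / 1_000_000


def cost_by(records: list[UsageRecord], key: str) -> list[tuple[str, float]]:
    """Total cost grouped by one tag (for example 'feature'), most expensive first."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[getattr(record, key)] += record.cost
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


records = [
    UsageRecord("contract_review", "enterprise", "flagship-model", 18_000, 400),
    UsageRecord("faq_bot", "self_serve", "small-model", 6_000, 150),
    UsageRecord("contract_review", "enterprise", "flagship-model", 15_000, 300),
]
print(cost_by(records, "feature"))
```

The same function works for `"user_segment"` or `"model"`, which is usually enough to spot the one or two paths that dominate the bill.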

What the organisations doing this well have in common

The enterprises that have AI costs under control share a few traits. They treat AI spend like cloud spend — with FinOps practices, monthly reviews, and ownership at the team level. They've usually built a thin internal abstraction over their model providers so they can route, cache, and switch without rewriting application code. And they have someone whose job includes watching the bill and asking why it moved.

The ones that don't have it under control treat AI cost as a vendor problem. They wait for the next price drop and hope for the best. The next price drop is usually coming, but it rarely fixes the architecture issue underneath.
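The thin abstraction mentioned above can be very small. This is a sketch under stated assumptions rather than a reference design: `ModelProvider`, `FakeProvider`, and `ModelGateway` are hypothetical names, and the fake provider stands in for a real vendor client so the example runs offline.

```python
from typing import Optional, Protocol


class ModelProvider(Protocol):
    """The only interface application code is allowed to depend on."""

    def complete(self, prompt: str, max_tokens: int) -> str: ...


class FakeProvider:
    """Stand-in for a real vendor client."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str, max_tokens: int) -> str:
        return f"[{self.name}] reply to: {prompt}"


class ModelGateway:
    """Single entry point where routing, caching, and logging can live."""

    def __init__(self, providers: dict[str, ModelProvider], default: str) -> None:
        self.providers = providers
        self.default = default

    def complete(self, prompt: str, max_tokens: int = 500,
                 provider: Optional[str] = None) -> str:
        chosen = self.providers[provider or self.default]
        return chosen.complete(prompt, max_tokens)


gateway = ModelGateway(
    {"a": FakeProvider("vendor-a"), "b": FakeProvider("vendor-b")}, default="a"
)
print(gateway.complete("Summarise this invoice."))
gateway.default = "b"  # switching vendors becomes a config change, not a rewrite
print(gateway.complete("Summarise this invoice."))
```

Because callers only ever see `gateway.complete`, token logging and routing rules can be added in one place later.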

Where to start

If you're worried about your own bill, the first useful thing to do is get visibility. Tag every model call with the feature that triggered it, the user segment, and the token counts in and out. A week of that data usually points at one or two paths consuming most of the budget. Fix those first. The rest can wait.

If you'd like a second pair of eyes on where your AI spend is actually going, get in touch.

AI Cost · FinOps · Enterprise AI · Production