Token Prices Fell 99%. Your AI Bill Didn't.

By Framesta Fernando · Engineering Manager & Technical Architect · 11 min read · Published June 24, 2026

TL;DR: The headline says token prices fell 99% in three years, but your effective token cost barely moved in 2026. Per-token price is not your bill. Tokens-per-request and request volume both grow faster than price falls, so the cheaper models get, the more you spend. Track effective cost, not the price sheet.

Every pricing post this year leads with the same triumphant chart: Gemini-class flash models at $0.10 per million input tokens, a 99% drop in three years. Then the same teams quoting that chart open their invoices and find spend up, not down. Both things are true at once, and the gap between them is where AI budgets go to die.

The price of a token has genuinely collapsed. Your bill did not, because the price of a token was never what you were paying. You were paying price times tokens-per-request times request volume, and while the first term fell, the other two grew faster.

The number behind the number

The mental model that misleads everyone is that a falling unit price means a falling bill. It feels like basic arithmetic: cheaper tokens, smaller invoice. Finance plans around it, engineering promises it, and then neither happens.

Look at the effective rate instead of the sticker. YipitData, tracking close to 2 quadrillion annualized tokens across API services, found that effective token pricing fell only about 6% through May of 2026, despite the headline per-model cuts (YipitData, LLM pricing trends). The 99% figure is real for a fixed model held still over three years (the LLM pricing collapse). The 6% is what actually happened to real spending, because buyers do not hold anything still. They trade every price cut for more capability: bigger context, reasoning tokens, more calls per task.

This is the oldest pattern in computing wearing a 2026 costume. When a resource gets cheaper, consumption of it rises to consume the savings and then some. Cheaper tokens do not shrink the bill, they remove the friction that was keeping usage small.

Where the savings evaporate

Put the arithmetic on a single request and the illusion falls apart. The following is illustrative, using realistic current rates. Say last year you ran a frontier model at $10 per million input tokens, sending a lean 2,000-token prompt, at 500,000 requests a month.

Last year:
  $10/1M x 2,000 tokens x 500,000 req
  = $10 x 0.002 x 500,000
  = $10,000 / month

This year the comparable-capability model dropped to $3 per million, a 70% price cut. You celebrate. But you also switched on reasoning, grew the context to include retrieved docs and history, so prompts are now 12,000 tokens, and the feature got popular, so volume doubled to 1,000,000 requests.

This year:
  $3/1M x 12,000 tokens x 1,000,000 req
  = $3 x 0.012 x 1,000,000
  = $36,000 / month

The price fell 70% and the bill rose 3.6x. Nothing here is irrational. Each decision (better model, more context, more usage) was individually correct. But the savings from the price cut were spent six times over by the growth in tokens-per-request and volume, both of which the price cut directly encouraged. The output token side is worse, since reasoning models emit far more output tokens than the old single-pass models, and output is billed at a multiple of input (per CloudZero's 2026 pricing comparison). I traced the bill-up-while-price-down dynamic in detail in how token prices dropped 99% while bills rose 320%.

The price sheet is a decoy

Here is the reframe that matters: the per-token price is the one variable in your cost equation you do not control and that trends to zero on its own. The two you do control, tokens-per-request and volume, are the ones nobody on the team is watching, and they are the ones setting your bill.

Optimizing the price you pay per token is chasing the term that is already falling fastest without your help. The frontier providers will keep cutting it. Meanwhile the prompt you send grew from 2,000 to 12,000 tokens because adding context was easy and felt free, and no one owns the number. That is the actual lever, and it is sitting unattended while the team congratulates itself on picking a cheaper model.

Think of it as cost = rate × shape × demand. Rate is the price sheet, falling on autopilot. Shape is how many tokens each request burns, set by your prompt and context design. Demand is how often you call. A team that obsesses over rate and ignores shape is optimizing the one term trending to zero and neglecting the two that compound. The cheap-model temptation makes it worse, because cheaper rates invite sloppier shapes, a trap I unpacked in the real cost of cheap AI models in production.

What to actually instrument

Stop quoting the price sheet and start tracking the terms you own:

Term Who sets it Trend without action
Rate (price per token) The provider Falls on its own
Shape (tokens per request) Your prompt/context Grows quietly
Demand (requests/month) Product success Grows with adoption

The instrumentation that controls the bill is a single derived metric: effective cost per completed task, computed from your own usage, not the vendor's rate card.

// Effective cost per task: the only number that predicts your bill.
const costPerTask = (
  inputTokens: number,
  outputTokens: number,
  inRate: number,   // $ per 1M input tokens
  outRate: number,  // $ per 1M output tokens
) => (inputTokens / 1e6) * inRate + (outputTokens / 1e6) * outRate;

// A 70% rate cut is erased by a 4x token-shape increase.
// Watch costPerTask over time, not the provider's price sheet.

Track costPerTask per release. When it rises, the cause is almost always shape (a prompt that grew, a reasoning mode left on, context you stopped trimming), not rate. Caching attacks the rate side of the repeated portion and is the highest-impact first move, which is why I put it ahead of everything else in the prompt caching bill cut most teams never turn on. Context discipline attacks the shape side, covered in the real cost of a massive context window. The pattern to watch for is a release that ships a quality win and a silent shape regression in the same commit: the demo looks sharper, the cost per task climbs 30%, and nobody connects the two until the invoice lands a month later.

Decision guidance

The trap is letting a price cut justify a usage increase nobody measured. A cheaper model is permission to ship more, not a guarantee of a smaller bill, and the two get confused in every planning meeting.

The rule: If you are about to attribute a budget plan to falling token prices, then stop and compute effective cost per task on your real traffic first, because the price cut is already priced in and the shape growth is not.

Treat the rate as a tailwind you did not earn and cannot bank. Budget on tokens-per-task and volume, the terms you actually move. When someone proposes adding 6,000 tokens of context "since tokens are cheap now," that sentence is the bug. Cheap per token, expensive per task, multiplied by a volume that is also rising.

The chart that lies by telling the truth

The 99% price-drop chart is not wrong. It is just answering a question nobody's bill is asking. Your invoice does not pay the 2023-to-2026 price of a frozen model. It pays this month's rate times this month's bloated prompts times this month's traffic, and two of those three are climbing.

Cheaper tokens were never going to save you money. They were always going to let you spend more while feeling thrifty.

FAQ

Q: If token prices dropped 99%, why is my AI bill higher?

Because price is one of three terms in your cost, and the other two grew faster. Effective token pricing fell only about 6% in 2026 even as headline per-model prices collapsed, because teams spend each cut on bigger context, reasoning tokens, and more requests. Your bill is rate times tokens-per-request times volume, and the last two compound while rate falls.

Q: What is effective token cost versus headline price?

Headline price is the rate card for a specific model, which falls dramatically over time. Effective token cost is what you actually pay per request or per task on your real traffic, factoring in how many tokens each call burns and how often you call. The headline can drop 70% while your effective cost per task rises, because prompts grew and volume increased.

Q: How do I actually reduce my AI bill in 2026?

Attack the terms you control. Enable prompt caching to cut the rate on repeated prefixes, trim context to reduce tokens per request, and turn off reasoning modes where they add no quality. Track effective cost per completed task over time. Chasing a cheaper model alone usually fails, because the savings get spent on more usage the cheaper price invites.

Q: Why do reasoning models make the bill worse?

Reasoning models emit many more output tokens than single-pass models, and output is billed at a multiple of input, often three to five times the input rate. So even at a lower headline input price, a reasoning model can cost more per task because it generates far more billable output. Measure cost per task with reasoning on versus off before defaulting it on.

Q: Is switching to a cheaper model a waste of time?

Not a waste, but rarely the biggest lever. The rate is already falling on its own, so a model switch captures a one-time gain that ongoing token-shape growth can erase within a quarter. Pair any model switch with caching and context discipline, and track effective cost per task, or the cheaper model just becomes a license to send bigger prompts.

Q: What single metric should I track for AI cost?

Effective cost per completed task, computed from your own input and output token counts times current rates. It captures all three cost terms in one number and exposes shape growth that the provider price sheet hides. When it rises release over release, the cause is almost always a prompt that grew or a mode left on, not the vendor changing prices.

Next Read

The fastest way to cut the rate side of the bill is the one most teams skip: see the prompt caching bill cut most teams never turn on.


Sources & Further Reading


Last updated: June 24, 2026