Understanding AI Token Pricing: What You Actually Pay Per Query

How AI token pricing really works in 2026: input vs output, cached reads, reasoning tokens, vision, and what a single query actually costs you.

You ask Claude a question. You get an answer. Somewhere in the background, a meter ticks. If you pay through a subscription, that meter is hidden behind a “fair use” curtain. If you pay through the API, the meter is the bill. Either way, the unit of currency is the token, and almost nobody who uses these tools day to day actually understands what a token costs them per query.

This guide fixes that. We will walk through what a token is, why input and output cost different amounts, how prompt caching, reasoning tokens, and vision inputs change the math, and what a single real-world query actually costs across the major providers in 2026. By the end you will be able to look at any chat or API call and estimate its price within a cent.

What is a token, really

A token is roughly three quarters of an English word. The exact ratio depends on the tokenizer, but the rule of thumb that holds across Claude, GPT, and Gemini is:

  • 1 token is about 4 characters of English text.
  • 100 tokens is about 75 words.
  • 1,000 tokens is about 750 words, or one and a half pages of single-spaced text.

Code, numbers, and non-English languages tokenize less efficiently. A Python function with lots of underscores and short variable names can run 20 to 30 percent more tokens than the equivalent prose. Japanese or Chinese text often runs 2x to 3x more tokens per visible character than English.
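The rules of thumb above can be turned into a rough estimator. This is a minimal sketch, not a real tokenizer: the function names and the `overhead` multiplier are illustrative assumptions, and the ratios are just the approximations quoted in this section.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb for English text
WORDS_PER_TOKEN = 0.75   # rule of thumb for English text


def tokens_from_words(word_count: int) -> int:
    """Rough token count for a given number of English words."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


def tokens_from_text(text: str, overhead: float = 1.0) -> int:
    """Rough token count from character length.

    `overhead` is a hypothetical multiplier for content that tokenizes
    less efficiently, e.g. about 1.25 for code or 2 to 3 for Japanese
    or Chinese text.
    """
    return math.ceil(len(text) / CHARS_PER_TOKEN * overhead)


print(tokens_from_words(750))                # about 1,000 tokens
print(tokens_from_text("x" * 4000))          # about 1,000 tokens
print(tokens_from_text("x" * 4000, 1.25))    # same length, code-like content
```

For exact counts you would need the provider's own tokenizer; this only gets you into the right order of magnitude.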

Why does the unit matter? Because every model bills by tokens, not words, not characters, not queries. When a provider quotes “$3 per million tokens”, they are quoting the price of pushing about 750,000 English words through their pipes. That sounds cheap until you remember that a single coding session can burn through 50,000 tokens in twenty minutes.

Input tokens versus output tokens

The first thing that confuses newcomers: input and output are not priced the same. Output tokens cost more than input tokens, 5x to 8x more at the list prices below.

Why the asymmetry? Generation is sequential. To produce 1,000 output tokens, the model runs 1,000 forward passes, one for each token, each one re-attending over the entire context so far. Reading the input is also expensive, but it happens once. Output is where the GPU sweats.

Approximate 2026 list prices per million tokens, rounded for arithmetic:

  • Claude Sonnet (mid-tier): $3 input, $15 output.
  • Claude Opus (flagship): $15 input, $75 output.
  • GPT-5 (flagship): $1.25 input, $10 output.
  • GPT-5 mini: $0.25 input, $2 output.
  • Gemini 2.5 Pro: $1.25 input, $10 output (under 200K context window).
  • Gemini 2.5 Flash: $0.30 input, $2.50 output.

Two things jump out. First, the spread between the cheapest and the most expensive model is roughly 60x on input ($0.25 versus $15) and almost 40x on output ($2 versus $75). Second, output is always more painful than input for the same model. Anything you can do to make answers shorter saves real money.

A practical consequence: if you are running a workflow where you stuff in a 50,000-token document and ask for a 200-token summary, you pay almost entirely for the input. If you are generating a 5,000-token essay from a 100-token prompt, you pay almost entirely for the output. The shape of the query, not just the size, determines the bill.
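To make the "shape of the query" point concrete, here is a small sketch that splits a call's cost into its input and output parts. The `PRICES` table simply restates the approximate list prices above, and `query_cost` is an illustrative helper, not any provider's API.

```python
# Approximate list prices from this article, USD per million tokens.
PRICES = {
    "Claude Sonnet": (3.00, 15.00),
    "Claude Opus": (15.00, 75.00),
    "GPT-5": (1.25, 10.00),
    "GPT-5 mini": (0.25, 2.00),
    "Gemini 2.5 Pro": (1.25, 10.00),
    "Gemini 2.5 Flash": (0.30, 2.50),
}


def query_cost(model: str, input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (input_cost, output_cost) in USD for one call."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price / 1_000_000,
            output_tokens * out_price / 1_000_000)


# Two query shapes on the same model: long document in, short summary out,
# versus short prompt in, long essay out.
for label, n_in, n_out in [("summary", 50_000, 200), ("essay", 100, 5_000)]:
    cost_in, cost_out = query_cost("Claude Sonnet", n_in, n_out)
    share = cost_in / (cost_in + cost_out)
    print(f"{label}: ${cost_in + cost_out:.4f} total, {share:.0%} from input")
```

On these numbers the summary is about 98 percent input cost, while the essay is almost entirely output cost.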

What a single query actually costs

Let us walk through five realistic queries and compute the actual price on each model, rounding to the nearest tenth of a cent, or finer where the amounts are tiny.

Query 1: “Explain quantum entanglement in two paragraphs”

  • Input: about 10 tokens.
  • Output: about 300 tokens (two short paragraphs).

| Model | Cost |
| --- | --- |
| GPT-5 mini | $0.0006 |
| Gemini 2.5 Flash | $0.0008 |
| GPT-5 | $0.003 |
| Claude Sonnet | $0.0045 |
| Claude Opus | $0.023 |

Less than a tenth of a cent on the cheap models, just over two cents on the flagship. The 40x spread for a question this simple is the whole reason people care about routing.

Query 2: Reviewing a 5,000-line code file for bugs

  • Input: about 45,000 tokens (the file plus the system prompt and instructions).
  • Output: about 800 tokens (a list of findings).

| Model | Cost |
| --- | --- |
| GPT-5 mini | $0.013 |
| Gemini 2.5 Flash | $0.016 |
| GPT-5 | $0.064 |
| Claude Sonnet | $0.147 |
| Claude Opus | $0.735 |

Now we are talking real money per query. Running this review 10 times a day on Opus is $7.35 per day, $220 per month, just for one task. This is exactly where heavy users start questioning whether the flagship is worth it.
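The projection above is simple multiplication, sketched here for all five models. The per-query figures are the ones from the code review table; the 30-day month and the `monthly_cost` helper are assumptions for illustration.

```python
# Per-query cost of the code review example, USD, from the table above.
REVIEW_COST = {
    "GPT-5 mini": 0.013,
    "Gemini 2.5 Flash": 0.016,
    "GPT-5": 0.064,
    "Claude Sonnet": 0.147,
    "Claude Opus": 0.735,
}


def monthly_cost(cost_per_query: float, runs_per_day: int, days: int = 30) -> float:
    """Project a per-query cost to a monthly figure (assumes a 30-day month)."""
    return cost_per_query * runs_per_day * days


for model, cost in REVIEW_COST.items():
    print(f"{model}: ${monthly_cost(cost, runs_per_day=10):.2f} per month")
```

The same ten daily reviews that cost about $220 a month on Opus come to roughly $4 a month on GPT-5 mini.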

Query 3: Summarizing a 200-page PDF

  • Input: about 180,000 tokens.
  • Output: about 1,500 tokens.

| Model | Cost |
| --- | --- |
| GPT-5 mini | $0.048 |
| Gemini 2.5 Flash | $0.058 |
| GPT-5 | $0.240 |
| Claude Sonnet | $0.563 |
| Claude Opus | $2.812 |

A 200-page PDF on Opus costs almost three dollars per pass. Most workflows pass the same document through the model multiple times during refinement. The math gets ugly fast.

Query 4: A short chat reply (“rephrase this email to sound less aggressive”)

  • Input: about 200 tokens (the email plus the ask).
  • Output: about 200 tokens (the rewritten email).

| Model | Cost |
| --- | --- |
| GPT-5 mini | $0.0005 |
| Gemini 2.5 Flash | $0.0006 |
| GPT-5 | $0.0023 |
| Claude Sonnet | $0.0036 |
| Claude Opus | $0.018 |

Negligible. Most chat-style use stays in this zone, which is why subscription plans can profitably let you ramble all day. The provider is betting most of your queries are query 1 or query 4, not query 2 or 3.

Query 5: Generating a 4,000-word blog post from a brief

  • Input: about 500 tokens.
  • Output: about 5,500 tokens.

| Model | Cost |
| --- | --- |
| GPT-5 mini | $0.011 |
| Gemini 2.5 Flash | $0.014 |
| GPT-5 | $0.056 |
| Claude Sonnet | $0.084 |
| Claude Opus | $0.420 |

Notice how the output dominates here. Doubling the output length almost doubles the cost, and halving it almost halves the cost. This is the lever that bulk content workflows pull first.
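All five cost tables follow from the same formula, so they can be regenerated in a few lines. This is a sketch using the approximate list prices and token counts from this section; the query labels and the `cost` helper are made up for illustration.

```python
# USD per million tokens as (input, output), from the list prices above.
PRICES = {
    "GPT-5 mini": (0.25, 2.00),
    "Gemini 2.5 Flash": (0.30, 2.50),
    "GPT-5": (1.25, 10.00),
    "Claude Sonnet": (3.00, 15.00),
    "Claude Opus": (15.00, 75.00),
}

# (input tokens, output tokens) for the five example queries.
QUERIES = {
    "1 explain": (10, 300),
    "2 code review": (45_000, 800),
    "3 PDF summary": (180_000, 1_500),
    "4 email rewrite": (200, 200),
    "5 blog post": (500, 5_500),
}


def cost(model: str, n_in: int, n_out: int) -> float:
    """Total USD cost of one call at list price."""
    p_in, p_out = PRICES[model]
    return (n_in * p_in + n_out * p_out) / 1_000_000


for name, (n_in, n_out) in QUERIES.items():
    row = ", ".join(f"{m} ${cost(m, n_in, n_out):.4f}" for m in PRICES)
    print(f"Query {name}: {row}")

# Output-length sensitivity for the blog post on Claude Sonnet.
base = cost("Claude Sonnet", 500, 5_500)
doubled = cost("Claude Sonnet", 500, 11_000)
print(f"Doubling the output multiplies the cost by {doubled / base:.2f}")
```

The last line comes out just under 2 because the 500 input tokens are a fixed cost that does not grow with the answer.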

Prompt caching: the hidden discount

In 2024 all three providers shipped some flavor of prompt caching. The idea: if you keep sending the same long system prompt, document, or codebase to the model over and over, the provider will store the intermediate state and let you re-read it at a steep discount.

The numbers in 2026 look roughly like this:

  • Cached input on Claude: 10 percent of normal input price. So Sonnet drops from $3 to $0.30 per million cached tokens.
  • Cached input on GPT-5: 25 percent of normal input price.
  • Cached input on Gemini: 25 percent of normal input price, but only for cache reads after the first 1,024 tokens.

There is usually a small write-time fee the first time you populate the cache, but it amortizes after two or three reads. For agentic workflows that re-send the same 50K-token system context to the model dozens of times per session, prompt caching can cut bills by 70 to 90 percent. If you are not using it, you are leaving money on the table.
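A rough way to see where the savings come from is to price a session with and without the cache. This sketch makes several assumptions that are not from any provider's documentation: the first call pays a write premium of 1.25x the normal input price, every later call lands while the cache is still warm, and only the repeated prefix is counted. `session_input_cost` is a hypothetical helper.

```python
def session_input_cost(prefix_tokens: int, calls: int, input_price: float,
                       cached_ratio: float, write_multiplier: float = 1.25):
    """Return (uncached_cost, cached_cost) in USD for re-sending one prefix.

    input_price is USD per million tokens; cached_ratio is the cached-read
    price as a fraction of it (0.10 for the Claude figure quoted above).
    write_multiplier is an assumed premium on the first, cache-writing call.
    """
    per_call = prefix_tokens * input_price / 1_000_000
    uncached = per_call * calls
    cached = per_call * write_multiplier + per_call * cached_ratio * (calls - 1)
    return uncached, cached


# A 50K-token system context sent 30 times in one session on Claude Sonnet.
uncached, cached = session_input_cost(50_000, 30, input_price=3.00, cached_ratio=0.10)
saving = 1 - cached / uncached
print(f"without cache: ${uncached:.2f}, with cache: ${cached:.2f}, saved {saving:.0%}")
```

Under these assumptions the input bill drops from $4.50 to about $0.62, a saving of roughly 86 percent. Output tokens are not discounted, so the saving on the whole bill is smaller.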

The catch: caches expire. Claude caches live for 5 minutes by default, with an option for 1 hour at a higher write price. OpenAI’s cache is “best effort” with a shorter lifetime. Build your workflow to keep cached prefixes warm, or pay full price.

Reasoning tokens, the invisible meter

Starting in late 2024 with OpenAI’s o1 and continuing through Claude’s extended thinking and Gemini’s “deep think” modes, models began billing for internal reasoning that you never see. The model thinks for thousands of tokens before producing the visible answer. Those thinking tokens are billed at output rates.

In 2026, on a hard reasoning query (math proof, complex coding task, multi-hop research), reasoning tokens commonly outnumber visible output by 5x to 20x. A reply that shows you 500 tokens of answer might have burned 7,000 tokens of hidden reasoning. At Opus output rates of $75 per million, that is about $0.53 of invisible tokens behind a 500-token visible reply.
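The arithmetic behind that example, as a short sketch. It assumes reasoning tokens are billed at the same output rate as visible tokens, as described above; `reply_cost` is an illustrative helper.

```python
def reply_cost(visible_tokens: int, reasoning_tokens: int, output_price: float):
    """Return (visible_cost, hidden_cost) in USD.

    output_price is USD per million output tokens; reasoning tokens are
    assumed to be billed at that same rate.
    """
    visible = visible_tokens * output_price / 1_000_000
    hidden = reasoning_tokens * output_price / 1_000_000
    return visible, hidden


visible, hidden = reply_cost(500, 7_000, output_price=75.00)
print(f"visible answer: ${visible:.4f}")
print(f"hidden reasoning: ${hidden:.4f}")
print(f"reasoning share of output bill: {hidden / (visible + hidden):.0%}")
```

With 7,000 reasoning tokens at $75 per million the hidden part is $0.525, against under four cents for the answer you actually read.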

If you use the reasoning modes, track the reasoning token count in the usage block of the API response. Where the provider exposes it (OpenAI, for example, reports a reasoning_tokens figure as a breakdown of its output token count), log it alongside output_tokens. Tools like tokenkarma surface this number alongside visible output so you can see when your bill is being driven by thinking you cannot read.

The flip side: for non-reasoning tasks (writing, simple lookup, translation), disable extended thinking. It is pure cost with no benefit.

Vision and audio tokens

Images and audio do not flow into the model as pixels or waveforms. They are encoded into tokens, just at a different conversion rate.

Approximate 2026 conversions:

  • Image at standard resolution: 250 to 1,200 input tokens per image, depending on size and provider.
  • High-detail image: 1,500 to 3,000 input tokens per image.
  • Audio: about 32 tokens per second on most providers, so 1 minute of speech is about 1,900 tokens.

A query like “describe what is in this photo” looks tiny in the chat UI but can push up to 3,000 input tokens through the model. On Claude Sonnet, a single high-detail image costs roughly half a cent to a cent of input alone, before the model says a word.

For video, providers either sample frames (Gemini) or require pre-processing. A 1-minute video clip on Gemini 2.5 Pro runs about 17,000 tokens. A multi-minute video Q&A session can rack up serious tokens.
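Once media is converted to tokens, the cost math is the same as for text. The sketch below uses the approximate conversion rates quoted in this section and assumes media tokens are billed at the model's normal text input price, which may not hold for every provider or modality.

```python
AUDIO_TOKENS_PER_SECOND = 32  # approximate figure quoted above


def input_cost(tokens: float, price_per_million: float) -> float:
    """USD cost of a given number of input tokens."""
    return tokens * price_per_million / 1_000_000


# High-detail image (1,500 to 3,000 tokens) on Claude Sonnet at $3 per million.
image_low = input_cost(1_500, 3.00)
image_high = input_cost(3_000, 3.00)
print(f"high-detail image: ${image_low:.4f} to ${image_high:.4f}")

# One minute of speech, priced at Gemini 2.5 Pro's $1.25 text input rate
# (an assumption; audio may be billed at a different rate).
audio_tokens = 60 * AUDIO_TOKENS_PER_SECOND
print(f"1 min audio: {audio_tokens} tokens, ${input_cost(audio_tokens, 1.25):.4f}")

# One-minute video clip on Gemini 2.5 Pro, about 17,000 tokens.
print(f"1 min video: ${input_cost(17_000, 1.25):.4f}")
```

None of these is large for a single query, but a session that attaches an image or clip to every turn multiplies them by the number of turns.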

Where API math meets subscription math

Most people do not pay token-by-token. They pay a flat monthly fee for ChatGPT Plus, Claude Pro, or Gemini Advanced. So why does any of this matter?

Because the subscription is just a token bucket with a paywall on top. Claude Pro in 2026 gives you roughly enough tokens per 5-hour window to do about 45 to 200 messages, depending on model and context size. ChatGPT Plus gives you a similar bucket on GPT-5. The plans are sized so that average users never feel the limit and heavy users hit it constantly.

The math the providers do: average user generates 20 short queries per day on a cheap model. Subscription gross margin is 80 percent. Heavy user generates 500 long queries per day on the flagship. Subscription is loss-making for them by a factor of 3 to 5. The provider tolerates the loss because the average pays for the heavy.

What this means for you: if your token bill, computed at API prices, exceeds your subscription price by 2x or more, you are probably unprofitable for the provider, a free-rider subsidized by lighter users. If it exceeds it by 10x or more, you are the user the rate limits are designed to clip. Knowing where you sit tells you whether to lean harder on the subscription, switch to API billing, or split your work across providers.
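The 2x and 10x thresholds can be expressed as a simple check. This is a sketch: the labels are my own shorthand, the $20 monthly price is an assumed figure for a consumer plan, and the API-equivalent cost has to come from your own usage estimate.

```python
def subscription_position(api_equivalent_usd: float, subscription_usd: float):
    """Return (ratio, label) comparing API-priced usage to the plan price."""
    ratio = api_equivalent_usd / subscription_usd
    if ratio >= 10:
        label = "likely rate-limit target"
    elif ratio >= 2:
        label = "heavy user, subsidized by lighter users"
    else:
        label = "typical user"
    return ratio, label


PLAN_PRICE = 20.00  # assumed monthly subscription price in USD

for monthly_usage in (8.00, 45.00, 220.50):
    ratio, label = subscription_position(monthly_usage, PLAN_PRICE)
    print(f"${monthly_usage:.2f} of usage is {ratio:.1f}x the plan price: {label}")
```

The $220.50 case is the daily Opus code review from earlier, which on its own would put you well past the 10x line.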

This is the part most cost-tracking guides skip. The token math is not just about API spend. It is about understanding which side of the subscription’s break-even line you are on, and whether the next nerf is aimed at you.

How to estimate any query in your head

Once you have done the math a few times, you can eyeball most queries to within a cent. Here is the cheat sheet:

  1. Estimate the input in words, divide by 0.75, round to the nearest thousand tokens.
  2. Estimate the output the same way.
  3. Multiply input tokens by the input price per million, divide by 1,000,000. Same for output.
  4. If reasoning is on, multiply output by 5 to 10 for a rough total.
  5. If caching applies, divide input by 4 (GPT-5) or 10 (Claude).

For a Claude Sonnet query with 4,000 input tokens and 800 output tokens with no reasoning: (4,000 / 1M) x $3 plus (800 / 1M) x $15 equals $0.012 plus $0.012 equals $0.024. About two and a half cents. The estimate is usually within 20 percent of the actual bill.
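The cheat sheet translates directly into a small function. This is a sketch of the five steps, with the rounding step skipped because a computer does not need it; the parameter names are illustrative, and the reasoning multiplier and cache divisor are the rough factors from the list above.

```python
def estimate_query_cost(input_words: float, output_words: float,
                        input_price: float, output_price: float,
                        reasoning_multiplier: float = 1.0,
                        cache_divisor: float = 1.0) -> float:
    """Rough USD cost of one query, following the cheat sheet.

    Prices are USD per million tokens. Use reasoning_multiplier of 5 to 10
    when reasoning is on, and cache_divisor of 4 or 10 when the input is
    served from cache.
    """
    input_tokens = input_words / 0.75    # step 1: words to tokens
    output_tokens = output_words / 0.75  # step 2: same for output
    input_cost = input_tokens * input_price / 1_000_000    # step 3
    output_cost = output_tokens * output_price / 1_000_000
    output_cost *= reasoning_multiplier  # step 4
    input_cost /= cache_divisor          # step 5
    return input_cost + output_cost


# 3,000 words in and 600 words out is about 4,000 and 800 tokens.
print(f"${estimate_query_cost(3_000, 600, 3.00, 15.00):.3f}")
# Same query with reasoning on (x5) and the input read from a Claude cache.
print(f"${estimate_query_cost(3_000, 600, 3.00, 15.00, 5, 10):.3f}")
```

The first call reproduces the $0.024 worked example; the second shows how reasoning can outweigh everything a cache saves.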

Tracking the meter

Eyeballing is good enough for casual use. For anything serious (running agents, building products, hitting API limits), you need actual instrumentation. Three things to log on every call:

  • input_tokens, output_tokens, cached_input_tokens, and reasoning_tokens separately.
  • The model name and timestamp, so you can correlate cost spikes to behavior changes.
  • The downstream task tag, so you can answer “which feature in my app burned the budget this month”.

Most providers give you the first piece in the response. The other two are on you. This is the gap tokenkarma fills: unified per-call accounting across Claude, OpenAI, and Gemini, with the cached and reasoning fields broken out so you can see what is actually driving the bill rather than guessing from monthly summaries.
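If you want to roll your own logging, a per-call record plus a group-by is enough to start. This is a minimal sketch, not tokenkarma's schema or any provider's response format: the `CallRecord` fields mirror the list above, and it assumes `input_tokens` includes the cached portion while `reasoning_tokens` is counted separately from `output_tokens` and billed at the output rate. Check how your provider actually reports these before relying on it.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CallRecord:
    model: str
    task: str                      # downstream task tag, chosen by you
    input_tokens: int              # assumed to include cached tokens
    output_tokens: int             # assumed to exclude reasoning tokens
    cached_input_tokens: int = 0
    reasoning_tokens: int = 0
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def cost_by_task(records, prices):
    """Sum USD cost per task tag.

    prices maps model name to (input, output, cached_input) in USD per
    million tokens.
    """
    totals = defaultdict(float)
    for r in records:
        in_price, out_price, cached_price = prices[r.model]
        uncached = r.input_tokens - r.cached_input_tokens
        micro_usd = (uncached * in_price
                     + r.cached_input_tokens * cached_price
                     + (r.output_tokens + r.reasoning_tokens) * out_price)
        totals[r.task] += micro_usd / 1_000_000
    return dict(totals)


PRICES = {"Claude Sonnet": (3.00, 15.00, 0.30)}
log = [
    CallRecord("Claude Sonnet", "code-review", 45_000, 800, cached_input_tokens=40_000),
    CallRecord("Claude Sonnet", "email", 200, 200),
]
for task, usd in cost_by_task(log, PRICES).items():
    print(f"{task}: ${usd:.4f}")
```

Keeping the four token counts as separate fields is what lets you answer later whether a cost spike came from longer inputs, cache misses, or hidden reasoning.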

The point is not to obsess over every cent. The point is to know the cost of a query before you build a workflow that runs it ten thousand times.

The takeaway

Token pricing is not complicated. It is just under-explained. Three things to remember:

  • Output costs 5x to 8x more than input at current list prices. Shorter answers are cheaper, full stop.
  • Reasoning tokens and image tokens add to the bill silently. Track them or get surprised.
  • Prompt caching cuts repeat-prefix workflows by 70 to 90 percent. Use it.

Do the math once on the queries you actually run. The number will be smaller than you fear for chat, and larger than you fear for agents. Either way, knowing the number is the difference between a tool you control and a meter you watch in horror at the end of the month.