When to Use Claude vs ChatGPT vs Gemini: A Cost-Per-Task Guide
A practical 2026 cost-per-task breakdown of Claude, ChatGPT and Gemini: which model wins on coding, writing, research, vision, and bulk jobs.
Picking between Claude, ChatGPT and Gemini used to be a taste question. In 2026 it is a cost question. The frontier models from Anthropic, OpenAI and Google are close enough in raw capability that the smarter operators no longer ask “which one is best”. They ask “which one is best for this specific task at this specific price”.
This guide walks through the math task by task. No vendor cheerleading, no benchmark gymnastics. Just the numbers you need to decide which model touches which job.
The three families, in plain terms
Each provider ships a tiered lineup. The naming has gotten messy, so here is the only mental model you need.
- Flagship: maximum capability, highest cost. Claude Opus, GPT-5, Gemini 2.5 Pro.
- Workhorse: 80 to 90 percent of the flagship's capability at a fraction of the cost. Claude Sonnet, GPT-5 mini, Gemini 2.5 Flash.
- Cheap and fast: optimized for volume. Claude Haiku, GPT-5 nano, Gemini 2.5 Flash-Lite.
Roughly accurate public API pricing in 2026, per million tokens:
| Tier | Anthropic | OpenAI | Google |
|---|---|---|---|
| Flagship in/out | $15 / $75 | $10 / $40 | $1.25 (under 200k) / $10 |
| Workhorse in/out | $3 / $15 | $0.25 / $2 | $0.30 / $2.50 |
| Cheap in/out | $0.80 / $4 | $0.05 / $0.40 | $0.10 / $0.40 |
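Two things jump out. First, the spread between flagship and cheap is wide at every provider, but not equally wide: roughly 19x at Anthropic, 12x to 25x at Google, and 100x to 200x at OpenAI. Second, Gemini’s workhorse and cheap tiers are priced aggressively below Anthropic’s, by about 10x on both input and output. That is going to matter.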
How to think about cost per task, not cost per token
A “task” in this guide is one complete unit of work. A pull request review. A blog draft. A research dossier. A customer support reply. The cost is what you actually spend to finish it, including retries, tool calls, and context bloat.
Three numbers drive cost per task:
- Input tokens, dominated by context. A 50-page PDF plus instructions is roughly 30,000 input tokens.
- Output tokens, dominated by reasoning. A frontier model thinking out loud can output 5,000 to 20,000 tokens for a hard task.
- Number of turns. A one-shot task is one round. An agentic task can be 20.
Output tokens are usually where the bill explodes. They cost 4x to 8x more than input across every provider, and reasoning models in 2026 generate far more of them than the chat models of 2024 did.
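The three drivers above combine into a simple estimate. The sketch below is a rough illustration, not a billing tool: the `PRICES` keys and the `task_cost` function are hypothetical names, the rates are copied from the pricing table in this guide, and it assumes every turn sends and receives the same number of tokens.

```python
# USD per million tokens as (input, output), copied from the pricing table above.
# The dictionary keys are hypothetical labels, not official model IDs.
PRICES = {
    ("anthropic", "flagship"): (15.00, 75.00),
    ("anthropic", "workhorse"): (3.00, 15.00),
    ("anthropic", "cheap"): (0.80, 4.00),
    ("openai", "flagship"): (10.00, 40.00),
    ("openai", "workhorse"): (0.25, 2.00),
    ("openai", "cheap"): (0.05, 0.40),
    ("google", "flagship"): (1.25, 10.00),  # input rate for prompts under 200k tokens
    ("google", "workhorse"): (0.30, 2.50),
    ("google", "cheap"): (0.10, 0.40),
}


def task_cost(provider, tier, input_tokens, output_tokens, turns=1):
    """Estimate the USD cost of one task.

    Simplification: every turn is assumed to send `input_tokens` and
    receive `output_tokens`; retries and caching are not modeled.
    """
    price_in, price_out = PRICES[(provider, tier)]
    per_turn = (input_tokens * price_in + output_tokens * price_out) / 1_000_000
    return per_turn * turns


# A 30,000-token PDF handled in one shot with 5,000 output tokens.
one_shot = task_cost("anthropic", "workhorse", 30_000, 5_000)
# The same per-turn volume repeated across a 20-turn agentic run.
agentic = task_cost("anthropic", "workhorse", 30_000, 5_000, turns=20)
print(f"one-shot: ${one_shot:.3f}, 20 turns: ${agentic:.2f}")
```

In the one-shot case the output is about 14 percent of the tokens but roughly 45 percent of the cost, which is why output volume deserves the closest watch.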
Task 1: Coding
Coding is the highest-stakes head-to-head. If your model picks the wrong file or hallucinates an API, you waste hours, not cents.
The 2026 ranking, on real-world refactor and bugfix benchmarks (SWE-bench Verified, internal team replays), looks like this:
- Claude Opus / Sonnet: best at multi-file refactors, large codebase navigation, and following dense style guides. Sonnet specifically remains the price-to-performance leader.
- GPT-5: best at unfamiliar algorithms, math-heavy code, and tight greenfield problems. Slightly more verbose than Claude.
- Gemini 2.5 Pro: best when the codebase is enormous and you need the full 1M+ context window. Weaker than Claude on agentic edits.
Cost per coding task, rough averages:
| Task | Best fit | Typical cost |
|---|---|---|
| Quick bugfix, one file | GPT-5 mini or Sonnet | $0.02 to $0.10 |
| Multi-file refactor | Claude Sonnet | $0.15 to $0.60 |
| Big architectural change, 30+ files | Claude Opus or Gemini 2.5 Pro | $1 to $4 |
| Repo-wide audit, 200k+ tokens of context | Gemini 2.5 Pro | $0.30 to $1.50 |
If you live in Claude Code or Cursor all day, Sonnet is the default for a reason. It clears 80 percent of real work at one fifth the cost of Opus, and Opus only earns its price tag when the diff has to be right the first time.
Task 2: Long-form writing
Drafting articles, reports, and pitch decks.
- Claude writes with the most natural rhythm. Less repetition, fewer transition cliches, less “in conclusion” filler.
- ChatGPT is the most flexible across formats but skews toward listicle structure unless you push hard against it.
- Gemini is the most factual on news-adjacent topics because of search grounding, but its prose is the driest of the three.
Cost per 1,500-word draft, including 2 to 3 revision turns:
| Provider | Tier | Typical cost |
|---|---|---|
| Claude | Sonnet | $0.10 to $0.20 |
| OpenAI | GPT-5 mini | $0.02 to $0.05 |
| Google | Gemini 2.5 Flash | $0.01 to $0.03 |
If you write a lot and care about voice, Sonnet wins. If you write a lot and care about throughput, Gemini Flash is the rational choice. It is not as elegant, but at 1 to 3 cents per draft you can run five candidates and keep the best.
Task 3: Research and synthesis
Reading 20 PDFs and producing a 3-page brief.
Two factors decide this:
- How big is the corpus, in tokens?
- Do you need citations to specific source pages, or just a synthesis?
For corpora under 200k tokens, all three families handle it. For corpora above 500k tokens, Gemini 2.5 Pro is the only one that can keep the whole thing in context. Claude tops out at 200k for most accounts (1M is available but rarely allocated by default), and OpenAI’s frontier sits between 256k and 400k depending on the variant.
If you need real source citations, ground the call in a retrieval layer rather than trusting the model. All three providers will confidently misattribute under load.
Typical cost for a 60-document research brief:
| Approach | Cost |
|---|---|
| Gemini 2.5 Pro, single big-context call | $0.40 to $1.20 |
| Claude Sonnet with a retrieval shim | $0.20 to $0.80 |
| GPT-5 with chunked summarization | $0.30 to $1.00 |
For one-off research, the Gemini approach is the simplest and usually the cheapest. For repeated research at scale, build a retrieval layer once and use Sonnet or Flash for the synthesis step.
Task 4: Vision and document parsing
Reading invoices, screenshots, charts, handwritten notes.
- Claude is the most accurate on dense documents, tables, and structured forms.
- Gemini is the most accurate on natural images, charts, and video frames. It is also the cheapest by a wide margin.
- GPT-5 sits in the middle.
For a batch of 1,000 receipts:
| Provider | Tier | Typical cost |
|---|---|---|
| Claude | Haiku | $4 to $10 |
| OpenAI | GPT-5 nano | $0.50 to $2 |
| Google | Gemini 2.5 Flash | $0.30 to $1.50 |
Gemini Flash on vision is a steal. If you are doing OCR at any volume, default to it and only fall back to Claude on the documents that fail validation.
Task 5: Customer support and high-volume chat
Replying to 100,000 tickets a month.
This is where the cheap tiers earn their keep. At this volume, even a 2x difference in price shows up as real money.
For a 500-token-in, 200-token-out reply:
| Provider | Tier | Cost per reply | Cost per 100k replies |
|---|---|---|---|
| Anthropic | Haiku | $0.0012 | $120 |
| OpenAI | GPT-5 nano | $0.0001 | $10 |
| Google | Gemini 2.5 Flash-Lite | $0.00013 | $13 |
GPT-5 nano and Gemini Flash-Lite are roughly 9x to 12x cheaper than Haiku here. For most support flows the quality is indistinguishable. Reserve Haiku for the tickets that escalate.
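The per-reply figures follow from the cheap-tier prices listed earlier, and the "reserve Haiku for escalations" pattern can be priced the same way. The sketch below shows the unrounded arithmetic under two assumptions that are not from the text: a hypothetical 5 percent escalation rate, and an escalated ticket costing one extra Haiku reply of the same size. The function names are illustrative.

```python
def reply_cost(price_in, price_out, tokens_in=500, tokens_out=200):
    """USD cost of one reply, with prices given per million tokens."""
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000


def blended_monthly_cost(cheap, fallback, replies, escalation_rate):
    """Every ticket gets a cheap-tier reply; escalated tickets
    additionally get one reply from the fallback tier."""
    return replies * (cheap + escalation_rate * fallback)


HAIKU = reply_cost(0.80, 4.00)       # 0.0012
NANO = reply_cost(0.05, 0.40)        # 0.000105
FLASH_LITE = reply_cost(0.10, 0.40)  # 0.00013

# Hypothetical: 100,000 tickets a month, 5 percent escalate to Haiku.
monthly = blended_monthly_cost(NANO, HAIKU, 100_000, 0.05)
print(f"nano with Haiku fallback: ${monthly:.2f} per month")
```

Under these assumptions the blended bill is about $16.50 a month, against $120 for sending everything to Haiku.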
Task 6: Agentic workflows
This is the category that breaks naive cost estimates. An agent that runs 30 tool calls to complete a job will burn through context, retry on errors, and rack up output tokens with every reasoning pass.
Two rules:
- Use the flagship for the planner, the workhorse for the workers. Opus or GPT-5 picks the plan. Sonnet or GPT-5 mini executes the steps. This usually cuts the bill by 60 to 80 percent.
- Cache aggressively. Anthropic and OpenAI both offer prompt caching at a 75 to 90 percent discount on cached input tokens. If your system prompt is 8,000 tokens and you run 50 agent steps, that is 400,000 tokens you should not be paying full price for.
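To put a number on the caching rule, here is a minimal sketch of the system-prompt arithmetic. It assumes the Sonnet input price of $3 per million tokens from the pricing table and the upper end of the discount range (90 percent); the function name is hypothetical, and real bills also depend on cache-write surcharges and cache expiry, which are not modeled here.

```python
def system_prompt_cost(prompt_tokens, steps, price_in_per_m, cached_discount=0.0):
    """USD spent re-sending a system prompt across an agent run.

    The first step pays the full input price; every later step pays
    the discounted cached rate (no discount means no caching).
    """
    full = prompt_tokens * price_in_per_m / 1_000_000
    cached = full * (1 - cached_discount)
    return full + (steps - 1) * cached


# 8,000-token system prompt, 50 agent steps, $3 per million input tokens.
uncached = system_prompt_cost(8_000, 50, 3.00)
with_cache = system_prompt_cost(8_000, 50, 3.00, cached_discount=0.90)
print(f"uncached: ${uncached:.2f}, cached: ${with_cache:.2f}")
```

With these inputs the 400,000 re-sent tokens cost about $1.20 uncached and about $0.14 with caching.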
A medium-complexity agent run (10 steps, 50k tokens of total context) costs roughly:
| Setup | Cost |
|---|---|
| All Opus | $3 to $8 |
| Opus planner + Sonnet workers | $0.80 to $2 |
| All Sonnet | $0.40 to $1.20 |
| All GPT-5 mini | $0.10 to $0.40 |
If your agent is reliable enough on GPT-5 mini or Sonnet, do not pay for Opus. If it is not reliable, fix the prompts before throwing money at it.
A decision tree you can actually use
When a new task lands on your desk:
- Is it volume-bound (1,000+ runs per day)? Pick Gemini Flash or GPT-5 nano. Test quality on 50 samples. Done.
- Is it quality-bound and one-shot (a contract, a launch post, a board memo)? Pick the flagship of whichever provider you trust most. The cost is a rounding error compared to the consequences.
- Is it coding? Sonnet by default. Opus when correctness is non-negotiable. Gemini 2.5 Pro when you need 500k+ tokens of context.
- Is it research with a giant corpus? Gemini 2.5 Pro.
- Is it vision at volume? Gemini Flash.
- Is it agentic? Mixed-model: flagship planner, workhorse workers, cache the system prompt.
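The checklist above can be written down as a small routing function, which makes it easy to apply consistently across a team. This is a sketch, not a definitive router: the `Task` fields and `pick_model` are hypothetical names, the 500k-token threshold for a "giant corpus" is borrowed from the research section, and the final fallback to the workhorse tier is an assumption, since the list names no default.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task descriptor; the field names are illustrative."""
    kind: str = "general"  # "coding", "research", "vision", "agentic", "general"
    runs_per_day: int = 1
    context_tokens: int = 0
    high_stakes_one_shot: bool = False
    correctness_critical: bool = False


def pick_model(task: Task) -> str:
    """Apply the checklist in order and return a recommendation."""
    volume_bound = task.runs_per_day >= 1000
    if volume_bound and task.kind == "vision":
        return "Gemini Flash"
    if volume_bound:
        return "Gemini Flash or GPT-5 nano"
    if task.high_stakes_one_shot:
        return "flagship of your most trusted provider"
    if task.kind == "coding":
        if task.context_tokens >= 500_000:
            return "Gemini 2.5 Pro"
        return "Claude Opus" if task.correctness_critical else "Claude Sonnet"
    if task.kind == "research" and task.context_tokens >= 500_000:
        return "Gemini 2.5 Pro"
    if task.kind == "agentic":
        return "flagship planner + workhorse workers, cached system prompt"
    # Assumption: the checklist names no default, so fall back to the workhorse tier.
    return "workhorse tier"


print(pick_model(Task(kind="coding", context_tokens=600_000)))
```

Whatever the function returns for a volume-bound task, the 50-sample quality check from the first bullet still applies before rollout.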
Where this gets expensive without you noticing
Three patterns burn through budget quietly:
- Reasoning overhead. Modern frontier models output far more thinking tokens than they used to. A “simple” question can cost 50 cents on Opus if reasoning is set to high. Default to medium or low and turn it up only when you need it.
- Context bloat. Re-sending an entire conversation on every turn is the default in most SDKs. By turn 20, half your bill is recycled context. Trim, summarize, or cache.
- Wrong-tier overkill. Most people pick the flagship because it is the safe choice. For 70 percent of real tasks, the workhorse is indistinguishable in output quality. Run a one-week experiment swapping flagship for workhorse and measure complaint rate. The savings are almost always worth it.
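The context-bloat pattern is easiest to see as arithmetic. The sketch below counts input tokens only, under the simplifying assumption that every turn adds the same number of tokens to the history and the whole history is re-sent each time; the function name and the 500-token turn size are hypothetical. How much of the bill this represents depends on output volume and prices.

```python
def resent_input_tokens(turns, tokens_per_turn):
    """Return (total_input_tokens, recycled_tokens) for a conversation
    where the full history is re-sent on every turn."""
    # Turn k sends k turns' worth of history, so the total grows quadratically.
    total = sum(k * tokens_per_turn for k in range(1, turns + 1))
    fresh = turns * tokens_per_turn
    return total, total - fresh


total, recycled = resent_input_tokens(turns=20, tokens_per_turn=500)
print(f"{total:,} input tokens billed, {recycled:,} of them recycled")
```

Here 20 turns of 500 tokens each bill 105,000 input tokens for only 10,000 tokens of new content, which is the case for trimming, summarizing, or caching.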
Where tokenkarma fits
Picking the right model is half the job. Knowing what you actually spent is the other half. Every provider’s billing dashboard tells you the total at the end of the month, but none of them tells you that 60 percent of your spend went to one agentic workflow that should have run on Sonnet, not Opus.
That is the gap tokenkarma closes. It pulls usage from Claude, ChatGPT and Gemini into one view, breaks spend down by workflow and model, and surfaces the tasks where you are paying flagship prices for workhorse-grade work. The model-picking framework above gets sharper when you can see the bill it produces.
Bottom line
Treat the three providers as a portfolio, not a religion.
- Claude Sonnet: default for coding, default for writing.
- Claude Opus: when correctness is critical.
- GPT-5: math-heavy reasoning, unfamiliar problems.
- GPT-5 mini / nano: cheapest path for high-volume English text tasks.
- Gemini 2.5 Pro: giant-context research and codebase audits.
- Gemini 2.5 Flash / Flash-Lite: vision at volume and any task where the cost is the bottleneck.
The right answer to “Claude or ChatGPT or Gemini” in 2026 is “yes”. The discipline is knowing which one belongs on which task, and not paying flagship rates for workhorse work.