LLM Token Cost & TCO Calculator
Headline pricing pages tell you the rate per million tokens — they don't tell you what your feature will actually cost once you multiply by real traffic, output length and retries. This calculator turns the numbers you can measure (requests, tokens, cache rate) into a monthly bill and a true total cost of ownership, then ranks GPT-5, Claude, Gemini and DeepSeek side by side so you can pick the cheapest model that meets your quality bar.
- Cost = (input tokens × input rate) + (output tokens × output rate), summed across monthly requests.
- Output tokens usually dominate the bill — they're priced 2–8× higher than input.
- Prompt caching can discount repeated context by ~90%, which matters most for agents and RAG.
- TCO adds retries, observability and engineering overhead on top of raw token spend.
Model your monthly LLM cost
| Model | Input $/1M | Output $/1M |
|---|---|---|
Estimated cost
| Model | Tokens/mo | Input $/mo | Output $/mo | Cost/request | Total $/mo | Annual TCO |
|---|---|---|---|---|---|---|
How the calculation works
Every provider bills two separate streams: the tokens you send in (your prompt, system message, retrieved context and conversation history) and the tokens the model generates back. The calculator multiplies your monthly request volume by the average tokens in each stream, converts to millions of tokens, and applies each model's input and output rate. Output is almost always the expensive half, so a feature that returns long answers will cost more than its prompt size suggests.
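The arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual source: the `ModelRate` type, the function name and the example prices are all hypothetical placeholders, so substitute the current rates from your provider's pricing page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRate:
    """Per-million-token prices in USD (placeholder values, not real quotes)."""
    name: str
    input_per_million: float
    output_per_million: float


def monthly_token_cost(rate, requests_per_month, avg_input_tokens, avg_output_tokens):
    """Return the input, output, total and per-request cost for one month."""
    # Convert each stream to millions of tokens per month.
    input_millions = requests_per_month * avg_input_tokens / 1_000_000
    output_millions = requests_per_month * avg_output_tokens / 1_000_000

    # Each stream is billed at its own rate.
    input_cost = input_millions * rate.input_per_million
    output_cost = output_millions * rate.output_per_million
    total = input_cost + output_cost

    per_request = total / requests_per_month if requests_per_month else 0.0
    return {
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total": total,
        "cost_per_request": per_request,
    }


# Hypothetical model priced at $1.00 input and $4.00 output per 1M tokens.
example_rate = ModelRate("example-model", 1.00, 4.00)
breakdown = monthly_token_cost(example_rate, 100_000, 1_500, 500)
print(breakdown)
```

With these made-up inputs the prompt is three times longer than the answer, yet output still accounts for $200 of the $350 monthly total, which is the pattern the paragraph above describes.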
Two adjustments turn a raw estimate into something closer to your real invoice. Prompt caching bills repeated input context at a fraction of the standard rate: set your realistic cache hit rate and watch the input column drop, which is where agents and RAG pipelines with large fixed system prompts save the most. The retry and overhead percentage inflates the total to account for failed calls, fallback model hops and evaluation runs, and the fixed monthly figure adds the flat platform costs (observability, a model gateway, ops time) that headline pricing never includes. Together those give you an annual total cost of ownership rather than a best-case token bill.
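One way to combine these adjustments is sketched below. The order of operations is an assumption (cache discount applied to input spend first, then the overhead percentage on token spend, then the fixed monthly cost), and the function name, parameters and the 90% default discount are illustrative rather than taken from the calculator itself.

```python
def annual_tco(
    input_cost,
    output_cost,
    cache_hit_rate,
    cache_discount=0.90,
    overhead_pct=0.0,
    fixed_monthly=0.0,
):
    """Estimate annual TCO in USD from one month of raw token spend.

    input_cost / output_cost: monthly spend at standard rates.
    cache_hit_rate: fraction of input tokens served from cache (0..1).
    cache_discount: assumed discount on cached input (0.90 = 90% off).
    overhead_pct: retry and overhead uplift as a fraction (0.10 = 10%).
    fixed_monthly: flat platform costs such as observability or a gateway.
    """
    # Cached input is billed at the discounted rate, the rest at full price.
    cached_input = input_cost * cache_hit_rate * (1 - cache_discount)
    uncached_input = input_cost * (1 - cache_hit_rate)

    # Caching leaves output spend untouched.
    token_spend = cached_input + uncached_input + output_cost

    # Assumption: overhead scales with token spend, fixed costs do not.
    monthly = token_spend * (1 + overhead_pct) + fixed_monthly
    return monthly * 12


# Hypothetical month: $150 input, $200 output, half the input cached,
# 10% retry/overhead buffer and $500 of fixed platform cost.
print(round(annual_tco(150.0, 200.0, 0.5, overhead_pct=0.10, fixed_monthly=500.0), 2))
```

Keeping the fixed cost outside the overhead multiplier reflects the idea that a gateway or observability subscription does not grow when calls are retried; if your platform costs are usage-based, fold them into the percentage instead.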
Frequently Asked Questions (FAQ)
LLM cost is input tokens multiplied by the input rate plus output tokens multiplied by the output rate, with both rates quoted per million tokens. For each stream, monthly cost equals requests times average tokens per request, divided by one million, multiplied by that stream's rate.
Output tokens are generated one at a time and require far more compute than reading input, so most providers price output two to eight times higher. Verbose responses therefore dominate cost, which is why trimming a token of output saves more than trimming a token of prompt.
TCO extends raw token spend with retries, fallback model calls, observability, evaluation runs and engineering overhead. A realistic budget adds a retry buffer and a fixed monthly platform cost on top of the headline per-token price you see on a provider's pricing page.
Prompt caching bills repeated input context at a steep discount, often around ninety percent off the standard input rate. For agents and RAG systems that resend large system prompts, a high cache hit rate can cut total input cost dramatically without changing output spend.
Open-weight models like DeepSeek typically post far lower per-token API prices than frontier closed models. Whether they are cheaper for you depends on accuracy at your task, since a weaker model that needs retries or longer outputs can erase its headline price advantage.
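To see how retries and longer outputs can erase a price advantage, compare cost per successful request rather than cost per attempt. The sketch below assumes failed calls are retried until one succeeds and that attempts are independent, so the expected number of attempts is 1 / success rate. All prices, token counts and success rates are invented for illustration.

```python
def cost_per_success(
    input_per_million,
    output_per_million,
    input_tokens,
    output_tokens,
    success_rate,
):
    """Expected USD cost of one successful response.

    Assumes independent attempts retried until success, so the expected
    attempt count is 1 / success_rate (mean of a geometric distribution).
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    per_attempt = (
        input_tokens * input_per_million + output_tokens * output_per_million
    ) / 1_000_000
    return per_attempt / success_rate


# Hypothetical cheap model: low rates, but longer outputs and more failures.
cheap = cost_per_success(0.30, 1.20, 1_500, 900, success_rate=0.40)

# Hypothetical stronger model: higher rates, shorter outputs, fewer failures.
strong = cost_per_success(1.00, 4.00, 1_500, 500, success_rate=0.95)

print(f"cheap: ${cheap:.6f}  strong: ${strong:.6f}")
```

With these invented numbers the model that is more than three times cheaper per token ends up slightly more expensive per successful request. Measure first-pass accuracy on your own task before trusting either figure.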
A rough rule is that one token is about four English characters or three quarters of a word. Measure real requests with your provider's tokenizer for accuracy, then enter average input and output token counts into the calculator to model monthly volume.
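The rule of thumb can be turned into a quick estimator for early budgeting. This is only a heuristic sketch with hypothetical function names; it does not call any real tokenizer, and actual counts vary by model and language.

```python
import math


def estimate_tokens_by_chars(text):
    """Roughly four English characters per token."""
    return math.ceil(len(text) / 4)


def estimate_tokens_by_words(text):
    """Roughly three quarters of a word per token."""
    return math.ceil(len(text.split()) / 0.75)


prompt = "Summarise the attached support ticket in two sentences."
print(estimate_tokens_by_chars(prompt), estimate_tokens_by_words(prompt))
```

The two estimates will usually disagree a little. Treat the gap as a reminder to replace them with measured averages from real traffic before you commit to a budget.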
Self-hosting trades per-token API fees for fixed GPU and operations cost. It only wins above a high, steady request volume where the fixed spend is amortised. Below that break-even, pay-as-you-go APIs are usually cheaper and far less operational work.
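The break-even point can be approximated by dividing the fixed monthly spend by the amount saved per request. The sketch below is a simplification with hypothetical names and figures: it treats GPU and operations cost as one flat monthly number and ignores capacity limits and the traffic peaks that would require extra hardware.

```python
import math


def break_even_requests(
    fixed_monthly_cost,
    api_cost_per_request,
    self_host_cost_per_request=0.0,
):
    """Monthly request volume at which self-hosting matches the API bill.

    Returns math.inf when self-hosting saves nothing per request,
    because the fixed spend can then never be recovered.
    """
    saving_per_request = api_cost_per_request - self_host_cost_per_request
    if saving_per_request <= 0:
        return math.inf
    return fixed_monthly_cost / saving_per_request


# Hypothetical: $7,000 per month for GPUs and ops versus $0.0035 per API request.
print(round(break_even_requests(7_000.0, 0.0035)))
```

Under these placeholder numbers the break-even lands at roughly two million requests a month, and it only holds if that volume is steady enough to keep the hardware busy.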
The defaults are illustrative starting points and providers change pricing frequently. Every price field is editable, so paste the current input and output rates from each provider's pricing page to get figures you can rely on for budgeting.
For most production systems the biggest drivers are output token volume, large repeated system prompts, and retries from failed or low-quality responses. Reducing output length, caching context, and improving first-pass accuracy usually cut cost faster than switching models.
Your inputs are saved only in your own browser using local storage so the calculator remembers them on your next visit. Nothing is sent to a server, and you can clear everything at any time with the reset button.