RAG Cost & Architecture Estimator
Teams agonise over which vector database to pick, then discover it's a rounding error on the
invoice. The real cost of retrieval-augmented generation is the context you inject into the model on every query:
top-k × chunk size tokens, billed at the generation model's input rate. This estimator separates the
one-time cost of indexing your corpus from the recurring serving cost, lets you compare managed versus self-hosted
vector databases and optional reranking, and shows exactly which layer your money goes to.
- RAG cost = one-time indexing + recurring (vector DB + reranking + generation).
- Retrieved context = top-k × chunk size — it becomes input tokens on every query.
- Generation usually dwarfs vector-DB cost; embedding the corpus is near-free by comparison.
- Cut context (top-k, chunk size, output) before you optimise infrastructure.
Estimate your RAG cost
| Model | Input $/1M | Output $/1M |
|---|---|---|
Estimated cost
| Generation model | Context/query | Gen $/query | Generation/mo | Total $/mo | Annual |
|---|---|---|---|---|---|
How RAG cost is calculated
The estimator splits RAG into the two phases that bill differently. Indexing is a one-time job: your corpus tokens (number of documents times average tokens per document) are run through an embedding model once, and the resulting vectors (number of chunks times embedding dimension times four bytes) sit in a database. Because embedding rates are tiny, indexing even a large library usually costs only a few dollars; the only recurring part is re-embedding the share of your corpus that changes each month.
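The indexing arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not the estimator's actual code: the function names, the $0.02 per 1M embedding tokens rate, the 1,536-dimension vectors, and the assumption of 500-token chunks with no overlap are all hypothetical example values.

```python
def indexing_cost_usd(num_docs: int, avg_tokens_per_doc: int,
                      embed_price_per_1m: float) -> float:
    """One-time cost of embedding the whole corpus."""
    corpus_tokens = num_docs * avg_tokens_per_doc
    return corpus_tokens / 1_000_000 * embed_price_per_1m


def vector_storage_bytes(num_chunks: int, embedding_dim: int,
                         bytes_per_value: int = 4) -> int:
    """Raw vector size: chunks x dimension x 4 bytes for float32 values."""
    return num_chunks * embedding_dim * bytes_per_value


# Hypothetical corpus: 100,000 documents averaging 2,000 tokens each.
cost = indexing_cost_usd(100_000, 2_000, embed_price_per_1m=0.02)

# Assumed chunking: 500-token chunks, no overlap -> 400,000 chunks.
size = vector_storage_bytes(400_000, 1_536)

print(f"Indexing: ${cost:.2f}, vectors: {size / 1e9:.2f} GB")
```

With these assumed numbers, 200 million corpus tokens cost about four dollars to embed and the raw vectors occupy roughly 2.46 GB, before any index overhead the database adds.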
Serving is where the money is. Every query embeds the question, runs a vector search, optionally reranks the candidates, then sends the top-k retrieved chunks into the generation model as input context. That retrieved context (top-k multiplied by chunk size) is the dominant input on each call, so generation cost scales with how much you retrieve, not with how clever your database is. The estimator prices the vector database according to the option you choose (usage-based managed storage and queries, or a fixed self-hosted instance), adds reranking if enabled, then compares generation models on the same retrieval workload. The breakdown makes the lever obvious: in most setups, trimming context or output saves far more than changing vector stores.
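As a rough sketch of the per-query generation cost described above, the snippet below prices one call and shows what doubling top-k does. The function name, the 300-token allowance for system prompt plus question, and the $3 / $15 per 1M token rates are hypothetical example values, not figures from the estimator.

```python
def generation_cost_per_query(top_k: int, chunk_tokens: int,
                              overhead_tokens: int, output_tokens: int,
                              input_price_per_1m: float,
                              output_price_per_1m: float) -> float:
    """Cost of one generation call in USD."""
    context_tokens = top_k * chunk_tokens             # retrieved context
    input_tokens = context_tokens + overhead_tokens   # + system prompt and question
    return (input_tokens * input_price_per_1m
            + output_tokens * output_price_per_1m) / 1_000_000


# Hypothetical workload: top-k 5, 500-token chunks, 400 output tokens.
per_query = generation_cost_per_query(5, 500, 300, 400, 3.0, 15.0)
doubled_k = generation_cost_per_query(10, 500, 300, 400, 3.0, 15.0)
monthly = per_query * 100_000  # assumed 100,000 queries per month

print(f"${per_query:.4f}/query, ${doubled_k:.4f}/query at top-k 10, "
      f"${monthly:,.0f}/month")
```

Under these assumptions a query costs about $0.0144, or roughly $1,440 a month at 100,000 queries, and doubling top-k lifts the per-query figure to about $0.0219 without touching the vector database at all.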
Frequently Asked Questions (FAQ)
RAG has a one-time indexing cost to embed your corpus, recurring vector database storage and query cost, optional reranking, and generation cost. Generation usually dominates because every query stuffs the retrieved chunks into the model as input context.
Each query sends the retrieved chunks, equal to top-k multiplied by chunk size, into the model as input tokens, plus the system prompt and question. At realistic query volumes this input far outweighs the cents you spend embedding the question and querying the vector database.
Retrieved context equals top-k times chunk size, and that becomes input tokens on every query. Doubling either roughly doubles generation input cost and adds latency, so tune them for recall against budget rather than maximising retrieval blindly.
Managed databases bill storage and queries as you go, which suits variable or smaller workloads. Self-hosting trades that for a fixed monthly instance that wins at high, steady volume. For most teams the vector database is a small fraction of total RAG cost either way.
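One way to see where self-hosting starts to win is to compute the break-even query volume. The sketch below assumes a simple managed pricing shape (a per-GB storage fee plus a per-million-queries fee) and a flat self-hosted instance price; the function names and all prices are hypothetical, and real providers often bill on other dimensions too.

```python
def managed_monthly_cost(storage_gb: float, queries: float,
                         storage_price_per_gb: float,
                         query_price_per_1m: float) -> float:
    """Usage-based bill: storage plus queries."""
    return (storage_gb * storage_price_per_gb
            + queries / 1_000_000 * query_price_per_1m)


def break_even_queries(storage_gb: float, storage_price_per_gb: float,
                       query_price_per_1m: float,
                       self_hosted_monthly: float) -> float:
    """Monthly query volume at which managed cost equals the fixed instance."""
    storage_cost = storage_gb * storage_price_per_gb
    if self_hosted_monthly <= storage_cost:
        return 0.0  # self-hosting is already cheaper before any queries
    return (self_hosted_monthly - storage_cost) / query_price_per_1m * 1_000_000


# Hypothetical prices: $0.30/GB-month, $10 per 1M queries, $150/month instance.
threshold = break_even_queries(2.5, 0.30, 10.0, 150.0)
print(f"Break-even at about {threshold:,.0f} queries/month")
```

With these example prices the crossover sits near 14.9 million queries a month, which illustrates why usage-based pricing tends to suit smaller or spikier workloads.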
Indexing cost equals total corpus tokens divided by one million, multiplied by the embedding model's price. Embedding rates are very low, so a one-time index of even a large corpus is usually inexpensive; the recurring driver is re-embedding refreshed documents.
Reranking adds a per-search fee to score retrieved candidates before passing the best to the model. At high query volumes that per-thousand-searches charge can become a meaningful slice of the bill, so weigh the accuracy gain against the added recurring cost.
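A quick sketch of how a per-thousand-searches reranking fee scales with volume. The function name and the $2 per 1,000 searches rate are hypothetical, and the sketch assumes one rerank call per query.

```python
def rerank_monthly_cost(queries_per_month: int,
                        price_per_1k_searches: float) -> float:
    """Recurring reranking fee, assuming one rerank search per query."""
    return queries_per_month / 1_000 * price_per_1k_searches


# Hypothetical rate of $2 per 1,000 searches at three monthly volumes.
for volume in (10_000, 100_000, 1_000_000):
    print(f"{volume:>9,} queries -> ${rerank_monthly_cost(volume, 2.0):,.2f}/month")
```

At the assumed rate the fee grows linearly from $20 at 10,000 queries to $2,000 at one million, so it is worth comparing against the generation bill at your own volume.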
You only re-embed documents that change or are added. Set the monthly refresh share to the percentage of your corpus that turns over, and the estimator applies the embedding rate to just that portion rather than the whole library each month.
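The refresh calculation can be sketched as follows. The function name, the 5% monthly turnover, and the $0.02 per 1M embedding tokens rate are hypothetical example values.

```python
def monthly_refresh_cost(corpus_tokens: int, refresh_share: float,
                         embed_price_per_1m: float) -> float:
    """Recurring embedding cost for the share of the corpus that changes."""
    if not 0.0 <= refresh_share <= 1.0:
        raise ValueError("refresh_share must be a fraction between 0 and 1")
    refreshed_tokens = corpus_tokens * refresh_share
    return refreshed_tokens / 1_000_000 * embed_price_per_1m


# Hypothetical: 200M-token corpus with 5% of it changing each month.
refresh = monthly_refresh_cost(200_000_000, 0.05, 0.02)
print(f"Monthly re-embedding: ${refresh:.2f}")
```

Under these assumptions only 10 million tokens are re-embedded each month, for about twenty cents.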
Because generation dominates, reduce retrieved context first: lower top-k, use tighter chunks, rerank to fewer final chunks, and trim output length. Switching to a cheaper generation model for routine queries usually saves far more than optimising the vector database.
Storage roughly equals the number of chunks times the embedding dimension times four bytes for float vectors. The estimator shows the resulting size, which is typically modest and rarely the dominant cost compared with per-query generation.
Your inputs are saved only in your browser using local storage so the estimator remembers them next time. Nothing is sent to any server, and the reset button clears everything and restores the default scenario instantly.