RAG Cost & Architecture Estimator

By Sanjay Saini | Updated: June 12, 2026 | 8 min read

Teams agonise over which vector database to pick, then discover it's a rounding error on the invoice. The real cost of retrieval-augmented generation is the context you inject into the model on every query: top-k × chunk size tokens, billed at the generation model's input rate on every request. This estimator separates the one-time cost of indexing your corpus from the recurring serving cost, lets you compare managed and self-hosted vector databases with optional reranking, and shows which layer your money goes to.

RAG cost = one-time indexing + recurring (vector DB + reranking + generation).
Retrieved context = top-k × chunk size — it becomes input tokens on every query.
Generation usually dwarfs vector-DB cost; embedding the corpus is near-free by comparison.
Cut context (top-k, chunk size, output) before you optimise infrastructure.

Estimate your RAG cost

Corpus & indexing (one-time + refresh)

Documents in corpus

Avg tokens / document

Chunk size (tokens)

Embedding $/1M tokens

Embedding dimensions

Monthly corpus refresh (%)

Query workload

Queries per month

Top-k chunks retrieved

Question tokens

System prompt tokens

Answer (output) tokens

Retry / overhead (%)

Vector database architecture

Architecture

Storage $/GB-month

Query $/1M searches

Instance $/month

Reranking (optional)

Add a reranking step

Rerank $/1K searches

Fixed monthly overhead (USD)

Generation model (compare)

	Model	Input $/1M	Output $/1M

Estimated cost

Generation model	Context/query	Gen $/query	Generation/mo	Total $/mo	Annual

How RAG cost is calculated

The estimator splits RAG into the two phases that bill differently. Indexing is a one-time job: your corpus tokens (documents times average length) are run through an embedding model once, and the resulting vectors — chunks times embedding dimension times four bytes — sit in a database. Because embedding rates are tiny, indexing even a large library is usually a few dollars; the only recurring part is re-embedding the share of your corpus that changes each month.
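The indexing arithmetic described above can be sketched in a few lines of Python. This is a minimal sketch, not the estimator's actual code: the function names are hypothetical, and the corpus size, embedding price and refresh share are illustrative assumptions rather than the tool's defaults.

```python
def indexing_cost_usd(documents: int, avg_tokens_per_doc: int,
                      embed_price_per_1m: float) -> float:
    """One-time cost to embed the whole corpus."""
    corpus_tokens = documents * avg_tokens_per_doc
    return corpus_tokens / 1_000_000 * embed_price_per_1m


def monthly_refresh_cost_usd(one_time_cost: float, refresh_pct: float) -> float:
    """Recurring cost to re-embed the share of the corpus that changes each month."""
    return one_time_cost * refresh_pct / 100


# Illustrative inputs (assumptions, not the estimator's defaults):
# 100,000 documents of 2,000 tokens each, embedded at $0.02 per 1M tokens,
# with 10% of the corpus changing every month.
one_time = indexing_cost_usd(documents=100_000, avg_tokens_per_doc=2_000,
                             embed_price_per_1m=0.02)
refresh = monthly_refresh_cost_usd(one_time, refresh_pct=10)
print(f"one-time: ${one_time:.2f}, refresh: ${refresh:.2f}/month")
```

Under these assumed inputs a 200-million-token corpus costs about four dollars to index once, and well under a dollar a month to keep fresh.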

Serving is where the money is. Every query embeds the question, runs a vector search, optionally reranks the candidates, then sends the top-k retrieved chunks into the generation model as input context. That retrieved context, top-k multiplied by chunk size, is the dominant input on each call, so generation cost scales with how much you retrieve, not with how clever your database is. The estimator prices the vector database under the architecture you choose (usage-based managed storage and queries, or a fixed self-hosted instance), adds reranking if enabled, then compares generation models on the same retrieval workload. The breakdown makes the lever obvious: in most setups, trimming context or output saves far more than changing vector stores.
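A rough Python sketch of the serving side follows. The function names, prices and workload are assumptions chosen for illustration. Two simplifications are also assumptions: the retry/overhead percentage is treated as a multiplier on billable calls, and the small per-query cost of embedding the question is left out. The estimator itself may handle both differently.

```python
def generation_cost_per_query(top_k, chunk_tokens, question_tokens, system_tokens,
                              output_tokens, input_price_per_1m, output_price_per_1m):
    """Cost of one generation call: retrieved context plus prompt in, answer out."""
    context_tokens = top_k * chunk_tokens  # retrieved context
    input_tokens = context_tokens + question_tokens + system_tokens
    return (input_tokens * input_price_per_1m
            + output_tokens * output_price_per_1m) / 1_000_000


def monthly_serving_cost(queries, gen_per_query, vector_db_usd,
                         rerank_per_1k=0.0, overhead_pct=0.0):
    """Monthly bill split by layer."""
    # Assumption: retries/overhead inflate the number of billable calls.
    billable = queries * (1 + overhead_pct / 100)
    generation = billable * gen_per_query
    rerank = billable / 1_000 * rerank_per_1k
    total = generation + rerank + vector_db_usd
    return {"generation": generation, "rerank": rerank,
            "vector_db": vector_db_usd, "total": total}


# Illustrative workload: 5 chunks of 500 tokens, $3 / $15 per 1M input/output tokens.
per_query = generation_cost_per_query(
    top_k=5, chunk_tokens=500, question_tokens=50, system_tokens=200,
    output_tokens=300, input_price_per_1m=3.0, output_price_per_1m=15.0)
bill = monthly_serving_cost(queries=100_000, gen_per_query=per_query,
                            vector_db_usd=70.0)

print(f"per query: ${per_query:.5f}")
for layer, usd in bill.items():
    print(f"{layer:>10}: ${usd:,.2f}")
```

With these assumed numbers, generation comes to roughly $1,275 of a $1,345 monthly bill, which is the pattern the breakdown is meant to expose.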

Frequently Asked Questions (FAQ)

What are the main cost components of a RAG system?

RAG has a one-time indexing cost to embed your corpus, recurring vector database storage and query cost, optional reranking, and generation cost. Generation usually dominates because every query stuffs the retrieved chunks into the model as input context.

Why does generation dominate RAG cost?

Each query sends the retrieved chunks, equal to top-k multiplied by chunk size, into the model as input tokens, plus the system prompt and question. At realistic query volumes this input far outweighs the cents you spend embedding and querying the vector database.

How do chunk size and top-k affect RAG cost?

Retrieved context equals top-k times chunk size, and that becomes input tokens on every query. Doubling either roughly doubles generation input cost and adds latency, so tune them for recall against budget rather than maximising retrieval blindly.
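The "roughly" matters: the system prompt and question are fixed, so doubling retrieval does not quite double input tokens. A small sketch, using assumed token counts and a hypothetical helper name:

```python
def input_tokens(top_k, chunk_tokens, question_tokens=50, system_tokens=200):
    """Input tokens per query: retrieved context plus the fixed prompt parts."""
    return top_k * chunk_tokens + question_tokens + system_tokens


base = input_tokens(top_k=5, chunk_tokens=500)
double_k = input_tokens(top_k=10, chunk_tokens=500)
double_chunk = input_tokens(top_k=5, chunk_tokens=1_000)

print(base, double_k, double_chunk)          # 2750 5250 5250
print(f"{double_k / base:.2f}x input cost")  # 1.91x input cost
```

The larger the retrieved context is relative to the fixed prompt, the closer the ratio gets to a clean 2x.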

Is a managed vector database or self-hosting cheaper?

Managed databases bill storage and queries as you go, which suits variable or smaller workloads. Self-hosting trades that for a fixed monthly instance that wins at high, steady volume. For most teams the vector database is a small fraction of total RAG cost either way.
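One way to see where self-hosting starts to win is to solve for the query volume at which the two bills are equal. The sketch below assumes the simple pricing model described here (storage plus per-query fees versus one fixed instance price). All names and prices are hypothetical, and it ignores the engineering time that self-hosting also costs.

```python
def managed_monthly_usd(storage_gb, storage_price_per_gb, queries, query_price_per_1m):
    """Usage-based bill: storage plus per-search fees."""
    return storage_gb * storage_price_per_gb + queries / 1_000_000 * query_price_per_1m


def break_even_queries(storage_gb, storage_price_per_gb, query_price_per_1m,
                       instance_usd):
    """Monthly query volume above which the fixed instance is cheaper.

    Returns None when queries are free, because no break-even volume exists.
    """
    if query_price_per_1m <= 0:
        return None
    headroom = instance_usd - storage_gb * storage_price_per_gb
    return max(headroom, 0.0) / query_price_per_1m * 1_000_000


# Illustrative prices (assumptions): 10 GB at $0.30/GB-month,
# $20 per 1M searches, versus a $200/month self-hosted instance.
threshold = break_even_queries(10, 0.30, 20.0, 200.0)
print(f"self-hosting wins above {threshold:,.0f} queries/month")
print(f"managed at 1M queries: ${managed_monthly_usd(10, 0.30, 1_000_000, 20.0):.2f}")
```

At these assumed prices the crossover sits near ten million queries a month, which is why smaller or spiky workloads tend to favour managed billing.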

How much does embedding a corpus cost?

Indexing cost equals total corpus tokens divided by one million, multiplied by the embedding model's price. Embedding rates are very low, so a one-time index of even a large corpus is usually inexpensive; the recurring driver is re-embedding refreshed documents.

Does reranking add a lot of cost?

Reranking adds a per-search fee to score retrieved candidates before passing the best to the model. At high query volumes that per-thousand-searches charge can become a meaningful slice of the bill, so weigh the accuracy gain against the added recurring cost.
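To put a number on that trade-off, the sketch below computes the monthly rerank fee and, as one possible way to weigh it, how many input tokens per query you would need to trim for reranking to pay for itself. The function names and the prices ($2 per 1K searches, $3 per 1M input tokens) are illustrative assumptions.

```python
def rerank_monthly_usd(queries, rerank_price_per_1k):
    """Monthly reranking fee at a per-thousand-searches price."""
    return queries / 1_000 * rerank_price_per_1k


def tokens_to_offset_rerank(rerank_price_per_1k, input_price_per_1m):
    """Input tokens that must be trimmed per query for reranking to pay for itself."""
    fee_per_query = rerank_price_per_1k / 1_000
    price_per_token = input_price_per_1m / 1_000_000
    return fee_per_query / price_per_token


fee = rerank_monthly_usd(queries=1_000_000, rerank_price_per_1k=2.0)
offset = tokens_to_offset_rerank(rerank_price_per_1k=2.0, input_price_per_1m=3.0)
print(f"rerank fee: ${fee:,.2f}/month")
print(f"break-even: trim about {offset:.0f} input tokens per query")
```

Under these assumptions, reranking covers its own fee if it lets you send about two fewer 500-token chunks to the model on each query. Any accuracy gain comes on top of that.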

How often do I need to re-index a RAG corpus?

You only re-embed documents that change or are added. Set the monthly refresh share to the percentage of your corpus that turns over, and the estimator applies the embedding rate to just that portion rather than the whole library each month.

What is the cheapest way to cut RAG cost?

Because generation dominates, reduce retrieved context first: lower top-k, use tighter chunks, rerank to fewer final chunks, and trim output length. Switching to a cheaper generation model for routine queries usually saves far more than optimising the vector database.

How large is a vector index in storage?

Storage roughly equals the number of chunks times the embedding dimension times four bytes for float vectors. The estimator shows the resulting size, which is typically modest and rarely the dominant cost compared with per-query generation.
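As a worked example of that formula, here is a short Python sketch. It assumes chunks do not overlap and counts raw float32 vectors only, so metadata and index overhead are excluded. The corpus size, chunk size, 1536-dimension embeddings and storage price are illustrative assumptions.

```python
import math


def index_size_bytes(corpus_tokens, chunk_tokens, dimensions, bytes_per_value=4):
    """Raw vector storage: chunks x dimensions x bytes per value."""
    chunks = math.ceil(corpus_tokens / chunk_tokens)  # assumes no chunk overlap
    return chunks * dimensions * bytes_per_value


size = index_size_bytes(corpus_tokens=200_000_000, chunk_tokens=500, dimensions=1536)
size_gb = size / 1e9  # decimal gigabytes
print(f"{size_gb:.2f} GB")                             # 2.46 GB
print(f"${size_gb * 0.30:.2f}/month at $0.30/GB-month")
```

Even for a 200-million-token corpus the raw vectors come to under 3 GB here, so storage stays well below a dollar a month at the assumed rate.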

Does this calculator store my inputs?

Your inputs are saved only in your browser using local storage so the estimator remembers them next time. Nothing is sent to any server, and the reset button clears everything and restores the default scenario instantly.

Sanjay Saini

Product leader and Agile coach at AgileWoW, writing on agentic AI, LLM cost engineering and developer productivity for AI Dev Day India.