Vector databases look cheap at 100k records. At 10 million, the bill becomes painful, and for many use cases, simpler markdown + search often wins anyway.
Most teams treat RAG as the obvious default for any knowledge base. They chunk documents, generate embeddings, load everything into a vector database, and assume the combination will stay scalable and cheap while delivering superior answers. This feels like responsible engineering. It is not.
The false assumption is that RAG is inherently the scalable, low-maintenance solution. In reality it is a seductive trap that quietly turns into expensive over-engineering. Teams convince themselves they are being sophisticated. They are actually building layers of complexity that explode in cost and maintenance once data volume grows. Many later regret not starting simpler.
Last year I worked with a Series B team building an internal knowledge assistant. They started with Pinecone because it was easy. At 800k records the monthly bill was around $90. When they crossed 12 million records with moderate query traffic, costs jumped to $1,150 per month. Retrieval quality was acceptable but not transformative. Engineers spent weeks tuning chunk sizes, adding rerankers, and fighting stale embeddings. The founder finally asked the hard question: how much of this infrastructure existed because it sounded advanced rather than because it was necessary?
Where the Assumption Breaks
The mental model that RAG is always the right default collapses quickly in production.
In most SaaS products, roughly seventy percent of knowledge-base queries are repetitive or narrow, and vector search adds unnecessary overhead for these cases. Embedding pipelines and frequent re-indexing create ongoing costs that teams rarely forecast. Retrieval noise forces heavier context windows downstream, inflating token spend on the LLM side. Maintenance tasks like monitoring drift, updating indexes, and debugging irrelevant results eat engineering time that could go to actual product work.
For slowly changing or well-structured data, simple full-text search plus a capable LLM often matches or beats sophisticated RAG in both speed and total cost.
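To make the "simple full-text search" alternative concrete, here is a minimal sketch of the keyword path: a toy in-memory BM25 scorer over markdown snippets. The corpus, class name, and default parameters are all illustrative; a production setup would more likely use SQLite FTS5, Postgres full-text search, or Lucene rather than hand-rolled scoring.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

class BM25:
    """Tiny in-memory Okapi BM25 scorer; enough for small, slowly changing corpora."""
    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = docs
        self.k1, self.b = k1, b
        self.tokened = [tokenize(d) for d in docs]
        self.N = len(docs)
        self.avgdl = sum(len(t) for t in self.tokened) / self.N
        self.df = Counter()  # document frequency per term
        for toks in self.tokened:
            for term in set(toks):
                self.df[term] += 1

    def score(self, query, doc_idx):
        toks = self.tokened[doc_idx]
        freqs = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in freqs:
                continue
            idf = math.log(1 + (self.N - self.df[term] + 0.5) / (self.df[term] + 0.5))
            tf = freqs[term]
            norm = tf + self.k1 * (1 - self.b + self.b * len(toks) / self.avgdl)
            score += idf * tf * (self.k1 + 1) / norm
        return score

    def top_k(self, query, k=3):
        ranked = sorted(range(self.N), key=lambda i: self.score(query, i), reverse=True)
        return [self.docs[i] for i in ranked[:k]]

# Illustrative corpus: the retrieved text goes straight into the LLM prompt.
docs = [
    "Reset your password from the account settings page.",
    "Billing invoices are emailed on the first of each month.",
    "API rate limits are 100 requests per minute per key.",
]
bm25 = BM25(docs)
context = bm25.top_k("how do I reset my password", k=1)
```

No embeddings, no re-indexing pipeline, no drift monitoring: updating the knowledge base means editing a markdown file.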
The concrete example above is not an edge case. It is the standard trajectory for teams that default to RAG without questioning the actual retrieval need.
How the Cost Curve Shifts Across Stages
Early stage feels innocent. Under 500k records and light query volume, a managed vector DB stays cheap, often $50–$150 per month. You get fast prototyping and decent results with almost zero ops work. The team ships the feature quickly and moves on. Over-engineering regret stays hidden because absolute dollars are small.
Growth stage triggers the first real pain. Between 1 million and 10 million records, costs spike as query volume rises and data updates become frequent. Managed services push monthly bills to $400–$1,200. Teams add reranking and caching to salvage quality, which drives up both latency and downstream token costs. Many start eyeing self-hosted options, but the migration feels heavy after already sinking time into the managed path.
Scale stage demands brutal honesty. At 10 million+ records and sustained traffic, managed vector DB costs easily reach $2,000–$5,000+ monthly. Self-hosted setups can flatten the curve but require real infrastructure discipline. This is when over-engineering regret hits hardest. Many teams discover their fancy RAG delivers only marginal gains over simpler retrieval while consuming disproportionate budget and attention. The smartest ones begin killing unnecessary parts of the stack.
The Brutal Cost Curve at Scale
Here is a realistic breakdown for roughly 10 million 1536-dimension vectors with moderate traffic of 5–10 million queries per month.
| Vector DB Approach | Monthly Cost (10M records) | Cost at 50M records | Main Cost Drivers | When It Hurts Most |
|---|---|---|---|---|
| Pinecone / Weaviate Managed | $300 – $1,200 | $1,500 – $4,000+ | Storage + read/write units | Spiky queries or frequent updates |
| Qdrant / Chroma Cloud | $150 – $600 | $800 – $2,000 | Allocated RAM / vCPU | Heavy filtering and real-time indexing |
| Self-hosted Qdrant / Chroma | $100 – $400 (infra) | $300 – $800 | Compute, storage, ops time | Teams lacking infra expertise |
| Simple Markdown + Hybrid Search | $20 – $80 | $50 – $150 | Storage + LLM tokens only | Almost never |
The table reveals the uncomfortable truth: beyond 10 million records, the cost gap widens sharply. The simplest path (structured markdown or a lightweight database with keyword plus basic hybrid search) stays dramatically cheaper while delivering sufficient quality for most internal or domain-specific use cases.
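The storage side of these bills can be sanity-checked from first principles. The arithmetic below covers only raw vector bytes; the 1.5x index-overhead multiplier is an assumed, illustrative figure for graph-style indexes like HNSW, not a vendor number.

```python
records = 10_000_000
dims = 1536
bytes_per_float = 4  # float32

raw_bytes = records * dims * bytes_per_float
raw_gb = raw_bytes / 1e9  # raw vectors alone, before any index structures

# Graph-based ANN indexes (HNSW and friends) add meaningful overhead on top
# of the raw vectors; 1.5x is an assumption for illustration only.
indexed_gb = raw_gb * 1.5
```

Roughly 61 GB of raw vectors, and closer to 90 GB once indexed, all of which typically needs to sit in RAM or fast storage you pay for every month, whether anyone queries it or not.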
Hidden Costs and Over-Engineering Regret
Most teams only track vector storage and query units. They miss the full picture.
Embedding generation and re-embedding on updates add a steady tax. Retrieval noise increases average context length, directly inflating LLM token costs. Engineering time spent on chunking strategies, reranker tuning, and drift monitoring compounds quietly. Latency creep hurts user experience and indirectly increases churn or support load.
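The token-cost inflation from retrieval noise is easy to quantify with back-of-the-envelope arithmetic. All three inputs below are placeholders, not measured figures: plug in your own query volume, context overhead, and model pricing.

```python
queries_per_month = 1_000_000     # assumed query volume
extra_context_tokens = 1_500      # noisy, low-relevance chunks carried per query
price_per_m_input = 0.50          # USD per 1M input tokens (placeholder price)

extra_tokens = queries_per_month * extra_context_tokens
extra_cost = extra_tokens / 1_000_000 * price_per_m_input  # USD per month
```

With these assumptions the noise alone adds $750 per month in LLM spend, a line item that never appears on the vector database invoice and so rarely gets attributed to the RAG stack.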
“RAG is not intelligence. It is just expensive search with extra steps, and the steps get very expensive at scale.”
The Anchor Insight: Over-Engineering Regret Is the Real Killer
Here is the part that should make you uncomfortable. Most teams do not have a hard retrieval problem. They have a sophistication bias.
They reach for vector databases because semantic search sounds more advanced than boring keyword search. They layer on chunking heuristics, multi-stage pipelines, rerankers, and agentic flows because every tutorial says RAG needs complexity. The result is a fragile, high-maintenance system that is difficult to debug and costly to run.
For the majority of real-world knowledge bases (internal docs, product manuals, support articles, policy repositories), a clean markdown structure with BM25 or lightweight hybrid search plus a strong LLM provides 85–95% of the effective quality at a fraction of the cost and operational burden.
The core mechanism is brutal: retrieval quality plateaus surprisingly early for most practical workloads. Each additional RAG layer buys smaller and smaller gains while costs and complexity grow linearly or worse.
We broke down similar cost explosions in AI systems here.
| Approach Complexity | Typical Setup | Marginal Quality Gain | Real Total Cost Tax | Typical Regret Level |
|---|---|---|---|---|
| Low (Simple) | Markdown + keyword/hybrid search | Baseline | Minimal storage + tokens | Low |
| Medium | Basic vector DB + top-k retrieval | +10–20% | Storage, queries, chunk management | Medium |
| High (Fancy RAG) | Multi-stage + reranker + agents | +5–15% | Ops, latency, token bloat, drift | High |
Internalize this table and you will start questioning every RAG implementation on your roadmap before it becomes legacy debt.
The When-to-Kill-RAG Decision Framework
Use this matrix before committing more resources to vector infrastructure.
- Low volume + low complexity → Skip vector DB. Simple search plus strong prompting is enough.
- Medium volume + medium complexity → Basic vector or hybrid search can work if kept lean.
- High volume + low complexity → Kill the vector layer. Structured search and caching win on cost and speed.
- High volume + high complexity → Consider full RAG, but plan self-hosted migration early and set strict cost/quality thresholds.
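The quadrants above can be sketched as a tiny routing helper. The record thresholds follow the stages used in this article, and the complexity labels are a judgment call rather than a metric; treat the whole thing as a conversation starter, not a policy engine.

```python
def retrieval_recommendation(records: int, query_complexity: str) -> str:
    """Map the volume/complexity matrix to a recommendation.

    Thresholds (10M records for 'high volume') mirror the stages in this
    article; 'low'/'medium'/'high' complexity is an assumed, subjective input.
    """
    high_volume = records >= 10_000_000
    if query_complexity == "low":
        # Repetitive, narrow queries never justify a vector layer.
        return "structured search + caching" if high_volume else "simple search + strong prompting"
    if query_complexity == "medium":
        return "lean vector or hybrid search"
    # High complexity: ambiguous, diverse queries.
    if high_volume:
        return "full RAG, self-hosted early, with strict cost/quality thresholds"
    return "lean vector or hybrid search"
```

Note that only one of the four quadrants ends in full RAG, which is the point of the matrix.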
The framework forces discipline. If your knowledge base is mostly static or well-structured, you are probably already in the quadrant where RAG should be killed or heavily simplified.
Fancy RAG vs Simple Retrieval + LLM: The Real Trade-off
| Decision | What You Gain | What You Pay | When It Breaks |
|---|---|---|---|
| Fancy Multi-stage RAG | Higher recall on ambiguous or rare queries | High infra, token, and ops cost | Beyond 5–10M records or frequent updates |
| Basic Vector DB | Decent semantic search with easy start | Ongoing query costs + maintenance | Data changes often |
| Simple Markdown + Search + LLM | Predictable low cost, low latency, simple | Slightly lower recall on rare edge cases | Extremely diverse or noisy knowledge |
Simple retrieval plus a capable LLM frequently wins because the model itself can reason effectively over reasonably good context. The marginal improvement from perfect retrieval rarely justifies the exponential increase in complexity and cost.
We discussed deeper inference economics, and why token price drops do not always translate to lower bills, in our analysis of inference at scale.
Decision Guidance That Matches Reality
Early stage (under 1M records): RAG via managed service is acceptable if it helps you ship faster. Instrument costs and actual retrieval quality from day one. Avoid premature optimization into advanced techniques.
Growth stage (1M–10M records): Monitor the cost curve weekly. Run parallel experiments with self-hosted options and simpler retrieval. Test dropping vector components entirely on a subset of queries. This is usually the stage where over-engineering regret becomes expensive.
Scale stage (10M+ records): Conduct ruthless A/B testing between full RAG and simplified retrieval. Most teams find they can remove large parts of the pipeline with no measurable drop in user satisfaction. Prioritize cost predictability and engineering velocity over marginal gains in recall.
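One concrete way to score that A/B test is to normalize each pipeline to dollars per satisfactory answer, so infrastructure cost and answer quality land in a single comparable number. Every figure below is a placeholder for your own measurements.

```python
def cost_per_good_answer(monthly_cost: float, queries: int, satisfaction_rate: float) -> float:
    """Dollars spent per query that users rate as satisfactory."""
    return monthly_cost / (queries * satisfaction_rate)

# Illustrative numbers only: substitute your measured costs and ratings.
full_rag = cost_per_good_answer(monthly_cost=3_000, queries=500_000, satisfaction_rate=0.82)
simple = cost_per_good_answer(monthly_cost=400, queries=500_000, satisfaction_rate=0.78)
```

Under these assumed numbers, the fancy pipeline's four-point satisfaction edge costs roughly seven times more per good answer, which is the kind of ratio that makes the simplification decision for you.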
The guidance is not anti-RAG. It is anti-default. Stop letting RAG become the automatic tax on every knowledge feature.
The Two Mistakes That Still Burn Teams
First, teams treat RAG as a one-time setup. They build the pipeline, celebrate the demo, and never revisit whether the complexity remains justified as data and usage evolve. Costs compound quietly while quality plateaus.
Second, they chase perfect retrieval instead of good-enough context for the LLM. They invest weeks tuning embeddings and rerankers when cleaner knowledge organization plus better prompts would solve the problem faster and cheaper.
What Actually Matters
The real question is not whether RAG can work.
It is whether you have the discipline to admit when it has become overkill for your actual use case.
Most teams optimize for impressive technical demos with vector search.
Smart teams optimize for sustainable cost, velocity, and maintainability.
They are willing to kill their own sophisticated infrastructure when simpler approaches deliver better economics.
You already know if your current RAG setup is delivering value proportional to its growing cost.
The uncomfortable part is acting on that knowledge before the bill at 10 million records forces the decision for you.