Retrieval-Augmented Generation has become the default architecture for enterprise search and knowledge systems. The concept is elegant: instead of hoping the model knows the answer, you retrieve relevant documents and let the model synthesize a response from actual sources. In practice, the gap between a working demo and a reliable production system is where most projects fail.
After building RAG systems across multiple industries and document types, we see the same five problems repeatedly. None of them are about the language model.
1. The Chunking Problem
When you split documents into pieces for embedding and retrieval, every choice creates trade-offs. Fixed-size chunks (e.g., split every 500 tokens) are easy to implement but terrible for structured documents—they’ll split a procedure in half, separate a rule from its exceptions, and cut tables into meaningless fragments.
The damage shows up at query time. A user asks “what’s the return policy for international orders?” and the retriever pulls back a chunk that contains the domestic return policy (because it mentioned “return policy”) while the international exceptions live in the next chunk that didn’t get retrieved.
What to do instead: Match your chunking strategy to your document structure. For policy documents, chunk at section boundaries. For technical manuals, keep procedures as atomic units. For conversational content like emails or tickets, chunk by message or thread. Test retrieval quality with representative queries before committing to a strategy, and expect to iterate. There is no universal chunking approach that works for all content types.
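To make that concrete, here is a minimal sketch of structure-aware chunking: split at section headings rather than at a fixed token count, and fall back to size-based splitting only for oversized sections. The heading pattern and the `max_chars` cap are illustrative assumptions, not a universal recipe; adapt them to your own document formats.

```python
import re

def chunk_by_sections(text: str, max_chars: int = 2000) -> list[str]:
    """Split a document at section headings, falling back to a paragraph-based
    split only for sections that exceed max_chars."""
    # Assumed heading pattern: markdown '#' headings or '1.2.3 Title' style lines.
    heading = re.compile(r"^(#{1,6}\s+.+|\d+(\.\d+)*\s+\S.+)$", re.MULTILINE)

    # Each section runs from one heading to the next (or to end of document).
    starts = [m.start() for m in heading.finditer(text)] or [0]
    if starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    sections = [text[a:b].strip() for a, b in zip(starts, starts[1:] + [len(text)])]

    chunks = []
    for section in sections:
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
        else:
            # Oversized section: split on paragraph boundaries, never mid-sentence.
            # (A single paragraph larger than max_chars stays intact in this sketch.)
            buf = ""
            for para in section.split("\n\n"):
                if buf and len(buf) + len(para) > max_chars:
                    chunks.append(buf.strip())
                    buf = ""
                buf += para + "\n\n"
            if buf.strip():
                chunks.append(buf.strip())
    return chunks
```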
2. The Stale Index Problem
Your documents change. Policies get updated. Products get modified. Procedures get revised. But if your embedding index doesn’t reflect those changes promptly, your search system is confidently returning outdated information—which in many cases is worse than returning nothing.
We’ve seen organizations build impressive RAG systems, launch them, and then realize three months later that the index hasn’t been updated since launch day because nobody built the update pipeline. The initial load was a batch job that someone ran manually, and incremental updates were “planned for phase two.”
What to do instead: Build the update pipeline before you build the search interface. Treat index freshness as a core requirement, not a follow-on feature. Implement change detection on your document sources, incremental re-processing for modified documents, and monitoring that alerts you when the index falls behind. Define an acceptable staleness window (hours? days?) based on your content and enforce it.
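One simple way to approach change detection and staleness monitoring is to fingerprint document content and compare it against what is already indexed. The sketch below is illustrative only; the function names and the 24-hour window are assumptions, not a prescription.

```python
import hashlib
from datetime import datetime, timedelta, timezone

def content_fingerprint(text: str) -> str:
    """Hash normalized document text so we re-embed only when content actually changes."""
    return hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()

def documents_to_reindex(source_docs: dict[str, str],
                         indexed_fingerprints: dict[str, str]) -> list[str]:
    """Return IDs of documents that are new or whose content changed since last indexing."""
    return [
        doc_id for doc_id, text in source_docs.items()
        if indexed_fingerprints.get(doc_id) != content_fingerprint(text)
    ]

def index_is_stale(last_index_run: datetime, staleness_window: timedelta) -> bool:
    """Alert condition: the index has not been refreshed within the agreed window."""
    return datetime.now(timezone.utc) - last_index_run > staleness_window

# Example: flag the pipeline if the index is more than 24 hours behind.
# stale = index_is_stale(last_run, timedelta(hours=24))
```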
3. The Relevance Gap
Vector similarity search finds documents that are semantically related to the query. That’s not always the same as finding documents that answer the query. A question about “employee termination procedures” will match documents about onboarding (both discuss the employment lifecycle), performance reviews (which mention termination as an outcome), and the actual termination policy. Semantic similarity alone can’t reliably distinguish the most relevant from the merely related.
This problem gets worse with large document collections. When you have thousands of documents, even a well-tuned embedding model will return results where several are topically adjacent but not actually useful for the specific question.
What to do instead: Don’t rely on vector search alone. Hybrid retrieval—combining semantic search with keyword matching and metadata filtering—consistently outperforms pure vector search in enterprise settings. If a user asks about “policy MA-2024-031,” keyword matching finds it instantly while vector search might rank it below semantically similar but wrong documents. Add metadata filters (document type, department, date range) to narrow results before ranking. And build a retrieval evaluation set—a collection of queries with known relevant documents—so you can measure and optimize relevance systematically.
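As a rough sketch of the hybrid approach, the snippet below filters candidates on metadata first and then merges the vector and keyword rankings with reciprocal rank fusion, one common way to combine ranked lists. The helper names, the metadata fields, and the constant `k=60` are illustrative assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists (e.g., one from vector search, one from keyword
    search): each document scores the sum of 1 / (k + rank) across the lists."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def filter_by_metadata(doc_ids, metadata, **required):
    """Keep only documents whose metadata matches every required field,
    e.g. doc_type='policy', department='finance'."""
    return [
        d for d in doc_ids
        if all(metadata.get(d, {}).get(key) == value for key, value in required.items())
    ]

# Usage sketch: filter each candidate list on metadata, then fuse the rankings.
# allowed = set(filter_by_metadata(vector_hits + keyword_hits, doc_metadata, doc_type="policy"))
# fused = reciprocal_rank_fusion([[d for d in vector_hits if d in allowed],
#                                 [d for d in keyword_hits if d in allowed]])
# top_results = fused[:5]
```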
4. The Context Window Trap
Modern language models have large context windows—100K tokens or more. It’s tempting to solve retrieval problems by stuffing more documents into the context: “if we’re not sure which document is relevant, include all of them and let the model figure it out.”
This approach degrades answer quality in subtle ways. Research consistently shows that models perform worse on information in the middle of long contexts than on information at the beginning or end. More context also means more opportunities for the model to get distracted by tangentially related information. And practically, more tokens mean higher latency and cost.
What to do instead: Treat retrieval as a precision problem, not a recall problem. It’s better to give the model three highly relevant passages than thirty somewhat relevant ones. Invest in retrieval quality—better chunking, better ranking, better filtering—rather than compensating for poor retrieval with brute-force context. If you’re regularly stuffing the context window because you can’t reliably find the right documents, that’s a retrieval problem to solve, not a context window to fill.
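If it helps to see the idea in code, here is a deliberately small sketch of precision-first context selection: cap the number of passages and require a minimum relevance score instead of passing along everything the retriever returned. The cap and threshold are placeholders to be tuned against your own evaluation set.

```python
def select_context(ranked_passages: list[tuple[str, float]],
                   max_passages: int = 3,
                   min_score: float = 0.75) -> list[str]:
    """Favor precision over recall: pass the model only a few passages that clear
    a relevance threshold. If nothing clears it, returning an empty list (and
    letting the system say it doesn't know) beats padding the prompt with noise."""
    selected = [text for text, score in ranked_passages if score >= min_score]
    return selected[:max_passages]
```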
5. The Evaluation Gap
This might be the most damaging pitfall because it enables all the others. Without systematic evaluation, you don’t know how well your RAG system is working. You rely on anecdotal feedback (“it seems pretty good”), occasional user complaints, and vibes.
The problem is that RAG failures are often silent. The system returns a confident, well-written answer that happens to be sourced from an outdated document, or that’s technically accurate but doesn’t address what the user actually needed. Users who get wrong answers may not report them—they might not even realize the answer was wrong until it causes a downstream problem.
What to do instead: Build evaluation into the system from day one. This means:
- Retrieval evaluation. For a set of test queries, are the right documents being retrieved? Measure precision and recall at your typical retrieval depth (top 3, top 5, whatever you pass to the model); a minimal measurement sketch follows below.
- Answer evaluation. For queries with known answers, does the system produce correct, well-sourced responses? This requires some human review, but you can build a scalable sampling process.
- Freshness monitoring. Are the documents being returned current? Track the age of retrieved documents relative to the latest version available.
- User feedback loops. Make it easy for users to flag bad answers. Track these signals and use them to identify systematic issues.
Run this evaluation suite regularly—not just at launch, but weekly or after any pipeline change. Quality regressions happen gradually and are easy to miss without measurement.
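For the retrieval-evaluation piece, the sketch below shows one way to compute precision@k and recall@k over a labeled query set. Here `search_fn` stands in for whatever retrieval call your pipeline exposes, and the shape of the evaluation set is an assumption for illustration.

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """Precision@k: fraction of the top-k results that are relevant.
    Recall@k: fraction of the relevant documents that appear in the top k."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def evaluate_retrieval(eval_set: list[dict], search_fn, k: int = 5) -> dict[str, float]:
    """Average precision@k and recall@k over an evaluation set of
    {'query': ..., 'relevant_ids': [...]} entries. search_fn(query, k) is
    assumed to return a ranked list of document IDs from your retriever."""
    precisions, recalls = [], []
    for case in eval_set:
        retrieved = search_fn(case["query"], k)
        p, r = precision_recall_at_k(retrieved, set(case["relevant_ids"]), k)
        precisions.append(p)
        recalls.append(r)
    n = len(eval_set) or 1
    return {"precision_at_k": sum(precisions) / n, "recall_at_k": sum(recalls) / n}
```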
The Common Thread
All five of these pitfalls share a root cause: treating RAG as a model problem when it’s actually a data engineering problem. The model is the easy part. The hard parts are document processing, chunking, indexing, freshness, retrieval quality, and evaluation—all data pipeline concerns.
Organizations that succeed with enterprise RAG are the ones that invest in these foundational capabilities rather than chasing the latest model upgrade. A well-built retrieval pipeline with a good-enough model will outperform a state-of-the-art model sitting on top of a poor pipeline, every time.
If you’re planning a RAG implementation or struggling with one that’s not meeting expectations, start with these five areas. The fixes are rarely glamorous, but they’re where the actual quality improvements live.