Technical information was last verified in April 2026. The AI/LLM field moves fast — re-check official docs if more than 6 months have passed.
Who should read this
TL;DR: If you think RAG is “throw documents into a vector DB and search,” your system will fail in production. You need to design five layers — chunking, retrieval, reranking, query transformation, and evaluation — and each layer’s accuracy compounds multiplicatively into overall reliability. 95% x 95% x 95% x 95% x 95% = 77%. Drop any single layer and the whole pipeline collapses.
This article is for backend and ML engineers bringing RAG systems to production. All benchmarks and ecosystem references are current as of April 2026.
The five-layer architecture
| # | Layer | Role | Failure symptom |
|---|---|---|---|
| 1 | Chunking | Document to semantic units | Retrieval hits the right doc but the answer is incomplete |
| 2 | Retrieval | Query to top-k relevant chunks | Completely irrelevant documents returned |
| 3 | Reranking | Top-k to precision-reordered results | Relevant documents exist but fall outside the ranking window |
| 4 | Query transformation | User question to search-optimized query | Conversational queries fail to retrieve anything useful |
| 5 | Evaluation & monitoring | Pipeline health measurement | Quality degrades silently with no one noticing |
Layer 1 — Chunking strategy
Chunking is the first and most underestimated design decision in any RAG pipeline. Changing your chunking strategy has a bigger impact on retrieval accuracy than swapping your embedding model.
Fixed-size vs semantic vs hierarchical
Fixed-size chunking: Split every 500 tokens. Simple to implement, but cutting mid-sentence destroys meaning. One clinical decision-making study reported just 13% accuracy with fixed-size chunking.
Semantic chunking: Compute embedding similarity between sentences and split where topics shift. Each chunk maps to a single coherent idea. The same study found adaptive chunking achieved 87% accuracy.
Hierarchical chunking: Parent chunks (summaries) point to child chunks (details). Retrieval starts at the parent level and drills down as needed. Best suited for long documents like legal contracts or technical specifications.
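The semantic approach can be sketched in a few lines. This is a toy illustration: the bag-of-words `embed` function below is a stand-in for a real sentence-embedding model, and the `0.2` threshold is an arbitrary assumption you would tune on your own corpus.

```python
import re
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy stand-in for a real sentence-embedding model:
    # a bag-of-words term vector. Swap in your embedding model in practice.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    # Start a new chunk wherever adjacent sentences drift apart semantically.
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append([cur])   # topic shift: open a new chunk
        else:
            chunks[-1].append(cur)
    return chunks
```

With a real embedding model the threshold logic stays the same; only `embed` and `cosine` change.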
Layer 2 — Retrieval: hybrid is the default
Vector similarity search (dense retrieval) alone is not enough.
- Dense only (vector search): recall@10 = 78%
- Sparse only (BM25 keyword): recall@10 = 65%
- Hybrid (dense + sparse): recall@10 = 91%
As of 2026, 72% of production RAG systems use hybrid retrieval. Pinecone, Weaviate, and Qdrant all support hybrid search natively.
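If your store lacks native hybrid support, or you want control over fusion, the dense and sparse result lists can be merged client-side. A common technique is Reciprocal Rank Fusion (RRF); here is a minimal sketch with illustrative document IDs:

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: score each doc by the sum of 1/(k + rank)
    # over every result list it appears in, then sort by fused score.
    # k=60 is the conventional default from the original RRF paper.
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc3", "doc1", "doc7"]    # vector-search order (illustrative IDs)
sparse = ["doc1", "doc9", "doc3"]   # BM25 order
fused = rrf_fuse([dense, sparse])   # doc1 ranks first: high in both lists
```

RRF needs only rank positions, not raw scores, which sidesteps the problem of dense and sparse scores living on incompatible scales.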
```python
# Hybrid search example (Weaviate Python client v4)
import weaviate

client = weaviate.connect_to_local()
documents = client.collections.get("Document")

results = documents.query.hybrid(
    query="chunking strategies for RAG pipelines",
    alpha=0.7,  # 0 = sparse only, 1 = dense only, 0.7 = dense-weighted
    limit=20,   # fetch generously before reranking
)
```
Layer 3 — Reranking
After retrieval pulls the top-20 candidates, a reranker model rescores each chunk’s relevance to the query and compresses the list to the top-5. The retrieval stage casts a wide net for “roughly relevant” chunks; the reranker tightens the ranking to “precisely relevant.”
Leading rerankers: Cohere Rerank, Jina Reranker, and open-source cross-encoder models.
Without reranking, critical information sitting just outside the cutoff — say at position 8 or 15 of the retrieval results — never makes it into the LLM context. This is the primary cause of “the answer exists in the corpus but the LLM doesn’t know it.”
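Structurally, reranking is just “rescore and truncate.” The sketch below uses a toy term-overlap score as a placeholder; in production you would replace `score` with a cross-encoder or a call to a rerank API such as Cohere’s or Jina’s (the function shape here is an assumption, not any vendor’s interface):

```python
def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[str]:
    # Rescore every candidate against the query, keep the best top_n.
    # The score function below is a toy stand-in (fraction of query
    # terms present in the chunk); swap in a cross-encoder in practice.
    q_terms = set(query.lower().split())

    def score(chunk: str) -> float:
        return len(q_terms & set(chunk.lower().split())) / len(q_terms)

    return sorted(chunks, key=score, reverse=True)[:top_n]
```

The key design point survives the toy scorer: retrieval is cheap and broad (top-20), reranking is expensive and narrow (top-5), so the reranker only ever sees candidates the retriever already surfaced.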
Layer 4 — Query transformation
When a user asks “How much does that cost?”, the vector DB has no idea what “that” refers to. Query transformation rewrites the user’s natural language question into a search-optimized form.
- HyDE (Hypothetical Document Embeddings): Ask the LLM to “imagine a document that answers this question and write it,” then search using the embedding of that hypothetical document
- Multi-query: Rewrite a single question into 3-5 alternative phrasings, search each independently, and merge the results
- Conversational context resolution: Explicitly inject the entity “that” refers to (from prior conversation turns) into the query
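The multi-query pattern reduces to “search each rewrite, then merge and deduplicate.” A minimal sketch, assuming `search` is your retriever and `rewrites` come from a rewriting LLM call (both stubbed here with a hypothetical in-memory index):

```python
def multi_query_search(question: str, search, rewrites: list[str]) -> list[str]:
    # Run the original question plus each LLM-generated rephrasing
    # through search(), merging results and deduplicating by ID
    # while preserving first-seen order.
    seen: set[str] = set()
    merged: list[str] = []
    for q in [question, *rewrites]:
        for doc_id in search(q):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

# Hypothetical toy index standing in for a real retriever:
fake_index = {
    "how much does the pro plan cost": ["pricing", "plans"],
    "pro plan pricing": ["pricing", "faq"],
}
docs = multi_query_search(
    "how much does the pro plan cost",
    lambda q: fake_index.get(q, []),
    rewrites=["pro plan pricing"],
)
# docs == ["pricing", "plans", "faq"]
```

First-seen-order dedup is the simplest merge policy; RRF-style fusion (as in the retrieval layer) is a common upgrade when the per-query rankings disagree.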
Layer 5 — Evaluation and monitoring
You cannot improve what you cannot measure. Three core metrics determine the health of a production RAG system:
- Faithfulness — Does the LLM’s answer actually trace back to the retrieved chunks? (hallucination detection)
- Answer Relevancy — Does the answer actually address the original question?
- Context Precision — What fraction of the retrieved chunks were genuinely useful?
Frameworks like RAGAS and DeepEval provide CI pipelines that automate all three measurements. Build a golden dataset of 50-100 test cases and run regression tests on every deployment.
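As a sanity check outside those frameworks, context precision can be approximated in a few lines. Note this is the simplified, unweighted version; RAGAS computes a rank-aware variant, and the relevance labels here are assumed to come from your golden dataset:

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    # Fraction of retrieved chunks that were genuinely useful.
    # `relevant` holds the golden-dataset chunk IDs for this query.
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    return hits / len(retrieved)
```

Low context precision points the finger at the retrieval and reranking layers rather than the LLM: the generator is being handed noise.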
Pitfalls to avoid
Further reading
- LLM Structured Output: JSON Mode vs Function Calling vs Constrained Decoding — Three approaches for structuring RAG output
- Prompt Version Control for Production AI Services — How to version-manage your RAG prompts
- AWS vs GCP vs Azure: 2026 Startup Cost Comparison — Cloud costs for hosting RAG infrastructure