AI · LLM

RAG Pipeline Design: From Chunking to Retrieval Quality Monitoring

How to architect a production RAG system across five layers — chunking, hybrid retrieval, reranking, query transformation, and evaluation metrics.

Technical information in this article was last verified in April 2026. The AI/LLM field moves fast — re-check official docs if more than six months have passed.

Who should read this

TL;DR: If you think RAG is “throw documents into a vector DB and search,” your system will fail in production. You need to design five layers — chunking, retrieval, reranking, query transformation, and evaluation — and each layer’s accuracy compounds multiplicatively into overall reliability: 95% × 95% × 95% × 95% × 95% ≈ 77%. Drop any single layer and the whole pipeline collapses.

This article is for backend and ML engineers bringing RAG systems to production. All benchmarks and ecosystem references are current as of April 2026.

The five-layer architecture

Layer | Role | Failure symptom
1. Chunking | Document → semantic units | Retrieval hits the right doc but the answer is incomplete
2. Retrieval | Query → top-k relevant chunks | Completely irrelevant documents returned
3. Reranking | Top-k → precision-reordered results | Relevant documents exist but fall outside the ranking window
4. Query transformation | User question → search-optimized query | Conversational queries fail to retrieve anything useful
5. Evaluation & monitoring | Pipeline health measurement | Quality degrades silently with no one noticing

Each layer's accuracy multiplies into overall pipeline reliability.

Layer 1 — Chunking strategy

Chunking is the first and most underestimated design decision in any RAG pipeline. Changing your chunking strategy has a bigger impact on retrieval accuracy than swapping your embedding model.

Fixed-size vs semantic vs hierarchical

Fixed-size chunking: Split every 500 tokens. Simple to implement, but cutting mid-sentence destroys meaning. One clinical decision-making study reported just 13% accuracy.

Semantic chunking: Compute embedding similarity between sentences and split where topics shift. Each chunk maps to a single coherent idea. The same study found adaptive chunking achieved 87% accuracy.

Hierarchical chunking: Parent chunks (summaries) point to child chunks (details). Retrieval starts at the parent level and drills down as needed. Best suited for long documents like legal contracts or technical specifications.
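The semantic approach can be sketched in a few lines: embed each sentence, compare adjacent sentences, and start a new chunk wherever similarity drops below a threshold. The `toy_embed` function below is a bag-of-words stand-in for a real embedding model; the threshold and vocabulary are illustrative, not tuned values.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, embed, threshold=0.3):
    """Split where adjacent-sentence similarity drops below the threshold."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev_vec, vec) < threshold:  # topic shift -> new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks

# Toy bag-of-words "embedding" — a stand-in for a real embedding model
def toy_embed(text, vocab=("rag", "chunking", "pricing", "billing", "invoice")):
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]
```

Swapping `toy_embed` for a real sentence-embedding model is the only change needed to use this on actual documents; the splitting logic stays the same.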

Layer 2 — Retrieval: hybrid is the default

Vector similarity search (dense retrieval) alone is not enough.

  • Dense only (vector search): recall@10 = 78%
  • Sparse only (BM25 keyword): recall@10 = 65%
  • Hybrid (dense + sparse): recall@10 = 91%

As of 2026, 72% of production RAG systems use hybrid retrieval. Pinecone, Weaviate, and Qdrant all support hybrid search natively.

retrieval.py
# Hybrid search example (Weaviate Python client v4)
collection = client.collections.get("Documents")  # assumes an existing collection
results = collection.query.hybrid(
    query="chunking strategies for RAG pipelines",
    alpha=0.7,  # 0 = sparse (BM25) only, 1 = dense only; 0.7 weights toward dense
    limit=20,   # fetch generously before reranking
)
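If your store doesn't support hybrid search natively, the standard vendor-agnostic merge is reciprocal rank fusion (RRF): run dense and sparse retrieval separately, then combine the two ranked ID lists. The document IDs below are hypothetical; `k=60` is the commonly used RRF constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc IDs; each list contributes 1/(k + rank) per doc."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7", "d2"]   # vector-search ranking (hypothetical IDs)
sparse = ["d3", "d1", "d9"]        # BM25 ranking
fused = reciprocal_rank_fusion([dense, sparse])
```

Documents appearing high in both lists ("d3", "d1" here) dominate the fused ranking, which is exactly the behavior you want from hybrid retrieval.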

Layer 3 — Reranking

After retrieval pulls the top-20 candidates, a reranker model rescores each chunk’s relevance to the query and compresses the list to the top-5. The retrieval stage casts a wide net for “roughly relevant” chunks; the reranker tightens the ranking to “precisely relevant.”

Leading rerankers: Cohere Rerank, Jina Reranker, and open-source cross-encoder models.

Without reranking, critical information sitting at positions 4 or 5 in the retrieval results gets dropped from the LLM context. This is the primary cause of “the answer exists in the corpus but the LLM doesn’t know it.”
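The retrieve-wide-then-rerank pattern looks like this in outline. The `overlap_score` function is a deliberately crude stand-in scorer so the sketch runs without dependencies; a production system would replace it with a cross-encoder call (e.g. Cohere Rerank or an open-source cross-encoder).

```python
def overlap_score(query, chunk):
    """Stand-in relevance scorer: query-token overlap.
    A real system would call a cross-encoder model here instead."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def rerank(query, chunks, top_n=5, score=overlap_score):
    """Rescore retrieved candidates and keep only the top_n most relevant."""
    return sorted(chunks, key=lambda ch: score(query, ch), reverse=True)[:top_n]
```

The key design point is the funnel shape: retrieval fetches ~20 candidates cheaply, and the (slower, more accurate) reranker only scores those 20 rather than the whole corpus.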

Layer 4 — Query transformation

When a user asks “How much does that cost?”, the vector DB has no idea what “that” refers to. Query transformation rewrites the user’s natural language question into a search-optimized form.

  • HyDE (Hypothetical Document Embeddings): Ask the LLM to “imagine a document that answers this question and write it,” then search using the embedding of that hypothetical document
  • Multi-query: Rewrite a single question into 3-5 alternative phrasings, search each independently, and merge the results
  • Conversational context resolution: Explicitly inject the entity “that” refers to (from prior conversation turns) into the query
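The multi-query technique above can be sketched as a small orchestration function. The `rewrite` and `search` callables are stubs standing in for an LLM rewriter and a vector-DB query; the merge here is simple rank-preserving deduplication (you could also fuse with RRF).

```python
def multi_query_search(question, rewrite, search, n_variants=3):
    """Rewrite the question into variants, search each, merge de-duplicated results."""
    queries = [question] + rewrite(question, n_variants)
    merged, seen = [], set()
    for q in queries:
        for doc_id in search(q):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

# Stubs for illustration — production would call an LLM and a vector DB
def stub_rewrite(question, n):
    return [f"{question} (variant {i})" for i in range(1, n + 1)]

def stub_search(query):
    return {"what is rag": ["d1", "d2"]}.get(query, ["d2", "d3"])
```

Because each rephrasing retrieves from a slightly different region of embedding space, the merged list covers more of the corpus than any single query would.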

Layer 5 — Evaluation and monitoring

You cannot improve what you cannot measure. Three core metrics determine the health of a production RAG system:

  1. Faithfulness — Does the LLM’s answer actually trace back to the retrieved chunks? (hallucination detection)
  2. Answer Relevancy — Does the answer actually address the original question?
  3. Context Precision — What fraction of the retrieved chunks were genuinely useful?
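To make the third metric concrete, here is a label-based sketch of context precision: given the retrieved chunk IDs in rank order and a golden set of relevant IDs, credit precision@i at each position where a relevant chunk appears. This mirrors the rank-weighted definition used by frameworks like RAGAS, which additionally use an LLM judge instead of hand labels.

```python
def context_precision(retrieved, relevant):
    """Rank-weighted context precision over a retrieved ID list.
    `relevant` is a golden set of chunk IDs known to be useful."""
    hits, total = 0, 0.0
    for i, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            hits += 1
            total += hits / i  # precision@i, credited at each relevant position
    return total / hits if hits else 0.0
```

A score near 1.0 means relevant chunks sit at the top of the retrieval results; a low score means the reranking or retrieval layer is burying them.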

Frameworks like RAGAS and DeepEval provide CI pipelines that automate all three measurements. Build a golden dataset of 50-100 test cases and run regression tests on every deployment.

Pitfalls to avoid

Further reading