RAG failure modes: diagnosis checklist and mitigation map

Most RAG bugs fall into seven repeatable modes—retrieval miss, stale source, chunking error, ranking error, context overflow, hallucination, and citation mismatch. Diagnose them with retrieval traces and grounded eval, then apply the matching mitigation in the index, retriever, prompt, or generator layer.

TL;DR

When a retrieval-augmented generation system gives the wrong answer, the bug is almost never "the LLM lied." It is one of seven specific failures somewhere in the pipeline. This checklist helps you locate the layer at fault and apply a targeted fix—instead of blindly tweaking the prompt.

How to use this checklist

Run the seven checks in order. Each block contains:

  • Symptom — what the user sees.
  • Diagnostic signal — what to log or measure to confirm.
  • Mitigation map — concrete fixes by pipeline layer.

Stop at the first failure mode whose diagnostic confirms; multiple modes can co-occur, but the dominant one usually drives the visible error. Always run diagnostics on a labeled eval set, not on cherry-picked production traces.

1. Retrieval miss

  • [ ] Symptom: Answer ignores a fact you know is in the corpus.
  • [ ] Diagnostic signal: Log top-k chunks per query. The relevant document is not in the top-k.
  • [ ] Recall@k drops below your evaluation threshold (commonly <80% for top-10).
  • [ ] Mitigation map:
  • Index layer: Re-embed with a stronger or domain-tuned embedding model; add hybrid BM25 + vector retrieval.
  • Retriever layer: Increase k, add query rewriting, add HyDE-style query expansion for vague questions.
  • Data layer: Verify the document is actually indexed (timestamps, ingestion logs, dedup hash collisions).
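
If you already log top-k chunk ids per query, confirming a miss is a few lines of aggregation. A minimal sketch in Python, where traces and gold are hypothetical stand-ins for your retrieval logs and labeled eval set:

```python
# Toy traces and gold labels for illustration; in practice these come
# from your retrieval logs and labeled eval set.
traces = {"q1": ["doc_7", "doc_2", "doc_9"], "q2": ["doc_4", "doc_1"]}
gold = {"q1": "doc_2", "q2": "doc_8"}

def recall_at_k(traces: dict, gold: dict, k: int = 10) -> float:
    """Fraction of eval queries whose gold document id appears in the
    logged top-k for that query."""
    hits = sum(doc_id in traces.get(q, [])[:k] for q, doc_id in gold.items())
    return hits / len(gold)

recall = recall_at_k(traces, gold, k=10)
if recall < 0.80:  # the common top-10 threshold from the checklist
    print(f"Retrieval miss suspected: Recall@10 = {recall:.0%}")
```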

2. Stale source

  • [ ] Symptom: Answer is confidently wrong because it cites an outdated document.
  • [ ] Diagnostic signal: Cited chunk's updated_at is older than the canonical fact's last change.
  • [ ] Mitigation map:
  • Index layer: Add freshness metadata; downrank stale documents in the ranker.
  • Pipeline layer: Schedule re-ingestion; mark deprecated docs with a superseded_by field.
  • Prompt layer: Pass as_of_date in the system prompt and instruct the model to flag stale evidence rather than answer it.
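
A minimal freshness gate over index metadata, assuming chunks carry the updated_at and superseded_by fields described above; your schema may differ:

```python
from datetime import datetime, timedelta

# Hypothetical chunk metadata; field names mirror the checklist
# (updated_at, superseded_by) but your schema may differ.
chunks = [
    {"doc_id": "pricing_v2", "updated_at": datetime(2025, 1, 10)},
    {"doc_id": "pricing_v1", "updated_at": datetime(2023, 6, 2),
     "superseded_by": "pricing_v2"},
]

def filter_stale(chunks, as_of, max_age=timedelta(days=365)):
    """Drop superseded documents outright; flag the rest as stale when
    they are older than the freshness window."""
    live = [dict(c) for c in chunks if "superseded_by" not in c]
    for c in live:
        c["stale"] = (as_of - c["updated_at"]) > max_age
    return live

print(filter_stale(chunks, as_of=datetime(2025, 3, 1)))
# pricing_v1 is dropped; pricing_v2 survives with stale=False
```

Pass the same as_of value into the system prompt so the generator can hedge on anything the gate flagged as stale.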

3. Chunking error

  • [ ] Symptom: Retrieved chunk contains the keyword but not the answer (sentence got split).
  • [ ] Diagnostic signal: Top-k chunks are short, mid-paragraph, or cut tables, lists, or code blocks in half.
  • [ ] Mitigation map:
  • Chunker layer: Switch to semantic or structure-aware chunking (respect headings, sentences, code blocks).
  • Index layer: Add 10-20% overlap and parent-document retrieval (small chunks for matching, full sections for context).
  • Eval layer: Add a chunking unit test: assert that gold spans are not split across chunk boundaries.
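
The eval-layer unit test can be a single assertion. A sketch, where chunk_fn stands for whatever chunker you are testing; the naive fixed-width chunker exists only to show a failing case:

```python
def assert_gold_spans_intact(chunk_fn, document: str, gold_spans: list[str]):
    """Chunking unit test: every labeled evidence span must survive whole
    inside at least one chunk."""
    chunks = chunk_fn(document)
    for span in gold_spans:
        assert span in document, f"span not in source document: {span[:50]!r}"
        assert any(span in chunk for chunk in chunks), (
            f"gold span split across chunk boundaries: {span[:50]!r}"
        )

def naive_chunker(text: str, width: int = 20) -> list[str]:
    """Deliberately bad fixed-width chunker, used to show a failing case."""
    return [text[i:i + width] for i in range(0, len(text), width)]

doc = "The refund window is 30 days from purchase."
assert_gold_spans_intact(naive_chunker, doc, ["refund window is 30 days"])
# raises AssertionError: the span straddles the 20-character boundary
```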

4. Ranking error

  • [ ] Symptom: The right chunk is in top-k but not in top-3—and the LLM ignores it.
  • [ ] Diagnostic signal: nDCG@10 looks fine, but nDCG@3 drops sharply on hard queries.
  • [ ] Mitigation map:
  • Re-ranker layer: Add a cross-encoder re-ranker over the top-50 candidates.
  • Prompt layer: Sort context by relevance score; place the strongest evidence near the start and end of the prompt.
  • Eval layer: Maintain a labeled query→doc set; track Mean Reciprocal Rank and per-bucket nDCG.
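
With the sentence-transformers library, a cross-encoder pass over the candidate pool is a few lines; the checkpoint named below is one common public choice, not a requirement:

```python
from sentence_transformers import CrossEncoder

# Re-ranking sketch; requires the sentence-transformers package.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    """Score each (query, chunk) pair jointly, then keep the best top_n."""
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1],
                    reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```

Cross-encoders read the query and chunk together, so they are slower than bi-encoder retrieval; restricting them to the top-50 candidates keeps latency bounded.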

5. Context overflow

  • [ ] Symptom: Long answers drop key facts that were in the prompt.
  • [ ] Diagnostic signal: "Lost in the middle" pattern—facts in the middle of long context are ignored.
  • [ ] Mitigation map:
  • Retriever layer: Cap context tokens; deduplicate near-duplicate chunks before concatenation.
  • Compression layer: Summarize per-doc before concat; use map-reduce or hierarchical chains for many-doc questions.
  • Prompt layer: Place the most important chunks at the start and end of context; avoid burying gold evidence mid-prompt.
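
Once chunks are sorted best-first, the start-and-end placement is mechanical. A minimal sketch:

```python
def order_for_long_context(chunks_best_first: list[str]) -> list[str]:
    """Interleave so the strongest chunks sit at the start and end of the
    prompt and the weakest land in the middle, where long-context models
    attend least."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

print(order_for_long_context(["r1", "r2", "r3", "r4", "r5"]))
# -> ['r1', 'r3', 'r5', 'r4', 'r2']  (best chunk opens, runner-up closes)
```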

6. Hallucination (ungrounded generation)

  • [ ] Symptom: Answer asserts a fact not present in any retrieved chunk.
  • [ ] Diagnostic signal: Faithfulness/groundedness score (e.g., RAGAS faithfulness) is low; spans cannot be aligned to source text.
  • [ ] Mitigation map:
  • Generator layer: Use a stronger instruction-following model; add an "answer only from context, otherwise say 'I don't know'" guard.
  • Prompt layer: Require structured citation ([doc_id]) on every claim; reject responses missing citations.
  • Verifier layer: Add a post-generation grounding check that re-retrieves each claim and discards unsupported ones.
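
A minimal sketch of the prompt guard plus a citation-required rejection, assuming answers cite evidence as [doc_id]; the regex and function names are illustrative, not a full verifier:

```python
import re

SYSTEM_PROMPT = (
    "Answer using ONLY the provided context. After every claim, cite the "
    "supporting chunk as [doc_id]. If the context does not contain the "
    "answer, reply exactly: I don't know."
)

CITATION = re.compile(r"\[[\w.-]+\]")

def gate_answer(answer: str) -> str:
    """Cheap pre-verifier gate: refuse answers that assert facts without a
    single [doc_id] citation. A real verifier layer still checks each claim."""
    if answer.strip() == "I don't know":
        return answer
    if not CITATION.search(answer):
        raise ValueError("ungrounded answer: no [doc_id] citations found")
    return answer
```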

7. Citation mismatch

  • [ ] Symptom: Answer cites a document that does not actually support the claim.
  • [ ] Diagnostic signal: NLI/entailment check between cited chunk and answer span returns "neutral" or "contradicts."
  • [ ] Mitigation map:
  • Generator layer: Constrain decoding to per-claim citation; emit JSON with claim + evidence_span per item.
  • Verifier layer: Run an entailment classifier; auto-replace failing citations or refuse the answer.
  • UX layer: Surface span-level highlights so reviewers can spot drift quickly during sampling.
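
A minimal entailment gate using the Hugging Face transformers pipeline; roberta-large-mnli is one public NLI model, not a requirement:

```python
from transformers import pipeline

# Entailment gate sketch; roberta-large-mnli is one public NLI model.
nli = pipeline("text-classification", model="roberta-large-mnli")

def citation_supported(cited_chunk: str, claim: str,
                       threshold: float = 0.8) -> bool:
    """True only when the cited chunk entails the answer span with high
    confidence; 'neutral' and 'contradiction' both count as mismatches."""
    result = nli([{"text": cited_chunk, "text_pair": claim}])[0]
    return result["label"] == "ENTAILMENT" and result["score"] >= threshold
```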

Mitigation map at a glance

| Failure mode      | Index                      | Retriever         | Re-ranker      | Prompt                     | Generator         | Verifier              |
|-------------------|----------------------------|-------------------|----------------|----------------------------|-------------------|-----------------------|
| Retrieval miss    | Hybrid + better embeddings | k↑, query rewrite |                |                            |                   |                       |
| Stale source      | Freshness fields           | Filter by as_of   | Downrank stale | as_of_date                 | Flag stale        |                       |
| Chunking error    | Overlap, parent-doc        |                   |                |                            |                   |                       |
| Ranking error     |                            | Top-50 candidates | Cross-encoder  | Reorder context            |                   |                       |
| Context overflow  |                            | Dedupe            |                | Reorder, compress          |                   |                       |
| Hallucination     |                            |                   |                | "Answer only from context" | Stronger model    | Groundedness check    |
| Citation mismatch |                            |                   |                | Per-claim citation         | Structured output | Entailment classifier |

Operational checklist before shipping

  • [ ] Logged retrieval traces (query, top-k, scores) for ≥95% of requests.
  • [ ] Eval set with ≥100 labeled query→answer→evidence triples, refreshed quarterly.
  • [ ] Tracked metrics: Recall@k, nDCG@10, faithfulness, citation accuracy.
  • [ ] Alert on faithfulness <0.85 or citation accuracy <0.9.
  • [ ] Ablation runs: disable re-ranker, disable verifier—confirm each adds measurable lift before keeping it in the pipeline.
  • [ ] Red-team set covering each of the seven failure modes, run on every model or embedding change.
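
The two alert thresholds translate directly into a metrics gate. A sketch, with delivery into your monitoring stack left abstract:

```python
THRESHOLDS = {"faithfulness": 0.85, "citation_accuracy": 0.90}

def eval_alerts(metrics: dict) -> list[str]:
    """Return one alert per tracked metric that fell below its floor."""
    return [
        f"ALERT {name}={metrics[name]:.2f} (floor {floor:.2f})"
        for name, floor in THRESHOLDS.items()
        if metrics.get(name, 1.0) < floor
    ]

print(eval_alerts({"faithfulness": 0.81, "citation_accuracy": 0.93}))
# -> ['ALERT faithfulness=0.81 (floor 0.85)']
```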

For deeper context, see the hub on retrieval-augmented generation, vector search quality metrics, chunking strategies for RAG, and evaluating RAG pipelines.

FAQ

Q: Which RAG failure mode is most common in production?

In most diagnostics, retrieval miss and chunking error account for the majority of grounded-eval regressions, followed by citation mismatch. Hallucination dominates only after the retrieval layer is already strong—weak retrieval makes hallucination look like the bug when it is downstream of a recall problem.

Q: How do I tell a chunking error from a ranking error?

Inspect the top-50 candidates. If the right answer never appears in any chunk, you have a chunking or indexing problem. If it appears but ranks below position 3, you have a ranking problem and need a cross-encoder re-ranker or better query rewriting.
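
That triage rule is easy to automate against logged candidates. A sketch, assuming candidates arrive best-first:

```python
def triage(gold_answer: str, top_candidates: list[str]) -> str:
    """Absent from every candidate: chunking/indexing. Present but ranked
    below 3: ranking. Candidates are the logged top-50, best first."""
    rank = next((i + 1 for i, chunk in enumerate(top_candidates)
                 if gold_answer in chunk), None)
    if rank is None:
        return "chunking/indexing: answer text never reached the candidates"
    if rank > 3:
        return f"ranking: answer present at rank {rank}, add a re-ranker"
    return f"retrieval looks fine at rank {rank}; check the generator"
```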

Q: Will a bigger LLM fix RAG hallucinations?

Partially. Bigger models follow "answer only from context" instructions more reliably, but they cannot fix retrieval miss, stale data, or citation mismatch. Treat the LLM as the last layer—fix the pipeline first, then upgrade the model.

Q: What's the minimum eval set to debug RAG?

Start with about 100 hand-labeled queries with gold answers and gold evidence spans. Track Recall@k, faithfulness, and citation accuracy. Below this size, metric noise will hide real regressions and ablation studies become unreliable.

Q: How often should I re-audit RAG failure modes?

Run the full seven-mode checklist whenever you change embeddings, the chunker, the re-ranker, or the base LLM. At minimum, re-audit quarterly, aligned with the 90-day review cycle for your source documents.

