RAG failure modes: diagnosis checklist and mitigation map

Most RAG bugs fall into seven repeatable modes—retrieval miss, stale source, chunking error, ranking error, context overflow, hallucination, and citation mismatch. Diagnose them with retrieval traces and grounded eval, then apply the matching mitigation in the index, retriever, prompt, or generator layer.

TL;DR

When a retrieval-augmented generation system gives the wrong answer, the bug is almost never "the LLM lied." It is one of seven specific failures somewhere in the pipeline. This checklist helps you locate the layer at fault and apply a targeted fix—instead of blindly tweaking the prompt.

How to use this checklist

Run the seven checks in order. Each block contains:

  • Symptom — what the user sees.
  • Diagnostic signal — what to log or measure to confirm.
  • Mitigation map — concrete fixes by pipeline layer.

Stop at the first failure mode whose diagnostic confirms; multiple modes can co-occur, but the dominant one usually drives the visible error. Always run diagnostics on a labeled eval set, not on cherry-picked production traces.

1. Retrieval miss

  • [ ] Symptom: Answer ignores a fact you know is in the corpus.
  • [ ] Diagnostic signal: Log top-k chunks per query. The relevant document is not in the top-k.
  • [ ] Recall@k drops below your evaluation threshold (commonly <80% for top-10).
  • [ ] Mitigation map:
  • Index layer: Re-embed with a stronger or domain-tuned embedding model; add hybrid BM25 + vector retrieval.
  • Retriever layer: Increase k, add query rewriting, add HyDE-style query expansion for vague questions.
  • Data layer: Verify the document is actually indexed (timestamps, ingestion logs, dedup hash collisions).
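
If you already log top-k chunk ids per query, confirming a miss is a few lines of aggregation. A minimal sketch in Python, where traces and gold are hypothetical stand-ins for your retrieval logs and labeled eval set:

```python
# Toy traces and gold labels for illustration; in practice these come
# from your retrieval logs and labeled eval set.
traces = {"q1": ["doc_7", "doc_2", "doc_9"], "q2": ["doc_4", "doc_1"]}
gold = {"q1": "doc_2", "q2": "doc_8"}

def recall_at_k(traces: dict, gold: dict, k: int = 10) -> float:
    """Fraction of eval queries whose gold document id appears in the
    logged top-k for that query."""
    hits = sum(doc_id in traces.get(q, [])[:k] for q, doc_id in gold.items())
    return hits / len(gold)

recall = recall_at_k(traces, gold, k=10)
if recall < 0.80:  # the common top-10 threshold from the checklist
    print(f"Retrieval miss suspected: Recall@10 = {recall:.0%}")
```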

2. Stale source

  • [ ] Symptom: Answer is confidently wrong because it cites an outdated document.
  • [ ] Diagnostic signal: Cited chunk's updated_at is older than the canonical fact's last change.
  • [ ] Mitigation map:
  • Index layer: Add freshness metadata; downrank stale documents in the ranker.
  • Pipeline layer: Schedule re-ingestion; mark deprecated docs with a superseded_by field.
  • Prompt layer: Pass as_of_date in the system prompt and instruct the model to flag stale evidence rather than answer it.
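
A minimal freshness gate over index metadata, assuming chunks carry the updated_at and superseded_by fields described above; your schema may differ:

```python
from datetime import datetime, timedelta

# Hypothetical chunk metadata; field names mirror the checklist
# (updated_at, superseded_by) but your schema may differ.
chunks = [
    {"doc_id": "pricing_v2", "updated_at": datetime(2025, 1, 10)},
    {"doc_id": "pricing_v1", "updated_at": datetime(2023, 6, 2),
     "superseded_by": "pricing_v2"},
]

def filter_stale(chunks, as_of, max_age=timedelta(days=365)):
    """Drop superseded documents outright; flag the rest as stale when
    they are older than the freshness window."""
    live = [dict(c) for c in chunks if "superseded_by" not in c]
    for c in live:
        c["stale"] = (as_of - c["updated_at"]) > max_age
    return live

print(filter_stale(chunks, as_of=datetime(2025, 3, 1)))
# pricing_v1 is dropped; pricing_v2 survives with stale=False
```

Pass the same as_of value into the system prompt so the generator can hedge on anything the gate flagged as stale.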

3. Chunking error

  • [ ] Symptom: Retrieved chunk contains the keyword but not the answer (sentence got split).
  • [ ] Diagnostic signal: Top-k chunks are short, mid-paragraph, or cut tables, lists, or code blocks in half.
  • [ ] Mitigation map:
  • Chunker layer: Switch to semantic or structure-aware chunking (respect headings, sentences, code blocks).
  • Index layer: Add 10-20% overlap and parent-document retrieval (small chunks for matching, full sections for context).
  • Eval layer: Add a chunking unit test: assert that gold spans are not split across chunk boundaries.
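
The eval-layer unit test can be a single assertion. A sketch, where chunk_fn stands for whatever chunker you are testing; the naive fixed-width chunker exists only to show a failing case:

```python
def assert_gold_spans_intact(chunk_fn, document: str, gold_spans: list[str]):
    """Chunking unit test: every labeled evidence span must survive whole
    inside at least one chunk."""
    chunks = chunk_fn(document)
    for span in gold_spans:
        assert span in document, f"span not in source document: {span[:50]!r}"
        assert any(span in chunk for chunk in chunks), (
            f"gold span split across chunk boundaries: {span[:50]!r}"
        )

def naive_chunker(text: str, width: int = 20) -> list[str]:
    """Deliberately bad fixed-width chunker, used to show a failing case."""
    return [text[i:i + width] for i in range(0, len(text), width)]

doc = "The refund window is 30 days from purchase."
assert_gold_spans_intact(naive_chunker, doc, ["refund window is 30 days"])
# raises AssertionError: the span straddles the 20-character boundary
```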

4. Ranking error

  • [ ] Symptom: The right chunk is in top-k but not in top-3—and the LLM ignores it.
  • [ ] Diagnostic signal: nDCG@10 looks fine, but nDCG@3 drops sharply on hard queries.
  • [ ] Mitigation map:
  • Re-ranker layer: Add a cross-encoder re-ranker over the top-50 candidates.
  • Prompt layer: Sort context by relevance score; place the strongest evidence near the start and end of the prompt.
  • Eval layer: Maintain a labeled query→doc set; track Mean Reciprocal Rank and per-bucket nDCG.
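
With the sentence-transformers library, a cross-encoder pass over the candidate pool is a few lines; the checkpoint named below is one common public choice, not a requirement:

```python
from sentence_transformers import CrossEncoder

# Re-ranking sketch; requires the sentence-transformers package.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    """Score each (query, chunk) pair jointly, then keep the best top_n."""
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1],
                    reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```

Cross-encoders read the query and chunk together, so they are slower than bi-encoder retrieval; restricting them to the top-50 candidates keeps latency bounded.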

5. Context overflow

  • [ ] Symptom: Long answers drop key facts that were in the prompt.
  • [ ] Diagnostic signal: "Lost in the middle" pattern—facts in the middle of long context are ignored.
  • [ ] Mitigation map:
  • Retriever layer: Cap context tokens; deduplicate near-duplicate chunks before concatenation.
  • Compression layer: Summarize per-doc before concat; use map-reduce or hierarchical chains for many-doc questions.
  • Prompt layer: Place the most important chunks at the start and end of context; avoid burying gold evidence mid-prompt.
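
Once chunks are sorted best-first, the start-and-end placement is mechanical. A minimal sketch:

```python
def order_for_long_context(chunks_best_first: list[str]) -> list[str]:
    """Interleave so the strongest chunks sit at the start and end of the
    prompt and the weakest land in the middle, where long-context models
    attend least."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

print(order_for_long_context(["r1", "r2", "r3", "r4", "r5"]))
# -> ['r1', 'r3', 'r5', 'r4', 'r2']  (best chunk opens, runner-up closes)
```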

6. Hallucination (ungrounded generation)

  • [ ] Symptom: Answer asserts a fact not present in any retrieved chunk.
  • [ ] Diagnostic signal: Faithfulness/groundedness score (e.g., RAGAS faithfulness) is low; spans cannot be aligned to source text.
  • [ ] Mitigation map:
  • Generator layer: Use a stronger instruction-following model; add an "answer only from context, otherwise say 'I don't know'" guard.
  • Prompt layer: Require structured citation ([doc_id]) on every claim; reject responses missing citations.
  • Verifier layer: Add a post-generation grounding check that re-retrieves each claim and discards unsupported ones.
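
A minimal sketch of the prompt guard plus a citation-required rejection, assuming answers cite evidence as [doc_id]; the regex and function names are illustrative, not a full verifier:

```python
import re

SYSTEM_PROMPT = (
    "Answer using ONLY the provided context. After every claim, cite the "
    "supporting chunk as [doc_id]. If the context does not contain the "
    "answer, reply exactly: I don't know."
)

CITATION = re.compile(r"\[[\w.-]+\]")

def gate_answer(answer: str) -> str:
    """Cheap pre-verifier gate: refuse answers that assert facts without a
    single [doc_id] citation. A real verifier layer still checks each claim."""
    if answer.strip() == "I don't know":
        return answer
    if not CITATION.search(answer):
        raise ValueError("ungrounded answer: no [doc_id] citations found")
    return answer
```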

7. Citation mismatch

  • [ ] Symptom: Answer cites a document that does not actually support the claim.
  • [ ] Diagnostic signal: NLI/entailment check between cited chunk and answer span returns "neutral" or "contradicts."
  • [ ] Mitigation map:
  • Generator layer: Constrain decoding to per-claim citation; emit JSON with claim + evidence_span per item.
  • Verifier layer: Run an entailment classifier; auto-replace failing citations or refuse the answer.
  • UX layer: Surface span-level highlights so reviewers can spot drift quickly during sampling.
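
A minimal entailment gate using the Hugging Face transformers pipeline; roberta-large-mnli is one public NLI model, not a requirement:

```python
from transformers import pipeline

# Entailment gate sketch; roberta-large-mnli is one public NLI model.
nli = pipeline("text-classification", model="roberta-large-mnli")

def citation_supported(cited_chunk: str, claim: str,
                       threshold: float = 0.8) -> bool:
    """True only when the cited chunk entails the answer span with high
    confidence; 'neutral' and 'contradiction' both count as mismatches."""
    result = nli([{"text": cited_chunk, "text_pair": claim}])[0]
    return result["label"] == "ENTAILMENT" and result["score"] >= threshold
```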

Mitigation map at a glance

| Failure mode      | Index                      | Retriever         | Re-ranker      | Prompt                     | Generator         | Verifier              |
|-------------------|----------------------------|-------------------|----------------|----------------------------|-------------------|-----------------------|
| Retrieval miss    | Hybrid + better embeddings | k↑, query rewrite |                |                            |                   |                       |
| Stale source      | Freshness fields           | Filter by as_of   | Downrank stale | as_of_date                 | Flag stale        |                       |
| Chunking error    | Overlap, parent-doc        |                   |                |                            |                   |                       |
| Ranking error     |                            | Top-50 candidates | Cross-encoder  | Reorder context            |                   |                       |
| Context overflow  |                            | Dedupe            |                | Reorder, compress          |                   |                       |
| Hallucination     |                            |                   |                | "Answer only from context" | Stronger model    | Groundedness check    |
| Citation mismatch |                            |                   |                | Per-claim citation         | Structured output | Entailment classifier |

Operational checklist before shipping

  • [ ] Logged retrieval traces (query, top-k, scores) for ≥95% of requests.
  • [ ] Eval set with ≥100 labeled query→answer→evidence triples, refreshed quarterly.
  • [ ] Tracked metrics: Recall@k, nDCG@10, faithfulness, citation accuracy.
  • [ ] Alert on faithfulness <0.85 or citation accuracy <0.9.
  • [ ] Ablation runs: disable re-ranker, disable verifier—confirm each adds measurable lift before keeping it in the pipeline.
  • [ ] Red-team set covering each of the seven failure modes, run on every model or embedding change.
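
The two alert thresholds translate directly into a metrics gate. A sketch, with delivery into your monitoring stack left abstract:

```python
THRESHOLDS = {"faithfulness": 0.85, "citation_accuracy": 0.90}

def eval_alerts(metrics: dict) -> list[str]:
    """Return one alert per tracked metric that fell below its floor."""
    return [
        f"ALERT {name}={metrics[name]:.2f} (floor {floor:.2f})"
        for name, floor in THRESHOLDS.items()
        if metrics.get(name, 1.0) < floor
    ]

print(eval_alerts({"faithfulness": 0.81, "citation_accuracy": 0.93}))
# -> ['ALERT faithfulness=0.81 (floor 0.85)']
```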

For deeper context, see the hub on retrieval-augmented generation, vector search quality metrics, chunking strategies for RAG, and evaluating RAG pipelines.

FAQ

Q: Which RAG failure mode is most common in production?

In most diagnostics, retrieval miss and chunking error account for the majority of grounded-eval regressions, followed by citation mismatch. Hallucination dominates only after the retrieval layer is already strong—weak retrieval makes hallucination look like the bug when it is downstream of a recall problem.

Q: How do I tell a chunking error from a ranking error?

Inspect the top-50 candidates. If the right answer never appears in any chunk, you have a chunking or indexing problem. If it appears but ranks below position 3, you have a ranking problem and need a cross-encoder re-ranker or better query rewriting.
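
That triage rule is easy to automate against logged candidates. A sketch, assuming candidates arrive best-first:

```python
def triage(gold_answer: str, top_candidates: list[str]) -> str:
    """Absent from every candidate: chunking/indexing. Present but ranked
    below 3: ranking. Candidates are the logged top-50, best first."""
    rank = next((i + 1 for i, chunk in enumerate(top_candidates)
                 if gold_answer in chunk), None)
    if rank is None:
        return "chunking/indexing: answer text never reached the candidates"
    if rank > 3:
        return f"ranking: answer present at rank {rank}, add a re-ranker"
    return f"retrieval looks fine at rank {rank}; check the generator"
```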

Q: Will a bigger LLM fix RAG hallucinations?

Partially. Bigger models follow "answer only from context" instructions more reliably, but they cannot fix retrieval miss, stale data, or citation mismatch. Treat the LLM as the last layer—fix the pipeline first, then upgrade the model.

Q: What's the minimum eval set to debug RAG?

Start with about 100 hand-labeled queries with gold answers and gold evidence spans. Track Recall@k, faithfulness, and citation accuracy. Below this size, metric noise will hide real regressions and ablation studies become unreliable.

Q: How often should I re-audit RAG failure modes?

Run the full seven-mode checklist whenever you change embeddings, the chunker, the re-ranker, or the base LLM. At minimum, re-audit quarterly, aligned with the 90-day review cycle for your source documents.

