What Is RAG (Retrieval-Augmented Generation)

Retrieval-augmented generation (RAG) is an architecture where a language model answers a question by first retrieving relevant passages from an external source, then conditioning its generation on those passages. RAG reduces hallucinations, lets answers cite specific sources, and keeps knowledge fresh without retraining the model.

TL;DR

A RAG system has two halves: a retriever that finds relevant passages from a corpus (vectors, BM25, or both), and a generator (an LLM) that writes the answer using those passages as context. RAG was introduced by Lewis et al. at Meta AI in 2020 and is now the default architecture for ChatGPT search, Claude with project knowledge, Perplexity, Google AI Overviews, and almost every "chat with your docs" product. If you want AI engines to cite your content, you need to be retrievable by the RAG step that sits in front of the LLM.

Definition

Retrieval-augmented generation (RAG) is a technique that augments a generative language model with a non-parametric memory — typically a searchable index of documents, passages, or facts — so that the model conditions its output on retrieved evidence rather than relying solely on what was baked into its weights during training.

The term was introduced by Patrick Lewis and colleagues at Meta AI in the 2020 NeurIPS paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". The paper combined a pre-trained seq2seq generator (BART) with a dense retriever (DPR) over a Wikipedia vector index, and showed that the resulting models produced "more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline" on open-domain question answering.

In modern usage, "RAG" is broader than the original architecture. Today the term refers to any pipeline where:

  1. A user query triggers a search against an external knowledge source.
  2. The top-K retrieved passages are inserted into the LLM's prompt as context.
  3. The LLM generates an answer that is expected to be grounded in (and ideally cite) those passages.

The knowledge source can be a vector index of company docs, the open web, a SQL database, a knowledge graph, or any combination. The generator can be GPT-4, Claude, Gemini, Llama, or a local model. The defining property is the explicit retrieval step that supplies fresh, source-linked context at inference time — separating what the model knows from what the system can look up.
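To make those three steps concrete, here is a minimal sketch of the retrieve-then-generate loop, assuming a toy in-memory corpus: the bag-of-words cosine stands in for a real embedding model, and the assembled prompt stands in for the actual LLM call.

import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": word counts. A real system uses an encoder model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

corpus = {  # illustrative two-document knowledge source
    "doc1": "RAG retrieves passages before the LLM generates an answer.",
    "doc2": "Fine-tuning updates model weights on new training data.",
}

def answer(query: str, k: int = 1) -> str:
    q = embed(query)  # step 1: search the external knowledge source
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(corpus[d])), reverse=True)
    passages = "\n".join(f"[{d}] {corpus[d]}" for d in ranked[:k])
    # steps 2-3: insert the top-K passages into the prompt, then call the LLM
    return (f"Answer using only the passages below, citing source IDs.\n\n"
            f"{passages}\n\nQuestion: {query}")

print(answer("How does RAG work?"))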

Why it matters

Language models have two well-known weaknesses that RAG directly addresses.

Hallucinations. A pure parametric model fills in plausible-sounding text when it does not know the answer, because that is what its training objective rewards. The Lewis et al. paper explicitly cites this as a motivation: parametric memory is hard to update, hard to inspect, and "may produce hallucinations." Retrieving real passages and asking the model to cite them is the most reliable known mitigation.

Stale knowledge. Pre-training cuts off at a fixed date. Anything that happened, changed, or was published after that date is invisible to the model. RAG sidesteps this by reading from a freshly indexed corpus at query time — the same reason ChatGPT's browse mode, Perplexity, and Google AI Overviews can answer questions about today's news.

For practitioners, RAG matters for three concrete reasons:

  • Citations are a retrieval problem first. An AI engine can only cite what its retriever surfaces. Content that isn't retrieved isn't cited — no matter how authoritative or how well-ranked it would be in a traditional SERP.
  • Freshness is a competitive advantage. RAG-based engines reward sources that update on a regular cadence and signal that updates clearly (modified dates, version notes, changelogs).
  • Domain-specific accuracy without retraining. Enterprises deploy RAG over internal docs to get domain answers without paying for fine-tuning or risking proprietary data leakage into model weights.

For builders, RAG matters because it shifts the engineering surface from prompt-only to a retrieval pipeline whose recall, precision, and latency now dominate user-perceived quality. A great LLM with poor retrieval produces confidently wrong answers; a modest LLM with great retrieval produces useful, citable answers.

How it works

A modern RAG pipeline has six logical stages: ingestion, retrieval, reranking, augmentation, generation, and evaluation. The diagram below shows the runtime path of a query.

flowchart LR
    Q["User query"] --> E1["Query encoder"]
    E1 --> R["Retriever (vector + BM25)"]
    KB["Knowledge base (docs, web, KG)"] --> I["Ingestion (chunk + embed)"]
    I --> VDB["Vector index (HNSW / IVF)"]
    VDB --> R
    R --> RR["Reranker (cross-encoder)"]
    RR --> P["Top-K passages"]
    P --> A["Prompt augmentation"]
    Q --> A
    A --> G["LLM generator"]
    G --> ANS["Answer + citations"]

1. Ingestion

Documents are split into chunks (typically 200-800 tokens with 50-100 token overlap), each chunk is embedded with an encoder model, and the resulting vectors are written to a vector index alongside metadata (URL, title, section, last-updated). Most systems also build a parallel BM25 index for hybrid retrieval.
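A minimal sketch of that chunking step, assuming whitespace-separated words as a stand-in for model tokens (production systems count real tokenizer tokens):

def chunk_text(text: str, size: int = 400, overlap: int = 80) -> list[str]:
    # Slide a window of `size` tokens, stepping by size - overlap so that
    # consecutive chunks share `overlap` tokens of context.
    tokens = text.split()
    step = size - overlap
    return [" ".join(tokens[i:i + size])
            for i in range(0, max(len(tokens) - overlap, 1), step)]

Each chunk is then embedded and written to the index together with its metadata (URL, title, section, last-updated).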

2. Retrieval

At query time, the user's question is embedded with the same model used for the corpus, the index returns the top-N nearest passages by cosine similarity (or a hybrid score with BM25), and a metadata filter narrows by source, freshness, or access permission. Hybrid retrieval combining BM25 and embeddings has become standard because the two methods catch different failure modes, and Anthropic's Contextual Retrieval work reports its best results from exactly this combination.
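One widely used way to compute that hybrid score is reciprocal rank fusion (RRF), which merges the BM25 and vector rankings without tuning any weights. A minimal sketch, assuming each retriever returns an ordered list of document IDs:

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # k = 60 is the constant from the original RRF paper; documents that
    # appear high in several rankings accumulate the largest scores.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf([
    ["d3", "d1", "d7"],   # BM25 ranking
    ["d1", "d3", "d9"],   # vector ranking
])  # documents found by both retrievers rise to the top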

3. Reranking (optional but common)

A cross-encoder reranker scores each candidate passage against the query directly, producing a smaller top-K with much higher precision. Rerankers are slower than bi-encoders but only need to score 50-200 candidates, so the latency cost is manageable.
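A sketch of that reranking step using the sentence-transformers library; the checkpoint name is one widely used public reranker, not a requirement.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], k: int = 5) -> list[str]:
    # Score each (query, passage) pair jointly -- slower than comparing
    # precomputed embeddings, but much more precise.
    scores = reranker.predict([(query, p) for p in candidates])
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [passage for _, passage in ranked[:k]]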

4. Augmentation

The top-K passages are formatted into the prompt with delimiters, source IDs, and instructions like "answer using only the passages below; cite source IDs." This is where prompt engineering meets retrieval engineering — small format changes (numbered citations, structured XML tags, explicit "if not in context, say so" instructions) measurably improve answer faithfulness.
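A minimal sketch of the augmentation step; the delimiters and instruction wording here are illustrative, since vendors publish their own recommended templates.

def build_prompt(query: str, passages: list[dict]) -> str:
    # Each passage dict is assumed to carry "id", "url", and "text" fields.
    context = "\n\n".join(
        f'<passage id="{p["id"]}" source="{p["url"]}">\n{p["text"]}\n</passage>'
        for p in passages
    )
    return (
        "Answer using only the passages below, citing passage ids in "
        "brackets, e.g. [2]. If the answer is not in the passages, say "
        "you do not know.\n\n"
        f"{context}\n\nQuestion: {query}"
    )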

5. Generation

The LLM produces an answer conditioned on the augmented prompt. The original Lewis et al. paper compared RAG-Sequence (one retrieved set conditioning the whole answer) with RAG-Token (the model can draw on a different retrieved passage for each generated token); production systems overwhelmingly use the simpler RAG-Sequence pattern with a single retrieval call per turn.

6. Evaluation

Production RAG includes offline eval (recall@K, MRR, faithfulness, answer correctness) and online eval (CTR, thumbs feedback, citation click-through). Without evaluation, regressions in chunking, embedding models, or prompts go unnoticed until users complain.
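Two of those offline metrics take only a few lines to compute; a sketch, assuming each query in the labeled set is annotated with the passage IDs that actually answer it:

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the relevant passages that appear in the top K results.
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    # Reciprocal rank of the first relevant passage, 0 if none was retrieved.
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

Average both across the query set and track them per release, so a chunking or embedding-model change that hurts retrieval shows up before users notice.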

RAG vs fine-tuning vs answer grounding

RAG, fine-tuning, and answer grounding solve overlapping problems with different trade-offs.

| Property | RAG | Fine-tuning | Answer grounding (citations only) |
| --- | --- | --- | --- |
| Knowledge update cost | Re-index documents | Re-train the model | Trivial — just re-crawl |
| Citation support | Native (cite retrieved passages) | None | Native |
| Domain adaptation | High via corpus | High via training data | Low |
| Hallucination risk | Reduced if retrieval is good | Reduced for in-distribution queries | Moderate — depends on grounding strictness |
| Compute cost at training | Low | High | None |
| Compute cost at inference | Higher (retrieval call) | Same as base model | Higher (web fetch) |
| Best for | Fresh, citable, evolving knowledge | Style, format, narrow tasks | Open-web Q&A |
| Examples | ChatGPT search, Claude project knowledge | A code-completion model fine-tuned on internal repos | Browse-only mode without a vector store |

The practical guidance most teams converge on:

  • Use RAG when the answer depends on facts that change, on domain-specific docs the base model has not seen, or when citations are required.
  • Use fine-tuning for stable behaviors (tone, structured output, narrow classification) where retrieving examples each time would be wasteful.
  • Use answer grounding — a lightweight cousin of RAG that fetches a few URLs at query time without a persistent index — when you cannot run an indexing pipeline but want fresher-than-training-cutoff coverage.

Most mature production stacks combine all three: a fine-tuned generator for tone and structure, RAG over internal docs for domain knowledge, and live web grounding for the long tail of open questions.

Practical applications

RAG is the architectural backbone behind a growing list of production systems.

1. AI search engines (ChatGPT, Claude, Perplexity, Google AI Overviews)

When you ask Perplexity a question, it retrieves a small set of web pages, ranks them, and asks an LLM to write an answer with inline citations. ChatGPT's browse and search modes follow the same pattern; Google AI Overviews adds Google's own ranking signals to the retrieval step. Every citation a user sees was produced by a RAG-style retrieval step.

2. "Chat with your docs" enterprise assistants

Claude Projects with retrieval enabled is a canonical example. Anthropic's help-center documentation describes the pattern: Claude "uses a project knowledge search tool to retrieve relevant information from your uploaded documents" instead of loading everything into context. The same architecture powers Notion AI, Glean, and a long tail of internal copilots.

3. Customer-support copilots

Support bots ground their answers in product docs, knowledge-base articles, and past tickets. RAG lets them cite the exact KB article that supports each step, which makes deflection metrics auditable and reduces escalations driven by hallucinated solutions.

4. Internal Q&A and policy lookup

Finance, legal, HR, and security teams run RAG over policy PDFs, contracts, and SOC 2 controls. The win is twofold: faster lookups and built-in provenance for compliance reviews.

5. Developer assistants over private code and docs

Private-repo Q&A tools (Cody, GitHub Copilot Workspace, Cursor's docs feature) use RAG to surface relevant code, design docs, and ADRs at the right moment in a developer's workflow. Retrieval over codebases pairs especially well with structure-aware chunking (functions, classes) and language-specific embeddings.

6. Multimodal RAG

OpenAI's cookbook has examples of RAG over images using GPT-4o vision plus a vector store, demonstrating that the retrieval-then-generate pattern extends naturally to charts, diagrams, and screenshots. Multimodal RAG is the foundation of emerging "chat with your dashboards" products.

A single useful checklist applies across all six: pick a strong embedding model, tune chunking, add a reranker before generation, instrument retrieval-quality evaluation, and never let the LLM answer without retrieved context for queries that require domain knowledge.

Examples of RAG systems

Five concrete RAG implementations to anchor the concept:

  1. Lewis et al. 2020 (the original). BART generator + DPR retriever over a Wikipedia dump, fine-tuned end-to-end. Set state-of-the-art on three open-domain QA benchmarks. The reference architecture every modern RAG system traces back to.
  2. Perplexity. Live web retrieval over a freshly crawled index, plus a custom answer model. Citations are the product surface; retrieval quality is what users actually grade Perplexity on.
  3. ChatGPT search & file search. OpenAI's Responses API exposes a file_search tool and a web_search tool. The cookbook example "Multi-Tool Orchestration with RAG approach" shows how queries are routed across an internal vector store (Pinecone) and live web search depending on the question.
  4. Claude Projects with retrieval. Anthropic's help center describes the project-knowledge search tool that turns uploaded docs into a retrievable corpus, allowing "up to 10x more content" than fits in a single context window while maintaining response quality.
  5. Anthropic Contextual Retrieval. A 2024 Anthropic recipe that prepends a short LLM-generated context summary to each chunk before embedding, then retrieves with a BM25 + embeddings hybrid. Anthropic reports significant retrieval-quality gains over naive chunking, the technique has been widely adopted, and a sketch of the contextualization step follows this list.
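A minimal sketch of that contextualization step, assuming the anthropic Python SDK; the prompt paraphrases the published recipe rather than reproducing it, and the model name is just one inexpensive choice.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def contextualize(document: str, chunk: str) -> str:
    prompt = (
        f"<document>\n{document}\n</document>\n\n"
        f"Here is a chunk from that document:\n<chunk>\n{chunk}\n</chunk>\n\n"
        "Write a short context (50-100 tokens) that situates this chunk "
        "within the document, to improve search retrieval of the chunk. "
        "Answer with only the context."
    )
    msg = client.messages.create(
        model="claude-3-5-haiku-latest",  # assumption: any small, cheap model works
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}],
    )
    # Prepend the generated context, then embed and BM25-index the result.
    return msg.content[0].text + "\n\n" + chunk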

Reference stacks built on these patterns include LangChain, LlamaIndex, Haystack, and the cloud RAG offerings from AWS Bedrock, Azure AI Search, and Vertex AI Search. Each one packages the same five-stage pipeline with different defaults.

Common mistakes

Five failure modes that show up repeatedly in production RAG.

  • Naive chunking. Splitting on a fixed character count regardless of structure breaks sentences, separates headings from their content, and loses context. Structure-aware chunking (heading-aware, sentence-bounded, with overlap) is one of the highest-leverage fixes available; see the sketch after this list.
  • Mixing embedding models. Embedding the corpus with one model and queries with another silently destroys recall. Standardize on a single embedding model per index and re-embed everything when you upgrade.
  • Skipping the reranker. Vector recall@50 is high; vector precision@5 often is not. A cross-encoder reranker over the top-50 typically improves answer faithfulness more than swapping the LLM.
  • No retrieval-quality eval. Teams measure end-to-end answer quality but cannot tell whether regressions came from retrieval, prompts, or the LLM. Track recall@K and MRR on a labeled query set as a first-class metric.
  • Prompt leakage and context bleed. Without explicit instructions like "answer only from the passages below; if not present, say you do not know," models default to their parametric memory and contradict the retrieved context. Anthropic and OpenAI both publish prompt templates for grounded answering; use them.
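As promised in the first bullet, a minimal sketch of structure-aware chunking, assuming markdown-style "#" headings; real implementations also respect sentence boundaries and add overlap.

import re

def heading_chunks(doc: str, max_words: int = 400) -> list[str]:
    # Split before each heading so every chunk keeps its heading attached.
    sections = re.split(r"\n(?=#{1,6} )", doc)
    chunks: list[str] = []
    for section in sections:
        words = section.split()
        if len(words) <= max_words:
            chunks.append(section)
        else:
            # Oversized section: fall back to fixed-size splits inside it.
            chunks.extend(" ".join(words[i:i + max_words])
                          for i in range(0, len(words), max_words))
    return [c for c in chunks if c.strip()]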

A bonus mistake worth flagging: shipping a RAG system without freshness signals. If your indexer updates weekly but the corpus changes daily, the LLM happily cites stale content. Aligning ingestion cadence with content cadence is part of the RAG contract, not a backlog item.

FAQ

Q: What is RAG in one sentence?

RAG is an architecture where a language model answers a query by first retrieving relevant passages from an external knowledge source and then generating a response conditioned on those passages, enabling fresh and citable answers.

Q: When should I use RAG instead of fine-tuning?

Use RAG when the answer depends on facts that change frequently, on documents the base model has not seen, or when citations are required. Use fine-tuning for stable behaviors like tone, format, and narrow classification. The two are complementary — most production systems use RAG for knowledge and fine-tuning for behavior.

Q: Does RAG eliminate hallucinations?

No, but it materially reduces them when retrieval is good and prompts instruct the model to answer only from the retrieved context. Hallucinations remaining after RAG usually trace back to retrieval misses (the right passage was not retrieved) or weak grounding instructions in the prompt.

Q: What's the difference between RAG and answer grounding?

Answer grounding is the broader idea of attributing answers to sources at generation time. RAG is one specific way to do it: build an index, retrieve at query time, and condition the LLM on retrieved passages. Lightweight grounding can also be done by fetching a few URLs live without a persistent index.

Q: How big does my corpus need to be for RAG?

RAG is useful from dozens of documents up to billions of passages. Below ~100,000 chunks an in-memory store or pgvector is typically enough; above that, a dedicated vector database (Pinecone, Weaviate, Qdrant, Milvus, Vespa) earns its keep with managed ANN, metadata filtering, and hybrid retrieval.

Q: Will AI search engines retrieve my content for their RAG pipelines?

They will if your content is crawlable, well-structured, and matches the queries users send. AI engines run their own retrieval over their own indexes — you cannot upload vectors to them. The implication is that on-page clarity, passage-level structure, and entity-rich content are your levers for retrieval inclusion.

Q: How do I measure RAG quality?

Use a labeled set of representative queries with ground-truth passages or answers. Track retrieval metrics (recall@K, MRR, nDCG) separately from generation metrics (faithfulness, answer correctness, citation accuracy). End-to-end metrics alone hide whether regressions are in retrieval or generation.

Q: What's contextual retrieval?

A technique introduced by Anthropic in 2024 that prepends an LLM-generated context summary (50-100 tokens) to each chunk before embedding, then retrieves with a BM25 + embeddings hybrid. Anthropic reports substantial retrieval-quality improvements, and the pattern has been widely adopted as a drop-in upgrade for naive chunking.
