What is Chunking for RAG
Chunking for RAG is the process of splitting source documents into smaller, retrievable units that are embedded, indexed, and pulled back into an LLM's context window to ground generated answers; the chosen strategy — fixed-size, recursive, semantic, sentence-window, or hierarchical — directly determines citation accuracy, recall, and answer faithfulness.
TL;DR
Chunking splits documents into retrievable units for retrieval-augmented generation. Common strategies are fixed-size, recursive character splitting, semantic, sentence-window, and hierarchical. There is no universal best chunk size; pick one by evaluating retrieval and answer quality on a labeled set, augment chunks with context where possible, and store metadata that lets the LLM cite back to the source.
Definition
Chunking is the preprocessing step in a retrieval-augmented generation (RAG) pipeline that splits source documents into smaller text units — chunks — each of which is embedded into a vector representation, indexed in a vector store, and later retrieved to ground a language model's response. A chunk is the atomic unit of retrieval: it is what the embedding model sees during indexing, what the similarity search returns, and what the LLM ultimately reads to answer a question.
Chunking sits between document loading and embedding in the standard RAG architecture documented by frameworks like LangChain and LlamaIndex. The strategy used determines what the retriever can find and what the generator can faithfully cite. Too small, and chunks lose the context needed to answer. Too large, and chunks dilute relevance signal and waste context window budget. Choosing well is one of the highest-leverage decisions in any RAG system.
Why it matters
Chunking is the single most consequential preprocessing choice in RAG. Three downstream behaviors depend on it:
- Retrieval recall. Embeddings are computed per chunk. If the answer to a question spans two chunks, neither chunk alone may match the query strongly enough to surface. Recall failures here are invisible to the user; the system simply omits relevant information.
- Citation accuracy. When an LLM grounds an answer in retrieved chunks, the chunk is the citation unit. If chunks span unrelated sections, citations become misleading. If chunks are too small, citations point at fragments that lack the context to verify the claim.
- Context window efficiency. Each retrieved chunk consumes context window tokens. Oversized chunks crowd out other relevant chunks. Undersized chunks force the system to retrieve more of them, increasing the risk of off-topic noise and lost-in-the-middle effects.
Chunking also interacts with reranking, hybrid search, and contextual retrieval. A poor chunking strategy cannot be fully fixed by any downstream step; a rerank stage cannot recover information that was never co-located in a chunk. This is why practitioner guides from Pinecone and the Chroma research team treat chunking evaluation as table-stakes.
How it works
A RAG pipeline runs roughly as follows, from indexing through retrieval. The chunker sits at the heart of it.
```mermaid
flowchart LR
A["Source documents"] --> B["Loader / parser"]
B --> C["Chunker"]
C --> D["Embedding model"]
D --> E["Vector store"]
F["User query"] --> G["Query embedding"]
G --> H["Similarity search"]
E --> H
H --> I["Reranker (optional)"]
I --> J["LLM with retrieved chunks"]
J --> K["Grounded answer + citations"]
```
The chunker takes parsed text and emits a sequence of Chunk objects, each carrying:
- text: the chunk content
- metadata: source document id, page or section number, position offsets, headings, and any extracted entities
- optional parent_id or section_id for hierarchical retrieval
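One way to picture this, as a rough sketch rather than any framework's actual schema, is a small Python dataclass whose fields mirror the list above; the field and metadata names are illustrative.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Chunk:
    """One retrievable unit of a source document (illustrative schema)."""
    text: str                                      # the content that gets embedded and retrieved
    metadata: dict = field(default_factory=dict)   # source id, page/section, offsets, headings
    parent_id: Optional[str] = None                # larger parent chunk for hierarchical retrieval
    section_id: Optional[str] = None               # section anchor for precise citations

chunk = Chunk(
    text="Chunking splits documents into retrievable units...",
    metadata={"source_url": "https://example.com/doc", "section_path": "Definition", "offset": 812},
)
```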
Four design axes determine chunker behavior:
- Boundary policy. Where chunks are allowed to break. Character offsets, token offsets, sentence boundaries, paragraph boundaries, semantic shifts, or document structure (headings, list items, code blocks).
- Chunk size. Target length, usually expressed in tokens (often 256-1,024) or characters. The sweet spot depends on the embedding model's context window and the typical question scope.
- Overlap. A small slice of content shared between consecutive chunks (commonly 10-20% of chunk size) that preserves continuity across boundaries and reduces missed-context failures.
- Augmentation. Optional metadata or context prepended to each chunk before embedding — document title, section heading, summary, or, in Anthropic's contextual retrieval pattern, a short contextual description (a sentence or two) that situates the chunk in the broader document.
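To make the size and overlap axes concrete, here is a deliberately naive fixed-size token chunker. It uses the tiktoken library purely for token counting; the function name and the 512/64 defaults are illustrative, and the splitter ignores sentence and paragraph boundaries entirely.

```python
import tiktoken  # OpenAI's tokenizer library; any tokenizer with encode/decode works

def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into ~chunk_size-token chunks, sharing `overlap` tokens between
    consecutive chunks. Assumes overlap < chunk_size. A baseline, not a recommendation."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
    return chunks
```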
Good chunkers respect document structure first and chunk size second. The recursive character splitter popularized by LangChain captures this idea: try paragraph breaks first, then sentence breaks, then word breaks, only falling back to hard character cuts when necessary.
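A minimal example of that fallback order using LangChain's splitter; the import path varies across LangChain versions, chunk_size is counted in characters here unless you build the splitter from a tokenizer, and the file name is a placeholder.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

document_text = open("parsed_document.txt").read()  # output of the loader/parser step

splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,       # measured in characters by default; use a tokenizer-based
    chunk_overlap=200,     #   constructor if you want token counts instead
    separators=["\n\n", "\n", ". ", " ", ""],  # try paragraphs, then sentences, then words
)
chunks = splitter.split_text(document_text)
```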
Comparison of chunking strategies
| Strategy | Boundary policy | Best for | Tradeoff |
|---|---|---|---|
| Fixed-size (character or token) | Hard cut every N units | Uniform corpora, baselines | Often breaks mid-sentence; weak structure awareness |
| Recursive character | Paragraph → sentence → word fallback | General-purpose, mixed content | Tunable; sensitive to separator choice |
| Semantic | Embedding-similarity boundaries | Long-form prose, narrative docs | Higher preprocessing cost; noisier with technical text |
| Sentence-window | One sentence per chunk + neighbor window at retrieval | Q&A over dense text | Requires window-aware retrieval logic |
| Hierarchical (parent-child) | Small chunks for retrieval, larger parents for context | Mixed-granularity questions | More complex storage and join logic |
| Document-structure | H1/H2/H3 sections, code blocks, tables | Technical docs, MDX, source code | Requires reliable parsing of source format |
| Contextual retrieval | Recursive + per-chunk context prefix | High-stakes citation accuracy | Adds LLM cost at indexing time |
Fixed-size and recursive splitters are the most common defaults. Semantic chunking has gained traction for narrative content. Sentence-window and hierarchical strategies are popular in LlamaIndex when answers commonly require both pinpoint precision and surrounding context. Anthropic's contextual retrieval, published in late 2024, layers chunk-level context prefixes on top of any base strategy and reports substantial reductions in retrieval failures on standard benchmarks.
Practical application
A defensible chunking choice follows a short, repeatable workflow. Treat chunking as an experiment, not a guess.
- Characterize the corpus. Document length distribution, structural cues (headings, lists, tables), language, and the presence of code or formulas. A corpus of long PDFs without headings asks for different choices than a corpus of well-structured Markdown.
- Characterize the queries. Are users asking pinpoint factual questions, multi-hop synthesis questions, or summary questions? Pinpoint questions reward smaller chunks plus reranking; synthesis questions reward larger chunks or hierarchical retrieval.
- Pick a baseline. Start with a recursive character splitter at 512 tokens with 10% overlap. This baseline covers most general-purpose use cases and gives you a reference to beat.
- Build an evaluation set. Curate 50-200 query-answer pairs with the gold passages identified. This set is the only honest way to compare strategies.
- Sweep parameters. Vary chunk size (256, 512, 1,024), overlap (0%, 10%, 20%), and boundary policy (recursive vs semantic vs sentence-window). Measure retrieval recall@k and end-to-end answer faithfulness.
- Layer augmentation. Once a baseline strategy wins, test contextual retrieval, parent-child retrieval, and metadata enrichment on top of it.
- Lock in metadata. Make sure every chunk carries source_url, section_path, and position so the LLM can cite back precisely.
- Re-evaluate quarterly. Embedding models, rerankers, and corpus content drift. Lock in a regression test so you know when your chunking strategy needs to change.
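As a sketch of steps 4 and 5, the core of the evaluation loop can be a simple recall@k sweep. The embed callable, corpus_text, eval_set, and the fixed_size_chunks helper from the earlier sketch are assumed inputs rather than any framework's API; a query counts as a hit if its gold passage appears inside one of the top-k retrieved chunks.

```python
from typing import Callable
import numpy as np

def recall_at_k(
    chunks: list[str],
    eval_set: list[dict],
    embed: Callable[[list[str]], np.ndarray],  # your embedding model; returns unit-normalized rows
    k: int = 5,
) -> float:
    """Fraction of queries whose gold passage appears in at least one top-k chunk.
    eval_set items are assumed to look like {"query": ..., "gold_passage": ...}."""
    chunk_vecs = embed(chunks)                  # shape (n_chunks, dim)
    hits = 0
    for item in eval_set:
        q_vec = embed([item["query"]])[0]
        scores = chunk_vecs @ q_vec             # cosine similarity for unit-normalized vectors
        top_idx = np.argsort(-scores)[:k]
        hits += any(item["gold_passage"] in chunks[i] for i in top_idx)
    return hits / len(eval_set)

# Sweep chunk size and overlap, then keep the configuration that also wins on
# end-to-end answer faithfulness, not just retrieval recall.
for size, overlap in [(256, 0), (256, 32), (512, 0), (512, 64), (1024, 128)]:
    chunks = fixed_size_chunks(corpus_text, chunk_size=size, overlap=overlap)
    print(f"size={size} overlap={overlap} recall@5={recall_at_k(chunks, eval_set, embed_fn):.2f}")
```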
For AI citation use cases (where the goal is reliable grounded references), prefer recursive or document-structure chunking with contextual retrieval, plus a reranker, over aggressive semantic chunking without evaluation.
Examples
Six concrete chunking patterns and where they shine:
- Help center articles, recursive 512/64. Articles already have structure (H2/H3 sections, ordered lists). A recursive splitter at 512 tokens with 64-token overlap respects paragraph boundaries and yields chunks that map cleanly to user questions. Citation accuracy is high because each chunk usually corresponds to a single instruction or concept.
- Long PDFs (legal, scientific), hierarchical parent-child. Index small child chunks (256 tokens) for retrieval precision, but return the parent chunk (1,024-2,048 tokens) when sending to the LLM. This pattern is built into LlamaIndex's HierarchicalNodeParser and answers both pinpoint and synthesis questions.
- Codebases, document-structure (function/class). Chunk source code by symbol — one chunk per function, class, or module — plus a header chunk per file. Embed with a code-aware embedding model. This pattern dominates code-search RAG because functions are the natural unit of meaning.
- Customer call transcripts, sentence-window. Each sentence is a chunk; at retrieval time, return the matched sentence plus three sentences before and after. This pattern excels at locating exact phrasing while still returning enough surrounding text for the LLM to interpret tone and outcome.
- Marketing or narrative blogs, semantic. When prose flows without strong structural cues, a semantic chunker that splits on embedding-similarity drops produces chunks aligned with topic shifts. Pair it with a reranker so off-topic neighbors do not crowd the context window.
- Mixed enterprise corpus (the most common case), recursive + contextual retrieval. Use a recursive splitter at 512/64 as the baseline and layer Anthropic's contextual retrieval to prepend each chunk with a one-sentence description generated by a cheap LLM. Reported gains in retrieval recall justify the indexing cost in most production deployments.
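A rough sketch of that contextual step using the Anthropic Messages API as the cheap LLM; the model name and prompt wording are illustrative, and Anthropic's published pattern also relies on prompt caching to keep the repeated document text affordable.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def contextualize(chunk_text: str, document_text: str) -> str:
    """Prepend a short LLM-written description situating the chunk in its document."""
    response = client.messages.create(
        model="claude-3-5-haiku-latest",   # any small, cheap model; name is illustrative
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": (
                "<document>\n" + document_text + "\n</document>\n\n"
                "Here is a chunk from that document:\n<chunk>\n" + chunk_text + "\n</chunk>\n\n"
                "Write one or two sentences situating this chunk within the overall document, "
                "to improve search retrieval of the chunk. Answer with only that context."
            ),
        }],
    )
    context = response.content[0].text.strip()
    return context + "\n\n" + chunk_text   # embed this augmented text instead of the raw chunk
```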
Common mistakes
- Picking a chunk size by intuition without an evaluation set. The default 512 tokens may be wrong for your corpus.
- Ignoring overlap. Zero overlap looks tidy but causes retrieval to miss boundary-spanning answers.
- Chunking before parsing structure. Splitting raw HTML or PDF text before extracting headings, tables, and code blocks throws away the strongest boundary signals.
- Forgetting metadata. Without source_url, section_path, and offsets, the LLM cannot cite back accurately even when retrieval works.
- Stripping headings. Removing H1/H2 text before chunking breaks the most useful semantic anchors and degrades retrieval recall.
- Treating chunking as set-and-forget. Embedding model upgrades and corpus drift change the optimal strategy; rerun the evaluation periodically.
- Aggressive semantic chunking without verification. Embedding-similarity boundaries can be noisy on technical or list-heavy text and sometimes underperform a plain recursive splitter.
FAQ
Q: What is the best chunk size for RAG?
There is no universal best size. A common starting point is 512 tokens with 10-20% overlap, but the right answer depends on your corpus, embedding model, and question scope. Always validate on a labeled evaluation set before committing.
Q: Should chunks always overlap?
Usually yes. A small overlap (10-20% of chunk size) reduces the chance that an answer spanning a chunk boundary becomes unfindable. Zero overlap is acceptable only when chunks are guaranteed to be self-contained, such as one function per chunk in code search.
Q: Is semantic chunking always better than fixed-size?
No. Semantic chunking helps with narrative prose but can underperform on technical, list-heavy, or tabular content. Treat it as one option among several and verify with evaluation.
Q: How does chunking affect citation accuracy?
Chunks are the citation unit. If a chunk mixes unrelated sections, citations become misleading; if a chunk is too small, citations point at fragments that cannot be independently verified. Cleaner boundaries and metadata produce cleaner citations.
Q: What is contextual retrieval?
Contextual retrieval, popularized by Anthropic in 2024, prepends each chunk with a one-sentence description that situates the chunk within the broader document before embedding. It tends to reduce retrieval failures relative to plain chunking, at the cost of additional indexing-time compute.
Q: Should I chunk tables and code blocks like text?
No. Tables should be preserved as a single chunk where possible, optionally with a paired summary chunk. Code blocks should be chunked at function or class boundaries with a code-aware embedding model.
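For Python source, the standard-library ast module is one way to get those function and class boundaries; a minimal sketch (other languages need their own parser, such as tree-sitter).

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    """One chunk per top-level function or class, plus a header chunk for
    everything else (imports, module docstring, constants)."""
    tree = ast.parse(source)
    chunks, header_nodes = [], []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment:
                chunks.append(segment)
        else:
            header_nodes.append(node)
    header = "\n".join(
        seg for n in header_nodes if (seg := ast.get_source_segment(source, n))
    )
    if header:
        chunks.insert(0, header)
    return chunks
```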
Q: How often should I re-evaluate my chunking strategy?
Quarterly at minimum, and whenever you upgrade the embedding model, change the reranker, or significantly expand the corpus. Keep the evaluation set under version control so regressions are visible.
Related Articles
What is Context Window Engineering
Context window engineering is the discipline of curating, ordering, and budgeting tokens in an LLM's context to maximize accuracy and minimize hallucinations.
What Is RAG (Retrieval-Augmented Generation)
RAG (retrieval-augmented generation) pairs a retriever and an LLM so answers are grounded in fresh, citable sources rather than the model's parametric memory alone.
What is Reranking for AI Search
Reranking refines retrieval results before grounding by scoring query-document pairs with a cross-encoder, sharply improving citation accuracy in RAG.