RAG chunking strategies compared: fixed, semantic, and hybrid chunking

Fixed-size chunking is fast and simple but can break sentences mid-thought. Semantic chunking splits at meaning boundaries and improves retrieval coherence at the cost of latency and embedding spend. Hybrid chunking layers structural rules with semantic checks and is typically the best default for mixed corpora.

TL;DR: Start with fixed-size chunking (e.g., 256-512 tokens, 10-20% overlap) to get a working baseline. Move to semantic or hybrid chunking when you measure retrieval quality and find your fixed chunks splitting answers across boundaries.

Why chunking matters

In a Retrieval-Augmented Generation (RAG) pipeline, chunking happens during indexing: documents are split into smaller passages, embedded, and stored in a vector database. Retrieval pulls the top-k chunks for a query, and those chunks become the context for the LLM. If chunks are too small, the model lacks context. If they are too large, retrieval becomes coarse and the model wastes tokens. The chunking strategy directly shapes both retrieval precision and answer quality.
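
To make that concrete, here is a minimal index-then-retrieve sketch in Python. The embed function is an assumption standing in for any embedding model, and the in-memory arrays stand in for a real vector database.

    import numpy as np

    def build_index(chunks, embed):
        # Indexing: embed each chunk once and keep the vectors beside the text.
        # `embed` is assumed to map a list of strings to an (n, d) array;
        # it is a placeholder, not a specific library's API.
        vecs = np.asarray(embed(chunks), dtype=float)
        vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize
        return chunks, vecs

    def retrieve_top_k(query, index, embed, k=5):
        # Retrieval: rank chunks by cosine similarity to the query embedding.
        chunks, vecs = index
        q = np.asarray(embed([query]), dtype=float)[0]
        q /= np.linalg.norm(q)
        order = np.argsort(vecs @ q)[::-1]  # highest similarity first
        return [chunks[i] for i in order[:k]]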

The three strategies at a glance

  • Fixed-size. Splits on equal token, character, or word counts. Pros: trivial to implement, fast, predictable cost. Cons: can cut sentences, no respect for document structure. Typical chunk size: 256-512 tokens.
  • Semantic. Splits at sentence or topic boundaries via embeddings. Pros: coherent passages, better recall on dense answers. Cons: slower, embedding cost at index time. Typical chunk size: variable, ~200-800 tokens.
  • Hybrid. Structural splits, then semantic refinement. Pros: robust across heterogeneous documents. Cons: more moving parts, needs tuning. Typical chunk size: variable, ~256-768 tokens.

Fixed-size chunking

Fixed-size chunking divides text into equally sized pieces measured in tokens, characters, or words, often with an overlap (commonly 10-20%) so context isn't lost at boundaries.
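
A minimal sketch, using whitespace-separated words as a stand-in for tokens; a production version would count real tokens with the embedding model's tokenizer.

    def fixed_size_chunks(text, chunk_size=384, overlap_ratio=0.15):
        # Slide a fixed-size window over the text; adjacent windows share
        # overlap_ratio of their length so boundary context is not lost.
        words = text.split()  # word count as a crude token proxy
        stride = max(1, int(chunk_size * (1 - overlap_ratio)))
        chunks = []
        for start in range(0, len(words), stride):
            chunks.append(" ".join(words[start:start + chunk_size]))
            if start + chunk_size >= len(words):  # final window reached the end
                break
        return chunks

The stride (chunk size minus overlap) is the only real knob here; everything else is bookkeeping.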

Use when:

  • You're building a baseline.
  • Documents are uniform (chat logs, support tickets, transcripts).
  • You need predictable indexing cost and latency.

Watch out for: mid-sentence cuts and split lists. The Weaviate and Pinecone reference posts both note this as the most common quality problem with fixed-size chunking.

Semantic chunking

Semantic chunking embeds short windows (often individual sentences) and creates a new chunk wherever the embedding distance between adjacent sentences exceeds a threshold; a minimal sketch of this base approach follows the variant list. Variants include:

  • Percentile-based — split where similarity drops below a percentile of the sentence-pair distribution.
  • Double-pass — first pass clusters sentences; second pass merges small or off-topic clusters.
  • Proposition-based — an LLM rewrites text into atomic propositions, which are then grouped.
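
A minimal sketch of the base percentile-based approach, assuming the text is already split into a sentences list and that embed maps a list of strings to a NumPy array of embeddings (both are assumptions, not a specific library's API).

    import numpy as np

    def semantic_chunks(sentences, embed, percentile=25):
        # Start a new chunk wherever adjacent-sentence similarity drops
        # below the given percentile of all adjacent-pair similarities.
        if len(sentences) < 2:
            return [" ".join(sentences)] if sentences else []
        vecs = np.asarray(embed(sentences), dtype=float)
        vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize
        sims = (vecs[:-1] * vecs[1:]).sum(axis=1)  # cosine of adjacent pairs
        threshold = np.percentile(sims, percentile)
        chunks, current = [], [sentences[0]]
        for sent, sim in zip(sentences[1:], sims):
            if sim < threshold:  # topic boundary: close the current chunk
                chunks.append(" ".join(current))
                current = [sent]
            else:
                current.append(sent)
        chunks.append(" ".join(current))
        return chunks

Lowering the percentile produces fewer, larger chunks; raising it produces more, smaller ones, which is where the tiny-chunk failure mode below comes from.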

Use when:

  • Documents mix narrative and reference material.
  • Answers tend to span multiple sentences and need to stay together.
  • Indexing budget allows the extra embedding calls.

Watch out for: thresholds that produce too many tiny chunks, and high indexing cost at scale.

Hybrid chunking

Hybrid chunking layers two passes:

  1. Structural pass — split on document structure (headings, paragraphs, list boundaries, page breaks).
  2. Semantic pass — within each structural chunk, merge or split based on embedding similarity.

Many teams add a third pass: size guardrails that re-split any chunk above a maximum token count and merge any chunk below a minimum.
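
A minimal sketch of the structural pass plus size guardrails, splitting on blank lines as a crude stand-in for real structure detection (headings, lists, page breaks). It reuses fixed_size_chunks from the earlier sketch; a semantic pass could additionally run semantic_chunks inside each oversized block.

    def hybrid_chunks(text, min_tokens=64, max_tokens=512):
        # Pass 1: structural split on blank lines (paragraph boundaries).
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        chunks = []
        for para in paragraphs:
            n = len(para.split())  # word count as a token proxy
            if n > max_tokens:  # guardrail: re-split oversized blocks
                chunks.extend(fixed_size_chunks(para, chunk_size=max_tokens))
            elif chunks and n < min_tokens:  # guardrail: merge undersized blocks
                chunks[-1] = chunks[-1] + "\n\n" + para
            else:
                chunks.append(para)
        return chunks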

Use when:

  • Corpus mixes long docs (PDFs, articles) and short docs (FAQs, knowledge base articles).
  • You can afford a small upfront tuning effort.
  • Retrieval quality matters more than indexing speed.

NVIDIA's evaluation of chunking strategies for RAG found that respecting natural document boundaries (e.g., page-level chunking) was a strong default; hybrid strategies generalize this idea by combining boundaries with semantic checks.

Chunk overlap

Regardless of strategy, overlap copies the last N tokens of one chunk into the start of the next so context isn't lost at the boundary. A 10-20% overlap is a common starting point. Independent retrieval-quality tests (including community benchmarks shared on r/Rag) report meaningful recall lifts from adding modest overlap, especially for dense retrievers.
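
As a worked example: with 512-token chunks and 15% overlap, each chunk repeats the last ~77 tokens (512 × 0.15) of its predecessor, so the effective stride between chunk starts is roughly 435 tokens.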

How to evaluate which strategy wins

Build a small test harness (a minimal scoring sketch follows the list):

  1. Hold-out QA pairs — 50-200 questions with known correct answers from your corpus.
  2. Run each chunking strategy through the same retriever and reranker.
  3. Measure precision@k, recall@k, and end-to-end answer accuracy.
  4. Track indexing time and embedding cost.
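
A minimal scorer for steps 1-3, assuming retrieve(question) returns ranked chunk IDs from the index built with the strategy under test, and each QA pair carries the set of chunk IDs known to contain its answer (both are assumptions about your harness, not a specific framework's API).

    def precision_recall_at_k(retrieved, relevant, k=5):
        # Score one query: fraction of the top-k that are relevant (precision)
        # and fraction of relevant chunks that made the top-k (recall).
        hits = sum(1 for cid in retrieved[:k] if cid in relevant)
        return hits / k, (hits / len(relevant) if relevant else 0.0)

    def evaluate_strategy(name, qa_pairs, retrieve, k=5):
        # Average precision@k / recall@k over the held-out QA pairs.
        scores = [precision_recall_at_k(retrieve(q), rel, k) for q, rel in qa_pairs]
        p = sum(s[0] for s in scores) / len(scores)
        r = sum(s[1] for s in scores) / len(scores)
        print(f"{name}: precision@{k}={p:.3f}  recall@{k}={r:.3f}")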

The right strategy is the one that maximizes answer accuracy at acceptable cost — not the one with the most sophisticated splitter.

Common mistakes

  • Tuning chunk size without overlap — overlap usually moves the needle more than chunk size.
  • Going semantic too early — a tuned fixed-size baseline often beats a poorly tuned semantic chunker.
  • Ignoring document structure — even simple HTML heading splits beat blind token windows.
  • Skipping the eval harness — without QA pairs, you're tuning blind.

FAQ

Q: What's a good default chunk size?

256-512 tokens with 10-20% overlap. Tune from there based on retrieval evaluation.

Q: Should I always use semantic chunking?

No. Semantic chunking shines on heterogeneous corpora and dense answers, but it costs more to build and is overkill for short, uniform documents like FAQ entries.

Q: How does chunk overlap interact with chunk size?

Larger chunks need less overlap because context is already inside the chunk. Smaller chunks need more overlap to keep cross-boundary meaning. Treat them as a pair of knobs to tune together.

Q: Does the embedding model affect chunking?

Yes. Higher-context embedding models tolerate longer chunks; smaller models perform best on chunks well below their max input length.

Q: Is 'page-level chunking' a fourth strategy?

It can be considered a structural variant of hybrid chunking — split on page or major section boundaries first, then optionally refine. NVIDIA's published evaluation noted it as a strong default for document-style corpora.
