Geodocs.dev

What is query fan-out? Optimizing multi-query retrieval for RAG



Query fan-out is the technique of generating multiple sub-queries from one user question, retrieving for each in parallel, and merging the results before generation. It increases recall on broad, ambiguous, or multi-hop queries but multiplies cost and latency by the fan-out factor. Tune fan-out with a router that triggers it only when needed, a capped fan-out factor, reciprocal rank fusion to merge candidates, and explicit recall + faithfulness measurement.

TL;DR

Fan-out helps when one query cannot capture everything the user is asking. It hurts when you apply it indiscriminately: cost, latency, and reranker work all scale with the number of sub-queries. The right policy is a router that decides per query whether to fan out, a capped fan-out factor (typically 3-5), parallel retrieval with cached embeddings, dedupe + reciprocal rank fusion to merge results, and a small evaluation set that measures answer recall and citation faithfulness with and without fan-out enabled.

Definition

Query fan-out is a retrieval pattern where the system rewrites or decomposes a single user query into multiple sub-queries, runs retrieval on each in parallel, and merges the results into a single candidate set used by the answer generator. The number of sub-queries is the fan-out factor.

Variants you will see in the literature:

  • Query rewriting (paraphrase fan-out). Generate paraphrased versions of the same query (LangChain MultiQueryRetriever).
  • Query decomposition. Break a multi-hop question into atomic sub-questions, each retrievable independently.
  • HyDE (Hypothetical Document Embeddings). Generate a hypothetical answer first, embed that, and retrieve.
  • Query2Doc. Generate pseudo-documents that expand the query before embedding.
  • Multi-source fan-out. Same query, different stores or indices (vector, BM25, knowledge graph, web).

Google AI Mode publicly describes a fan-out + multi-step retrieval architecture for complex queries; the same general pattern is now standard in production RAG.
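The simplest of these variants, paraphrase fan-out, can be sketched in a few lines. This is a minimal illustration, not any particular library's API; the `llm` callable is a stand-in for whatever chat-model client you use, assumed to return one paraphrase per line.

```python
from typing import Callable


def paraphrase_fanout(query: str, llm: Callable[[str], str], k: int = 4) -> list[str]:
    """Generate up to k sub-queries: the original plus k-1 paraphrases."""
    prompt = (
        f"Rewrite the following search query {k - 1} different ways, "
        f"one per line, preserving its meaning:\n{query}"
    )
    paraphrases = [line.strip() for line in llm(prompt).splitlines() if line.strip()]
    # The original query is always sub-query #1; cap the total at k.
    return [query] + paraphrases[: k - 1]
```

Each returned sub-query is then embedded and retrieved independently, in parallel.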

When to fan out (and when not to)

Fan-out is not free. Use it only when the query is likely to under-retrieve without it.

Fan out when:

  • The query contains conjunctions or comparisons ("compare X and Y", "differences between A and B").
  • The query has multiple sub-questions implicitly ("what is X, why does it matter, and how do I implement it").
  • The query is ambiguous and could match different document subspaces (acronyms, polysemous terms).
  • The corpus has heterogeneous phrasing for the same concept and a single embedding misses synonyms.
  • The downstream answer must cite multiple sources (e.g., AI search answers that cite 3-8 distinct sources).

Skip fan-out when:

  • The query is a single, well-formed factual lookup ("capital of France", "price of SKU 1234").
  • The corpus is small and dense; one query usually retrieves the relevant chunks.
  • Latency budget is tight (e.g., chat-style sub-second SLAs without async streaming).
  • You already have a strong reranker that handles paraphrase variation well.

A cheap router — a small classifier or a few-shot LLM call — makes the per-query decision and is usually itself a sub-100 ms operation.
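Before reaching for an LLM router, a heuristic version is often enough to start. The markers and thresholds below are illustrative assumptions, not tuned values; treat this as a placeholder you later replace with a trained classifier or few-shot call.

```python
import re

# Crude stand-in for the small classifier / few-shot LLM router.
# Marker list and length threshold are illustrative, not tuned.
BROAD_MARKERS = re.compile(
    r"\b(compare|versus|vs\.?|difference|between|and|why|how)\b",
    re.IGNORECASE,
)


def should_fan_out(query: str) -> bool:
    """Return True when the query looks broad, comparative, or multi-part."""
    words = query.split()
    if len(words) <= 4:  # short lookups ("capital of France") rarely need fan-out
        return False
    return bool(BROAD_MARKERS.search(query))
```

A heuristic like this runs in microseconds; the value of swapping in a learned router is better handling of implicit multi-part questions that contain none of the surface markers.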

The cost and latency math

For a fan-out factor k:

  • Embedding cost ≈ k × single-query embedding cost (mostly negligible for small models).
  • Retrieval cost ≈ k × single-retrieval cost; the k calls are parallelisable, so effective latency ≈ the slowest single retrieval, not the sum.
  • Rewrite/decomposition cost ≈ 1 LLM call to produce all k sub-queries (~100-400 ms with a small/fast model).
  • Reranker cost scales with the merged candidate-set size: often ~k × baseline before deduplication, less after.
  • Generation cost is unchanged in tokens, but context-quality changes.

Rule of thumb: fan-out at k=3-5 adds ~150-400 ms of end-to-end latency (rewrite + parallel retrieval + reranking) and roughly 1.5-3× the retrieval-side cost. Anything above k=8 is rarely justified by quality gains.
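The rule of thumb above can be turned into a back-of-envelope estimator. All defaults below are illustrative assumptions (a ~250 ms rewrite call, ~80 ms retrieval, ~1 ms of reranker work per candidate, ~30% duplicates); plug in your own measurements.

```python
def fanout_latency_ms(
    k: int,
    rewrite_ms: float = 250.0,        # one LLM call producing all k sub-queries
    retrieval_ms: float = 80.0,       # parallel, so counted once
    rerank_ms_per_doc: float = 1.0,   # cross-encoder cost per candidate
    docs_per_query: int = 20,         # top-K fetched per sub-query
    dedup_ratio: float = 0.7,         # fraction of candidates surviving dedupe
) -> float:
    """Rough added end-to-end latency for fan-out factor k.

    Retrieval runs in parallel across sub-queries, so it contributes a
    single retrieval's latency; reranking scales with the deduped set.
    """
    merged_docs = k * docs_per_query * dedup_ratio
    return rewrite_ms + retrieval_ms + merged_docs * rerank_ms_per_doc
```

With these assumed defaults, k=4 lands near the upper end of the 150-400 ms rule of thumb, and the reranker term is what grows as k rises, which is why deduplication before reranking matters.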

Dedupe and merge

With multiple sub-queries you will see overlapping candidates. The naive merge (concat + take top-N by score) underweights documents that appear in multiple sub-queries' results, which is exactly the wrong signal.

Use one of:

  • Reciprocal rank fusion (RRF). Score each document by Σ 1 / (c + rank_i) across sub-query result lists, where c is a smoothing constant (commonly 60), not the fan-out factor. Robust, parameter-light, widely used.
  • Score normalisation + sum. Min-max normalise per-list scores, then sum across sub-queries.
  • Reranker over the union. Take the union, deduplicate by chunk ID, send to a cross-encoder reranker, take top-N.

Deduplicate before reranking, not after. Reranker cost scales with candidate-set size; running it on duplicates wastes compute.
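RRF with built-in deduplication fits in a dozen lines. A sketch, assuming each retriever returns a ranked list of chunk IDs:

```python
from collections import defaultdict


def rrf_merge(result_lists: list[list[str]], c: int = 60, top_n: int = 10) -> list[str]:
    """Merge per-sub-query ranked lists with reciprocal rank fusion.

    c is the RRF smoothing constant (60 is the common default), distinct
    from the fan-out factor. Duplicate chunk IDs are fused into a single
    entry, so a chunk retrieved by several sub-queries gets boosted.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranked in result_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Because duplicates are fused before any reranking, the cross-encoder only ever sees each chunk once.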

Routing strategy

A production-grade fan-out pipeline has at least three decisions encoded as a router:

  1. Should we fan out at all? Binary classifier or few-shot LLM, output: yes/no.
  2. What kind of fan-out? Paraphrase, decomposition, HyDE, multi-source.
  3. What fan-out factor? Typically a fixed cap (e.g., 4) with the actual count chosen by the rewriter.

Keep these decisions inspectable. Log the router's verdict and the generated sub-queries on every call so you can audit failures and tune thresholds.
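One lightweight way to keep the three decisions inspectable is a structured log record per call. The field names below are illustrative, not a prescribed schema:

```python
import json
import logging
from dataclasses import asdict, dataclass

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("fanout.router")


@dataclass
class RouterDecision:
    """One routing verdict, logged on every call for later auditing."""
    query: str
    fan_out: bool
    variant: str   # "paraphrase" | "decomposition" | "hyde" | "multi_source"
    factor: int    # actual sub-query count, <= the configured cap


def record(decision: RouterDecision, sub_queries: list[str]) -> None:
    # One JSON line per call: greppable and joinable with eval data.
    log.info(json.dumps({**asdict(decision), "sub_queries": sub_queries}))
```

Emitting the verdict and the generated sub-queries together means a bad answer can be traced back to either a wrong routing decision or a weak rewrite.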

Measuring fan-out impact

Do not ship fan-out without measurement; the failure modes are subtle.

  • Answer recall@k. On a test set with known gold sources, what fraction of those sources appears in the merged candidate set with and without fan-out?
  • Citation faithfulness. When the system cites a source, is the cited claim actually supported by that source? Tools like RAGAS, FACTS Grounding, and HHEM provide automated scoring; manual review is still required at the margins.
  • Answer-quality delta. Compare final answers on the same questions with and without fan-out enabled; expect single-digit-percent gains on broad queries and minor regressions on simple lookups.
  • Latency p50 and p95. Fan-out is a tail-latency feature; the p95 matters more than the p50.
  • Cost per answer. Track total $ per answered query; fan-out should pay for itself in answer-quality KPIs.

A practical evaluation cadence: weekly on a 100-300 query golden set, with the router enabled and disabled side by side.
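The recall side of that cadence is simple to automate. A sketch, assuming a golden set of (query, gold-source-ID) pairs and two retrieval functions that return ranked lists of source IDs:

```python
def answer_recall(gold_sources: set[str], retrieved: list[str]) -> float:
    """Fraction of gold sources present in the merged candidate set."""
    if not gold_sources:
        return 1.0
    return len(gold_sources & set(retrieved)) / len(gold_sources)


def compare_recall(eval_set, retrieve_plain, retrieve_fanout):
    """Mean recall without and with fan-out over a golden set of
    (query, gold_source_ids) pairs."""
    plain = [answer_recall(gold, retrieve_plain(q)) for q, gold in eval_set]
    fan = [answer_recall(gold, retrieve_fanout(q)) for q, gold in eval_set]
    return sum(plain) / len(plain), sum(fan) / len(fan)
```

Faithfulness scoring needs an LLM judge or a tool like RAGAS and is harder to inline here; recall alone already catches the most common fan-out regressions.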

Common pitfalls

  • Fanning out everything. The single biggest cost mistake; route, don't blanket-apply.
  • Setting k too high. Quality gains plateau quickly; cost and latency do not.
  • No deduplication. Repeated chunks crowd the context window and inflate reranker cost.
  • Ignoring multi-hop structure. For multi-hop questions, paraphrase fan-out is weaker than decomposition; pick the right variant.
  • No router visibility. If you cannot see the sub-queries the rewriter produced, you cannot debug bad answers.
  • Reranker without fan-out tuning. A strong reranker masks fan-out problems until you change models; build the eval harness early.

Reference policy

A defensible default policy for production RAG:

  • Route every query through a binary classifier; fan out only when it predicts the query is broad, ambiguous, or multi-hop.
  • Cap fan-out factor at 4. Allow the rewriter to emit fewer sub-queries when the query is tight.
  • Run sub-queries in parallel against the same vector index; add a BM25 sub-query for entity-heavy questions.
  • Deduplicate by chunk ID, then merge with reciprocal rank fusion (smoothing constant 60).
  • Send the top 30-60 deduped chunks to a cross-encoder reranker; pass the top 6-10 to the generator.
  • Log: original query, router verdict, sub-queries, per-sub-query top-K, merged top-N, cited sources.
  • Evaluate weekly with answer recall and citation faithfulness on a frozen golden set.
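The policy above can be pinned down as a single frozen config object, so every knob lives in one reviewable place. Field names are illustrative; the values mirror the reference policy:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FanoutPolicy:
    """Default fan-out policy; values mirror the reference policy above."""
    max_fanout: int = 4            # hard cap on sub-queries
    rrf_constant: int = 60         # RRF smoothing constant
    rerank_candidates: int = 60    # deduped chunks sent to the cross-encoder
    generator_top_n: int = 8       # chunks passed to the generator
    bm25_for_entities: bool = True # add a BM25 sub-query for entity-heavy questions
    log_fields: tuple = (
        "query", "router_verdict", "sub_queries",
        "per_subquery_top_k", "merged_top_n", "cited_sources",
    )
```

Freezing the dataclass makes the policy immutable at runtime, which keeps eval runs comparable week to week.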

FAQ

Q: How is query fan-out different from query expansion?

Query expansion historically meant adding synonyms or related terms to a single query. Fan-out runs retrieval multiple times with multiple distinct queries and merges the results, which generally outperforms simple expansion on broad or multi-hop questions.

Q: Does fan-out help with hallucination?

Indirectly. Better recall reduces the chance the generator answers from parametric knowledge alone. But fan-out can also retrieve more weakly relevant chunks, so faithfulness scoring and a strong reranker matter at least as much.

Q: When should we use HyDE instead of paraphrase fan-out?

HyDE works well when the query is short and the corpus is rich in detail; embedding a hypothetical answer surfaces semantically similar long-form chunks. Use paraphrase fan-out when the query already contains the entities and you mainly need to handle wording variation.
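HyDE's control flow is small enough to show directly. A sketch with the model client, embedder, and vector search left as stand-in callables (none of these names refer to a real library):

```python
from typing import Callable, Sequence


def hyde_retrieve(
    query: str,
    llm: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float]], list[str]],
) -> list[str]:
    """HyDE: embed a hypothetical answer instead of the raw query.

    The generated passage need not be factually correct; it only has
    to land near the relevant documents in embedding space.
    """
    hypothetical = llm(f"Write a short passage answering: {query}")
    return search(embed(hypothetical))
```

In a fan-out pipeline, a HyDE retrieval can simply run as one more parallel sub-query alongside the paraphrases.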

Q: Does fan-out require an LLM at retrieval time?

Usually yes, for the rewriter or decomposer. You can amortise cost with a small fast model and aggressive caching of common sub-queries.

Q: How does this interact with ranking in AI Overviews?

Google AI Mode publicly uses a fan-out + multi-step retrieval pattern for complex queries, which is one reason content that supports multi-hop questions tends to surface across more cited sources. Tuning your own RAG with fan-out approximates that behaviour for internal applications.

