What is query fan-out? Optimizing multi-query retrieval for RAG
Query fan-out is the technique of generating multiple sub-queries from one user question, retrieving for each in parallel, and merging the results before generation. It increases recall on broad, ambiguous, or multi-hop queries but multiplies cost and latency by the fan-out factor. Tune fan-out with a router that triggers it only when needed, a capped fan-out factor, reciprocal rank fusion to merge candidates, and explicit recall + faithfulness measurement.
TL;DR
Fan-out helps when one query cannot capture everything the user is asking. It hurts when you apply it indiscriminately: cost, latency, and reranker work all scale with the number of sub-queries. The right policy is a router that decides per query whether to fan out, a capped fan-out factor (typically 3-5), parallel retrieval with cached embeddings, dedupe + reciprocal rank fusion to merge results, and a small evaluation set that measures answer recall and citation faithfulness with and without fan-out enabled.
Definition
Query fan-out is a retrieval pattern where the system rewrites or decomposes a single user query into multiple sub-queries, runs retrieval on each in parallel, and merges the results into a single candidate set used by the answer generator. The number of sub-queries is the fan-out factor.
Variants you will see in the literature:
- Query rewriting (paraphrase fan-out). Generate paraphrased versions of the same query (LangChain MultiQueryRetriever).
- Query decomposition. Break a multi-hop question into atomic sub-questions, each retrievable independently.
- HyDE (Hypothetical Document Embeddings). Generate a hypothetical answer first, embed that, and retrieve.
- Query2Doc. Generate pseudo-documents that expand the query before embedding.
- Multi-source fan-out. Same query, different stores or indices (vector, BM25, knowledge graph, web).
Google AI Mode publicly describes a fan-out + multi-step retrieval architecture for complex queries; the same general pattern is now standard in production RAG.
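In code, the pattern is small. Below is a minimal sketch with a toy word-overlap scorer standing in for real embedding search; CORPUS, retrieve, and fan_out_retrieve are illustrative names, not any specific library's API.

```python
import asyncio

# Toy in-memory corpus; a real system would query a vector index.
CORPUS = {
    "chunk-paris": "Paris is the capital of France.",
    "chunk-fanout": "Query fan-out generates multiple sub-queries per user question.",
    "chunk-rrf": "Reciprocal rank fusion merges ranked lists from several retrievers.",
}

async def retrieve(query: str, top_k: int = 20) -> list[tuple[str, float]]:
    # Hypothetical scorer (word overlap) standing in for embed-and-search.
    q_words = set(query.lower().split())
    scored = [
        (chunk_id, float(len(q_words & set(text.lower().split()))))
        for chunk_id, text in CORPUS.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

async def fan_out_retrieve(user_query: str, subqueries: list[str]) -> list[list[tuple[str, float]]]:
    # One retrieval per query, all in flight concurrently: effective
    # latency is the slowest retrieval, not the sum.
    queries = [user_query, *subqueries]
    return list(await asyncio.gather(*(retrieve(q) for q in queries)))

result_lists = asyncio.run(fan_out_retrieve(
    "how does query fan-out merge results",
    ["what is query fan-out", "how does reciprocal rank fusion work"],
))
```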
When to fan out (and when not to)
Fan-out is not free. Use it only when the query is likely to under-retrieve without it.
Fan out when:
- The query contains conjunctions or comparisons ("compare X and Y", "differences between A and B").
- The query has multiple sub-questions implicitly ("what is X, why does it matter, and how do I implement it").
- The query is ambiguous and could match different document subspaces (acronyms, polysemous terms).
- The corpus has heterogeneous phrasing for the same concept and a single embedding misses synonyms.
- The downstream answer must cite multiple sources (e.g., AI search answers that cite 3-8 distinct sources).
Skip fan-out when:
- The query is a single, well-formed factual lookup ("capital of France", "price of SKU 1234").
- The corpus is small and dense; one query usually retrieves the relevant chunks.
- Latency budget is tight (e.g., chat-style sub-second SLAs without async streaming).
- You already have a strong reranker that handles paraphrase variation well.
A cheap router (a small classifier or a few-shot LLM call) makes the per-query decision and typically adds under 100 ms itself.
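As an illustration, here is a rule-based stand-in for that router; the trigger patterns and thresholds are invented for the example, and a production router would be a trained classifier or a few-shot LLM call.

```python
import re

# Illustrative trigger patterns for multi-intent or comparison queries.
MULTI_INTENT = re.compile(
    r"\b(compare|versus|vs|differences? between|and how|and why)\b", re.I
)

def should_fan_out(query: str) -> bool:
    # Conjunctions/comparisons, multiple question marks, or long queries
    # are weak signals of broad or multi-hop intent.
    if MULTI_INTENT.search(query):
        return True
    if query.count("?") > 1:
        return True
    return len(query.split()) > 15  # illustrative length threshold

assert should_fan_out("compare pgvector and Elasticsearch for RAG")
assert not should_fan_out("capital of France")
```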
The cost and latency math
For a fan-out factor k:
- Embedding cost ≈ k × single-query embedding cost (mostly negligible for small models).
- Retrieval cost ≈ k × single-retrieval cost, parallelisable across k workers, so effective latency ≈ the slowest single retrieval rather than k × the single-retrieval latency.
- Rewrite/decomposition cost ≈ 1 LLM call to produce all k sub-queries (~100-400 ms with a small/fast model).
- Reranker cost scales with the merged candidate-set size: often ~k × baseline before deduplication, less after.
- Generation cost is unchanged in tokens (you still pass a fixed top-N of chunks), but the quality of that context changes.
Rule of thumb: fan-out at k=3-5 adds ~150-400 ms of end-to-end latency (rewrite + parallel retrieval + reranking) and roughly 1.5-3× the retrieval-side cost. Anything above k=8 is rarely justified by quality gains.
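A worked example of that math, under illustrative numbers:

```python
# Illustrative numbers only; measure your own stack.
k = 4                       # fan-out factor
rewrite_ms = 250            # one LLM call producing all k sub-queries
retrieval_ms = 80           # per query; sub-queries run in parallel
rerank_ms_per_doc = 2.0     # cross-encoder cost per candidate
per_query_candidates = 20
dedup_survival = 0.6        # fraction of the union left after dedupe

baseline = retrieval_ms + per_query_candidates * rerank_ms_per_doc
merged = int(k * per_query_candidates * dedup_survival)
fanned = rewrite_ms + retrieval_ms + merged * rerank_ms_per_doc
print(f"baseline {baseline:.0f} ms, fan-out {fanned:.0f} ms, delta {fanned - baseline:.0f} ms")
# baseline 120 ms, fan-out 426 ms, delta 306 ms
```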
Dedupe and merge
With multiple sub-queries you will see overlapping candidates. The naive merge (concatenate all lists and take the top-N by score) ignores the fact that a document appearing in several sub-queries' results is strong evidence of relevance; cross-list agreement is exactly the signal you want to reward.
Use one of:
- Reciprocal rank fusion (RRF). Score each document by Σ 1 / (k + rank_i) across the sub-query result lists, where k is a smoothing constant (commonly 60, not the fan-out factor) and rank_i is the document's rank in list i. Robust, parameter-light, widely used; a minimal implementation appears below.
- Score normalisation + sum. Min-max normalise per-list scores, then sum across sub-queries.
- Reranker over the union. Take the union, deduplicate by chunk ID, send to a cross-encoder reranker, take top-N.
Deduplicate before reranking, not after. Reranker cost scales with candidate-set size; running it on duplicates wastes compute.
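A minimal RRF merge matching the formula above; k = 60 is the common smoothing constant, and top_n and the toy lists are illustrative.

```python
from collections import defaultdict

def rrf_merge(result_lists: list[list[str]], k: int = 60, top_n: int = 30) -> list[str]:
    # k is the RRF smoothing constant (not the fan-out factor).
    # Each input list is a ranked list of chunk IDs, best first.
    # A chunk appearing in several lists accumulates score, which is the
    # agreement signal the naive concat-and-sort merge throws away.
    # Dedupe is implicit: each chunk ID gets exactly one fused score.
    scores: dict[str, float] = defaultdict(float)
    for ranked in result_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

merged = rrf_merge([
    ["doc-a", "doc-b", "doc-c"],
    ["doc-b", "doc-d"],
    ["doc-b", "doc-a"],
])
# merged[0] == "doc-b": it appears in all three lists.
```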
Routing strategy
A production-grade fan-out pipeline has at least three decisions encoded as a router:
- Should we fan out at all? Binary classifier or few-shot LLM, output: yes/no.
- What kind of fan-out? Paraphrase, decomposition, HyDE, multi-source.
- What fan-out factor? Typically a fixed cap (e.g., 4) with the actual count chosen by the rewriter.
Keep these decisions inspectable. Log the router's verdict and the generated sub-queries on every call so you can audit failures and tune thresholds.
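One lightweight way to do that, sketched with a dataclass whose field names are illustrative:

```python
import json
import logging
from dataclasses import dataclass, asdict, field

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("fanout.router")

@dataclass
class RouterDecision:
    # Illustrative schema; adapt fields to your pipeline.
    query: str
    fan_out: bool                    # decision 1: fan out at all?
    strategy: str = "none"           # decision 2: paraphrase | decompose | hyde | multi_source
    subqueries: list[str] = field(default_factory=list)  # decision 3: count <= cap

decision = RouterDecision(
    query="compare pgvector and Elasticsearch for RAG",
    fan_out=True,
    strategy="decompose",
    subqueries=["pgvector strengths for RAG", "Elasticsearch strengths for RAG"],
)
log.info(json.dumps(asdict(decision)))  # one auditable line per call
```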
Measuring fan-out impact
Do not ship fan-out without measurement; the failure modes are subtle.
- Answer recall@k. On a test set with known gold sources, what fraction of those sources appears in the merged candidate set with and without fan-out?
- Citation faithfulness. When the system cites a source, is the cited claim actually supported by that source? Tools like RAGAS, FACTS Grounding, and HHEM provide automated scoring; manual review is still required at the margins.
- Recall delta by query class. Compare the same questions with and without fan-out enabled; expect single-digit-percent gains on broad queries and minor regressions on simple lookups.
- Latency p50 and p95. Fan-out is a tail-latency feature; the p95 matters more than the p50.
- Cost per answer. Track total $ per answered query; fan-out should pay for itself in answer-quality KPIs.
A practical evaluation cadence: weekly on a 100-300 query golden set, with the router enabled and disabled side by side.
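The recall computation itself is a few lines; the golden-set plumbing around it is the real work. A sketch, with gold_sources and the two candidate lists as assumed inputs:

```python
def answer_recall(gold_sources: set[str], retrieved_ids: list[str]) -> float:
    # Fraction of gold sources present in the merged candidate set.
    if not gold_sources:
        return 1.0  # vacuous: nothing to find
    return len(gold_sources & set(retrieved_ids)) / len(gold_sources)

# Illustrative golden-set row: run the pipeline twice, router on and off.
gold = {"doc-a", "doc-d"}
recall_off = answer_recall(gold, ["doc-a", "doc-b", "doc-c"])          # 0.5
recall_on = answer_recall(gold, ["doc-a", "doc-b", "doc-d", "doc-c"])  # 1.0
print(f"fan-out recall lift: {recall_on - recall_off:+.2f}")
```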
Common pitfalls
- Fanning out everything. The single biggest cost mistake; route, don't blanket-apply.
- Setting k too high. Quality gains plateau quickly; cost and latency do not.
- No deduplication. Repeated chunks crowd the context window and inflate reranker cost.
- Ignoring multi-hop structure. For multi-hop questions, paraphrase fan-out is weaker than decomposition; pick the right variant.
- No router visibility. If you cannot see the sub-queries the rewriter produced, you cannot debug bad answers.
- Reranker without fan-out tuning. A strong reranker masks fan-out problems until you change models; build the eval harness early.
Reference policy
A defensible default policy for production RAG:
- Route every query through a binary classifier; fan out only when it predicts the query is broad, ambiguous, or multi-hop.
- Cap fan-out factor at 4. Allow the rewriter to emit fewer sub-queries when the query is tight.
- Run sub-queries in parallel against the same vector index; add a BM25 sub-query for entity-heavy questions.
- Deduplicate by chunk ID, then merge with reciprocal rank fusion (k=60).
- Send the top 30-60 deduped chunks to a cross-encoder reranker; pass the top 6-10 to the generator.
- Log: original query, router verdict, sub-queries, per-sub-query top-K, merged top-N, cited sources.
- Evaluate weekly with answer recall and citation faithfulness on a frozen golden set.
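The same policy expressed as a configuration sketch; the key names are illustrative and every value is a starting point, not a benchmark result.

```python
FANOUT_POLICY = {
    # Mirrors the reference policy above; tune all values against your eval set.
    "router": {"enabled": True, "triggers": ["broad", "ambiguous", "multi_hop"]},
    "max_fanout": 4,                     # rewriter may emit fewer
    "bm25_subquery_for_entities": True,  # lexical leg for entity-heavy questions
    "merge": {"method": "rrf", "rrf_k": 60, "dedupe_key": "chunk_id"},
    "rerank": {"candidates": 60, "pass_to_generator": 8},
    "logging": ["query", "router_verdict", "subqueries",
                "per_subquery_top_k", "merged_top_n", "cited_sources"],
    "eval": {"cadence": "weekly", "metrics": ["answer_recall", "citation_faithfulness"]},
}
```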
FAQ
Q: How is query fan-out different from query expansion?
Query expansion historically meant adding synonyms or related terms to a single query. Fan-out runs retrieval multiple times with multiple distinct queries and merges the results, which generally outperforms simple expansion on broad or multi-hop questions.
Q: Does fan-out help with hallucination?
Indirectly. Better recall reduces the chance the generator answers from parametric knowledge alone. But fan-out can also retrieve more weakly relevant chunks, so faithfulness scoring and a strong reranker matter at least as much.
Q: When should we use HyDE instead of paraphrase fan-out?
HyDE works well when the query is short and the corpus is rich in detail; embedding a hypothetical answer surfaces semantically similar long-form chunks. Use paraphrase fan-out when the query already contains the entities and you mainly need to handle wording variation.
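A minimal sketch of the HyDE step, with a canned string standing in for the LLM call:

```python
def generate_hypothetical_answer(query: str) -> str:
    # Stand-in for a small, fast LLM prompted with something like
    # "Write a short passage that plausibly answers: {query}".
    return ("Query fan-out decomposes a question into sub-queries, "
            "retrieves each in parallel, and fuses the results.")

def hyde_search_text(query: str) -> str:
    # HyDE: embed this pseudo-answer instead of the raw query, so the
    # search vector lands near long-form answer chunks in the corpus.
    return generate_hypothetical_answer(query)

# Downstream: embed(hyde_search_text(q)) and search with that vector,
# rather than embed(q).
```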
Q: Does fan-out require an LLM at retrieval time?
Usually yes, for the rewriter or decomposer. You can amortise cost with a small fast model and aggressive caching of common sub-queries.
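A sketch of that amortisation with a stdlib cache; rewrite_llm is a hypothetical stand-in for the rewriter call, and a production system would usually use a shared cache (e.g., Redis) rather than per-process memoisation.

```python
from functools import lru_cache

def rewrite_llm(query: str) -> list[str]:
    # Stand-in for the small, fast rewriter model.
    return [f"{query} (paraphrase {i})" for i in (1, 2, 3)]

@lru_cache(maxsize=100_000)
def _cached(normalised_query: str) -> tuple[str, ...]:
    return tuple(rewrite_llm(normalised_query))

def subqueries(query: str) -> tuple[str, ...]:
    # Normalise before the cache lookup so trivial variants of the same
    # question ("Compare X and Y", "compare x and y ") share one entry.
    return _cached(" ".join(query.lower().split()))
```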
Q: How does this interact with ranking in AI Overviews?
Google AI Mode publicly uses a fan-out + multi-step retrieval pattern for complex queries, which is one reason content that supports multi-hop questions tends to surface across more cited sources. Tuning your own RAG with fan-out approximates that behaviour for internal applications.