Geodocs.dev

What is query fan-out? Optimizing multi-query retrieval for RAG



Query fan-out is the technique of generating multiple sub-queries from one user question, retrieving for each in parallel, and merging the results before generation. It increases recall on broad, ambiguous, or multi-hop queries but multiplies cost and latency by the fan-out factor. Tune fan-out with a router that triggers it only when needed, a capped fan-out factor, reciprocal rank fusion to merge candidates, and explicit recall + faithfulness measurement.

TL;DR

Fan-out helps when one query cannot capture everything the user is asking. It hurts when you apply it indiscriminately: cost, latency, and reranker work all scale with the number of sub-queries. The right policy is a router that decides per query whether to fan out, a capped fan-out factor (typically 3-5), parallel retrieval with cached embeddings, dedupe + reciprocal rank fusion to merge results, and a small evaluation set that measures answer recall and citation faithfulness with and without fan-out enabled.

Definition

Query fan-out is a retrieval pattern where the system rewrites or decomposes a single user query into multiple sub-queries, runs retrieval on each in parallel, and merges the results into a single candidate set used by the answer generator. The number of sub-queries is the fan-out factor.

Variants you will see in the literature:

  • Query rewriting (paraphrase fan-out). Generate paraphrased versions of the same query (LangChain MultiQueryRetriever).
  • Query decomposition. Break a multi-hop question into atomic sub-questions, each retrievable independently.
  • HyDE (Hypothetical Document Embeddings). Generate a hypothetical answer first, embed that, and retrieve.
  • Query2Doc. Generate pseudo-documents that expand the query before embedding.
  • Multi-source fan-out. Same query, different stores or indices (vector, BM25, knowledge graph, web).

Google AI Mode publicly describes a fan-out + multi-step retrieval architecture for complex queries; the same general pattern is now standard in production RAG.
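The simplest of these variants, paraphrase fan-out, can be sketched in a few lines. This is a minimal illustration, not any particular library's API; the `llm` callable is a stand-in for whatever chat-model client you use, assumed to return one paraphrase per line.

```python
from typing import Callable


def paraphrase_fanout(query: str, llm: Callable[[str], str], k: int = 4) -> list[str]:
    """Generate up to k sub-queries: the original plus k-1 paraphrases."""
    prompt = (
        f"Rewrite the following search query {k - 1} different ways, "
        f"one per line, preserving its meaning:\n{query}"
    )
    paraphrases = [line.strip() for line in llm(prompt).splitlines() if line.strip()]
    # The original query is always sub-query #1; cap the total at k.
    return [query] + paraphrases[: k - 1]
```

Each returned sub-query is then embedded and retrieved independently, in parallel.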

When to fan out (and when not to)

Fan-out is not free. Use it only when the query is likely to under-retrieve without it.

Fan out when:

  • The query contains conjunctions or comparisons ("compare X and Y", "differences between A and B").
  • The query has multiple sub-questions implicitly ("what is X, why does it matter, and how do I implement it").
  • The query is ambiguous and could match different document subspaces (acronyms, polysemous terms).
  • The corpus has heterogeneous phrasing for the same concept and a single embedding misses synonyms.
  • The downstream answer must cite multiple sources (e.g., AI search answers that cite 3-8 distinct sources).

Skip fan-out when:

  • The query is a single, well-formed factual lookup ("capital of France", "price of SKU 1234").
  • The corpus is small and dense; one query usually retrieves the relevant chunks.
  • Latency budget is tight (e.g., chat-style sub-second SLAs without async streaming).
  • You already have a strong reranker that handles paraphrase variation well.

A cheap router — a small classifier or a few-shot LLM call — makes the per-query decision and is usually itself a sub-100 ms operation.
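Before reaching for an LLM router, a heuristic version is often enough to start. The markers and thresholds below are illustrative assumptions, not tuned values; treat this as a placeholder you later replace with a trained classifier or few-shot call.

```python
import re

# Crude stand-in for the small classifier / few-shot LLM router.
# Marker list and length threshold are illustrative, not tuned.
BROAD_MARKERS = re.compile(
    r"\b(compare|versus|vs\.?|difference|between|and|why|how)\b",
    re.IGNORECASE,
)


def should_fan_out(query: str) -> bool:
    """Return True when the query looks broad, comparative, or multi-part."""
    words = query.split()
    if len(words) <= 4:  # short lookups ("capital of France") rarely need fan-out
        return False
    return bool(BROAD_MARKERS.search(query))
```

A heuristic like this runs in microseconds; the value of swapping in a learned router is better handling of implicit multi-part questions that contain none of the surface markers.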

The cost and latency math

For a fan-out factor k:

  • Embedding cost ≈ k × single-query embedding cost (mostly negligible for small models).
  • Retrieval cost ≈ k × single-retrieval cost; the k calls are parallelisable, so effective latency ≈ the slowest single retrieval, not the sum.
  • Rewrite/decomposition cost ≈ 1 LLM call to produce all k sub-queries (~100-400 ms with a small/fast model).
  • Reranker cost scales with the merged candidate-set size: often ~k × baseline before deduplication, less after.
  • Generation cost is unchanged in tokens, but context-quality changes.

Rule of thumb: fan-out at k=3-5 adds ~150-400 ms of end-to-end latency (rewrite + parallel retrieval + reranking) and roughly 1.5-3× the retrieval-side cost. Anything above k=8 is rarely justified by quality gains.
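The rule of thumb above can be turned into a back-of-envelope estimator. All defaults below are illustrative assumptions (a ~250 ms rewrite call, ~80 ms retrieval, ~1 ms of reranker work per candidate, ~30% duplicates); plug in your own measurements.

```python
def fanout_latency_ms(
    k: int,
    rewrite_ms: float = 250.0,        # one LLM call producing all k sub-queries
    retrieval_ms: float = 80.0,       # parallel, so counted once
    rerank_ms_per_doc: float = 1.0,   # cross-encoder cost per candidate
    docs_per_query: int = 20,         # top-K fetched per sub-query
    dedup_ratio: float = 0.7,         # fraction of candidates surviving dedupe
) -> float:
    """Rough added end-to-end latency for fan-out factor k.

    Retrieval runs in parallel across sub-queries, so it contributes a
    single retrieval's latency; reranking scales with the deduped set.
    """
    merged_docs = k * docs_per_query * dedup_ratio
    return rewrite_ms + retrieval_ms + merged_docs * rerank_ms_per_doc
```

With these assumed defaults, k=4 lands near the upper end of the 150-400 ms rule of thumb, and the reranker term is what grows as k rises, which is why deduplication before reranking matters.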

Dedupe and merge

With multiple sub-queries you will see overlapping candidates. The naive merge (concat + take top-N by score) underweights documents that appear in multiple sub-queries' results, which is exactly the wrong signal.

Use one of:

  • Reciprocal rank fusion (RRF). Score each document by Σ 1 / (c + rank_i) across sub-query result lists, where c is a smoothing constant (commonly 60), not the fan-out factor. Robust, parameter-light, widely used.
  • Score normalisation + sum. Min-max normalise per-list scores, then sum across sub-queries.
  • Reranker over the union. Take the union, deduplicate by chunk ID, send to a cross-encoder reranker, take top-N.

Deduplicate before reranking, not after. Reranker cost scales with candidate-set size; running it on duplicates wastes compute.
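RRF with built-in deduplication fits in a dozen lines. A sketch, assuming each retriever returns a ranked list of chunk IDs:

```python
from collections import defaultdict


def rrf_merge(result_lists: list[list[str]], c: int = 60, top_n: int = 10) -> list[str]:
    """Merge per-sub-query ranked lists with reciprocal rank fusion.

    c is the RRF smoothing constant (60 is the common default), distinct
    from the fan-out factor. Duplicate chunk IDs are fused into a single
    entry, so a chunk retrieved by several sub-queries gets boosted.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranked in result_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Because duplicates are fused before any reranking, the cross-encoder only ever sees each chunk once.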

Routing strategy

A production-grade fan-out pipeline has at least three decisions encoded as a router:

  1. Should we fan out at all? Binary classifier or few-shot LLM, output: yes/no.
  2. What kind of fan-out? Paraphrase, decomposition, HyDE, multi-source.
  3. What fan-out factor? Typically a fixed cap (e.g., 4) with the actual count chosen by the rewriter.

Keep these decisions inspectable. Log the router's verdict and the generated sub-queries on every call so you can audit failures and tune thresholds.
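One lightweight way to keep the three decisions inspectable is a structured log record per call. The field names below are illustrative, not a prescribed schema:

```python
import json
import logging
from dataclasses import asdict, dataclass

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("fanout.router")


@dataclass
class RouterDecision:
    """One routing verdict, logged on every call for later auditing."""
    query: str
    fan_out: bool
    variant: str   # "paraphrase" | "decomposition" | "hyde" | "multi_source"
    factor: int    # actual sub-query count, <= the configured cap


def record(decision: RouterDecision, sub_queries: list[str]) -> None:
    # One JSON line per call: greppable and joinable with eval data.
    log.info(json.dumps({**asdict(decision), "sub_queries": sub_queries}))
```

Emitting the verdict and the generated sub-queries together means a bad answer can be traced back to either a wrong routing decision or a weak rewrite.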

Measuring fan-out impact

Do not ship fan-out without measurement; the failure modes are subtle.

  • Answer recall@k. On a test set with known gold sources, what fraction of those sources appears in the merged candidate set with and without fan-out?
  • Citation faithfulness. When the system cites a source, is the cited claim actually supported by that source? Tools like RAGAS, FACTS Grounding, and HHEM provide automated scoring; manual review is still required at the margins.
  • Answer-quality delta. Compare final answers on the same questions with and without fan-out enabled; expect single-digit-percent gains on broad queries and minor regressions on simple lookups.
  • Latency p50 and p95. Fan-out is a tail-latency feature; the p95 matters more than the p50.
  • Cost per answer. Track total $ per answered query; fan-out should pay for itself in answer-quality KPIs.

A practical evaluation cadence: weekly on a 100-300 query golden set, with the router enabled and disabled side by side.
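The recall side of that cadence is simple to automate. A sketch, assuming a golden set of (query, gold-source-ID) pairs and two retrieval functions that return ranked lists of source IDs:

```python
def answer_recall(gold_sources: set[str], retrieved: list[str]) -> float:
    """Fraction of gold sources present in the merged candidate set."""
    if not gold_sources:
        return 1.0
    return len(gold_sources & set(retrieved)) / len(gold_sources)


def compare_recall(eval_set, retrieve_plain, retrieve_fanout):
    """Mean recall without and with fan-out over a golden set of
    (query, gold_source_ids) pairs."""
    plain = [answer_recall(gold, retrieve_plain(q)) for q, gold in eval_set]
    fan = [answer_recall(gold, retrieve_fanout(q)) for q, gold in eval_set]
    return sum(plain) / len(plain), sum(fan) / len(fan)
```

Faithfulness scoring needs an LLM judge or a tool like RAGAS and is harder to inline here; recall alone already catches the most common fan-out regressions.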

Common pitfalls

  • Fanning out everything. The single biggest cost mistake; route, don't blanket-apply.
  • Setting k too high. Quality gains plateau quickly; cost and latency do not.
  • No deduplication. Repeated chunks crowd the context window and inflate reranker cost.
  • Ignoring multi-hop structure. For multi-hop questions, paraphrase fan-out is weaker than decomposition; pick the right variant.
  • No router visibility. If you cannot see the sub-queries the rewriter produced, you cannot debug bad answers.
  • Reranker without fan-out tuning. A strong reranker masks fan-out problems until you change models; build the eval harness early.

Reference policy

A defensible default policy for production RAG:

  • Route every query through a binary classifier; fan out only when it predicts the query is broad, ambiguous, or multi-hop.
  • Cap fan-out factor at 4. Allow the rewriter to emit fewer sub-queries when the query is tight.
  • Run sub-queries in parallel against the same vector index; add a BM25 sub-query for entity-heavy questions.
  • Deduplicate by chunk ID, then merge with reciprocal rank fusion (smoothing constant 60).
  • Send the top 30-60 deduped chunks to a cross-encoder reranker; pass the top 6-10 to the generator.
  • Log: original query, router verdict, sub-queries, per-sub-query top-K, merged top-N, cited sources.
  • Evaluate weekly with answer recall and citation faithfulness on a frozen golden set.
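The policy above can be pinned down as a single frozen config object, so every knob lives in one reviewable place. Field names are illustrative; the values mirror the reference policy:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FanoutPolicy:
    """Default fan-out policy; values mirror the reference policy above."""
    max_fanout: int = 4            # hard cap on sub-queries
    rrf_constant: int = 60         # RRF smoothing constant
    rerank_candidates: int = 60    # deduped chunks sent to the cross-encoder
    generator_top_n: int = 8       # chunks passed to the generator
    bm25_for_entities: bool = True # add a BM25 sub-query for entity-heavy questions
    log_fields: tuple = (
        "query", "router_verdict", "sub_queries",
        "per_subquery_top_k", "merged_top_n", "cited_sources",
    )
```

Freezing the dataclass makes the policy immutable at runtime, which keeps eval runs comparable week to week.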

FAQ

Q: How is query fan-out different from query expansion?

Query expansion historically meant adding synonyms or related terms to a single query. Fan-out runs retrieval multiple times with multiple distinct queries and merges the results, which generally outperforms simple expansion on broad or multi-hop questions.

Q: Does fan-out help with hallucination?

Indirectly. Better recall reduces the chance the generator answers from parametric knowledge alone. But fan-out can also retrieve more weakly relevant chunks, so faithfulness scoring and a strong reranker matter at least as much.

Q: When should we use HyDE instead of paraphrase fan-out?

HyDE works well when the query is short and the corpus is rich in detail; embedding a hypothetical answer surfaces semantically similar long-form chunks. Use paraphrase fan-out when the query already contains the entities and you mainly need to handle wording variation.
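HyDE's control flow is small enough to show directly. A sketch with the model client, embedder, and vector search left as stand-in callables (none of these names refer to a real library):

```python
from typing import Callable, Sequence


def hyde_retrieve(
    query: str,
    llm: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float]], list[str]],
) -> list[str]:
    """HyDE: embed a hypothetical answer instead of the raw query.

    The generated passage need not be factually correct; it only has
    to land near the relevant documents in embedding space.
    """
    hypothetical = llm(f"Write a short passage answering: {query}")
    return search(embed(hypothetical))
```

In a fan-out pipeline, a HyDE retrieval can simply run as one more parallel sub-query alongside the paraphrases.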

Q: Does fan-out require an LLM at retrieval time?

Usually yes, for the rewriter or decomposer. You can amortise cost with a small fast model and aggressive caching of common sub-queries.

Q: How does this interact with ranking in AI Overviews?

Google AI Mode publicly uses a fan-out + multi-step retrieval pattern for complex queries, which is one reason content that supports multi-hop questions tends to surface across more cited sources. Tuning your own RAG with fan-out approximates that behaviour for internal applications.

