What Is Reranking for AI Search?
Reranking is a second retrieval stage that scores each query-document pair with a cross-encoder or LLM to reorder candidates retrieved by a faster first stage. It is a low-cost, high-leverage upgrade that consistently improves RAG citation accuracy, especially at small top-k values.
TL;DR
Reranking takes the top candidates from a fast retriever (vector or BM25) and reorders them with a more expensive but far more accurate model. It separates two jobs that single-stage retrievers struggle to balance: recall at scale and precision at the top of the list. Modern AI search stacks — ChatGPT search, Perplexity, Gemini Deep Research, enterprise RAG — almost always include a reranker because it directly raises the quality of the chunks an LLM grounds on.
Definition
A reranker is a model that, given a query and a list of candidate documents, returns a relevance score for each pair so the candidates can be reordered from most to least relevant. In a typical AI search pipeline, the reranker is applied after a first-stage retriever (dense vector search, BM25, or a hybrid) returns a candidate set of 50-200 documents. The reranker then refines this candidate set down to the 3-10 chunks that are sent to the LLM as grounded context.
The two most common architectures are cross-encoders and LLM-based rerankers. Cross-encoders concatenate the query and each document and pass the pair through a transformer to produce a single relevance score, allowing every attention head to attend across query and document tokens (Cohere Rerank docs, 2025). LLM rerankers prompt a large language model to score or order candidates directly. Both architectures share one key property: unlike bi-encoder embedding models, they evaluate the query and document jointly rather than independently.
Why it matters
Naive RAG pipelines that retrieve top-k chunks with a single embedding model and pass them straight to the LLM frequently underperform. The cause is geometric: bi-encoder embeddings compress an entire passage into one vector, so cosine similarity rewards broad topical overlap rather than precise relevance. Two passages with nearly identical vocabulary but opposite meanings can land near each other in the embedding space, because the single-vector representation captures topical word content far more faithfully than negation, word order, or other structure.
Reranking addresses this by re-evaluating candidates with a model that sees the query and document together. Public benchmarks and practitioner reports consistently show meaningful precision gains. NVIDIA's reranker benchmark, for example, found that adding a strong cross-encoder on top of a mid-sized embedding retriever lifted nDCG@10 by roughly 5-15 percentage points across BEIR-style datasets (NVIDIA, 2024). Production RAG case studies commonly report end-to-end answer accuracy moving from the low 60s to the 80s after introducing a reranker, though exact gains depend heavily on the base retriever and corpus.
For AI search specifically, the cost of irrelevant chunks is asymmetric. A single off-topic passage can pull a generative answer off course, cause a citation to point at the wrong source, or trigger hallucinated synthesis. Because most generative AI systems display only a handful of citations to the user, the precision of the top-3 to top-5 chunks matters far more than recall at top-100. That is exactly the regime in which reranking has the largest measurable impact.
How it works
Reranking is the second stage of a two-stage retrieval pattern. The first stage maximizes recall by returning many candidates cheaply; the second stage maximizes precision by carefully reordering those candidates.
```mermaid
flowchart LR
    Q["User query"] --> R1["Stage 1: Retriever<br/>BM25 / dense / hybrid"]
    R1 --> C["Top 50 to 200<br/>candidate chunks"]
    C --> R2["Stage 2: Reranker<br/>cross-encoder or LLM"]
    R2 --> T["Top 3 to 10<br/>reranked chunks"]
    T --> L["LLM grounding<br/>and citation"]
    L --> A["Answer"]
```

Inside the reranker, a cross-encoder typically formats each pair as [CLS] query [SEP] document [SEP], runs the pair through a transformer encoder, and outputs a single relevance score from a classification head over the [CLS] token (Cohere, 2025). Because the model attends jointly across the pair, it can detect fine-grained signals — such as whether the document actually answers the question or merely shares topical keywords — that bi-encoders fundamentally cannot.
A multi-vector variant called late interaction takes a different path. ColBERT encodes the query and document independently into matrices of token-level embeddings, then applies the MaxSim operator: for each query token, take its maximum similarity against all document tokens, and sum those maxima into the relevance score (Khattab & Zaharia, 2020). This recovers most of the cross-encoder's expressive power while staying close to bi-encoder efficiency, which makes ColBERT-style models attractive for first-stage retrieval and lightweight reranking at scale.
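A minimal numeric sketch of MaxSim, with random matrices standing in for the token-level embeddings a real ColBERT model would produce:

```python
# MaxSim sketch: score = sum over query tokens of the max dot product
# against any document token. Random matrices stand in for real
# ColBERT token embeddings (L2-normalized per token, as ColBERT does).
import numpy as np

rng = np.random.default_rng(0)
d = 128                             # per-token embedding dimension
Q = rng.standard_normal((8, d))     # 8 query tokens
D = rng.standard_normal((120, d))   # 120 document tokens

Q /= np.linalg.norm(Q, axis=1, keepdims=True)
D /= np.linalg.norm(D, axis=1, keepdims=True)

sim = Q @ D.T                   # (8, 120) token-to-token similarities
score = sim.max(axis=1).sum()   # best document token per query token
print(score)
```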
LLM-based rerankers replace the cross-encoder's classification head with a generative model that is prompted, often with few-shot examples, either to output a relevance score per document or to produce a ranked list directly. They tend to win on tasks that require complex reasoning over the candidate ("does this passage actually answer the user's question?") at the cost of higher latency and price.
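A sketch of the pointwise-scoring variant; call_llm is a hypothetical stand-in for whatever chat-completion client the stack uses, and the prompt wording is illustrative rather than any vendor's canonical template.

```python
# Pointwise LLM reranking sketch. call_llm is a hypothetical stand-in
# for any chat-completion client; the prompt template is illustrative.
PROMPT = """Rate how well the passage answers the question on a 0-10 scale.
Reply with a single integer only.

Question: {query}
Passage: {doc}
Score:"""

def llm_rerank(query: str, candidates: list[str], call_llm) -> list[str]:
    scored = []
    for doc in candidates:
        reply = call_llm(PROMPT.format(query=query, doc=doc))
        try:
            scored.append((int(reply.strip()), doc))
        except ValueError:
            scored.append((0, doc))  # unparseable reply scores as irrelevant
    scored.sort(key=lambda x: x[0], reverse=True)
    return [doc for _, doc in scored]
```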
Latency considerations shape every reranker deployment. Cross-encoders run inference on each query-document pair, so cost grows linearly with candidate count. Production stacks therefore tune three knobs: the size of the first-stage candidate set (typically 50-200), the size of the reranker model (small distilled models for low latency, larger ones for accuracy), and the number of chunks passed to the LLM after reranking (typically 3-10).
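A back-of-envelope view of why the candidate-set knob dominates cost; the token counts are illustrative assumptions, and real provider pricing varies:

```python
# Illustrative cost model: cross-encoders score every (query, document)
# pair, so token volume grows linearly with the candidate count.
def rerank_tokens_per_query(n_candidates: int,
                            query_tokens: int = 30,
                            chunk_tokens: int = 350) -> int:
    return n_candidates * (query_tokens + chunk_tokens)

# Widening the candidate set from 50 to 200 roughly quadruples rerank cost.
print(rerank_tokens_per_query(50))   # 19000
print(rerank_tokens_per_query(200))  # 76000
```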
Comparison vs related approaches
| Approach | Architecture | Latency per query | Best for |
|---|---|---|---|
| No rerank (top-k embeddings) | Bi-encoder only | Lowest | Simple FAQ, narrow corpora |
| Cross-encoder rerank | Joint query-doc transformer | Medium | General-purpose RAG, AI search |
| ColBERT / late interaction | Multi-vector token MaxSim | Medium-low | Large corpora, scalable rerank |
| LLM-based rerank | Prompted LLM scoring | Highest | Complex reasoning, agentic search |
| Hybrid search alone | Dense + lexical fusion | Low | Term-heavy queries (code, IDs) |
Cross-encoders dominate the practical middle of this spectrum. Compared to no reranking, they convert a noisy candidate list into a precision-ordered one, which directly raises top-3 citation quality. Compared to LLM rerankers, cross-encoders are typically several times cheaper per query at comparable accuracy (Voyage AI, 2025), making them viable for high-throughput search.
ColBERT and other late-interaction models occupy a useful middle ground: they recover much of the cross-encoder's accuracy at lower latency and can be used either as the primary retriever or as a lightweight reranker. LLM rerankers, by contrast, are the right call when the system already pays an LLM cost per query and when reasoning over candidates ("which of these explains the cause, not just the symptom?") is part of the relevance definition.
Hybrid search is complementary, not competitive. A typical modern stack runs hybrid first-stage retrieval (dense + BM25), then applies a cross-encoder or LLM reranker on the union.
Practical application
A reasonable default RAG-with-rerank pipeline looks like this (a code sketch of the retrieval and rerank steps follows the list):
- Index the corpus with chunked passages (200-500 tokens), an embedding model, and a BM25 inverted index.
- First-stage retrieval: run dense and lexical search in parallel for the user query and merge the top-100 candidates with reciprocal rank fusion.
- Rerank: pass the merged candidates to a cross-encoder reranker (Cohere Rerank, Voyage rerank-2.5, or BGE reranker v2) with the query.
- Truncate to the top-5 to top-10 reranked chunks.
- Ground: send the truncated context, with citations, to the LLM as the prompt for answer generation.
- Evaluate retrieval quality with offline labelled judgments (nDCG@10, Recall@k) and online metrics (citation click-through, thumbs-up/down feedback).
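A compact sketch of steps 2 through 4 under assumed interfaces: dense_search and bm25_search are hypothetical callables returning ranked document IDs, and the reranker is the same sentence-transformers CrossEncoder shown earlier. Reciprocal rank fusion uses the conventional k=60 constant.

```python
# Sketch: hybrid retrieval + reciprocal rank fusion + cross-encoder rerank.
# dense_search and bm25_search are hypothetical callables that return
# ranked lists of document IDs; docs maps IDs to chunk text.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rrf_merge(rankings: list[list[str]], k: int = 60,
              top_n: int = 100) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

def retrieve_and_rerank(query: str, docs: dict[str, str],
                        dense_search, bm25_search, top_k: int = 5):
    candidates = rrf_merge([dense_search(query), bm25_search(query)])
    pairs = [(query, docs[doc_id]) for doc_id in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [(doc_id, float(score)) for score, doc_id in ranked[:top_k]]
```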
Choosing a reranker comes down to four constraints. First, latency: hosted APIs typically add a few hundred milliseconds per query depending on candidate count and model size. Second, cost: API calls bill per token of query plus documents reranked, so candidate caps matter. Third, multilinguality: Cohere Rerank 4, Voyage rerank-2.5, BGE Reranker v2-m3, and Jina Reranker v2 all advertise multilingual training (Cohere, 2025; Voyage, 2025). Fourth, deployability: Cohere and Voyage are API-only by default, while BGE and Jina ship open weights for self-hosting.
Practical tuning patterns that pay off:
- Right-size the candidate set. Setting first-stage top-k to 50 instead of 10 typically raises recall enough that reranking has the headroom to deliver large precision gains.
- Cap reranker top-k by token budget. Pick the largest k that fits the LLM's context window without diluting attention.
- Use rerank scores, not just ranks. Many reranker APIs return calibrated relevance scores; thresholding lets the system gracefully degrade to "I don't know" when no candidate is strongly relevant (see the sketch after this list).
- Cache common queries. Reranking the same query against the same candidate set is deterministic and worth caching.
- Re-evaluate periodically. Rerankers and embedding models evolve quickly; re-benchmark every few quarters against your own labelled set.
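A minimal sketch of the score-thresholding fallback mentioned above; the 0.5 cutoff is an illustrative value to calibrate against your own labelled set, and rerank_results stands in for any reranker output of (doc_id, score) pairs.

```python
# Graceful degradation on weak rerank scores. The 0.5 threshold is
# illustrative; calibrate it against a labelled evaluation set.
RELEVANCE_THRESHOLD = 0.5

def select_context(rerank_results: list[tuple[str, float]], top_k: int = 5):
    """rerank_results: (doc_id, score) pairs sorted by descending score."""
    if not rerank_results or rerank_results[0][1] < RELEVANCE_THRESHOLD:
        return None  # signal the caller to answer "I don't know"
    return [doc_id for doc_id, score in rerank_results[:top_k]
            if score >= RELEVANCE_THRESHOLD]
```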
Examples
1. Cohere Rerank 3.5 / 4
Cohere's hosted Rerank API takes a query and a list of documents and returns a sorted list with relevance scores. The Rerank 4 family ships in pro and fast variants, supports 100+ languages with a single multilingual model, and is widely integrated into vector databases such as Weaviate, Pinecone, and OpenSearch (Cohere, 2025).
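A sketch of the hosted call based on Cohere's Python SDK as of this writing; the exact client class and model identifier may differ by SDK version, so check the current Cohere docs.

```python
# Hosted rerank call sketch (Cohere Python SDK; client class and model
# id may vary by SDK version -- verify against current Cohere docs).
import cohere

co = cohere.ClientV2(api_key="YOUR_API_KEY")
response = co.rerank(
    model="rerank-v3.5",
    query="How do I rotate API keys without downtime?",
    documents=["chunk one ...", "chunk two ...", "chunk three ..."],
    top_n=3,
)
for result in response.results:
    print(result.index, result.relevance_score)
```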
2. Voyage rerank-2.5 and rerank-2.5-lite
Voyage AI's rerank-2.5 series adds an instruction-following capability: callers can prepend natural-language instructions to the query (for example, "prefer documents from after 2024" or "favor primary sources") and the model adjusts its scoring accordingly (Voyage AI, 2025). Voyage's documentation emphasizes use after a Voyage embedding-based first stage, but the reranker is model-agnostic.
3. BGE Reranker v2
BAAI's BGE Reranker family is the most widely used open-weights cross-encoder line. The bge-reranker-v2-m3 checkpoint is multilingual, runs on commodity GPUs, and is the default open reranker in many self-hosted RAG stacks (BAAI, 2024). It scores well on standard benchmarks and is straightforward to fine-tune on small in-domain datasets.
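A self-hosting sketch using the FlagEmbedding package that BAAI ships alongside the checkpoints; the compute_score usage follows the library's documented pattern, but verify against the current README. The query-passage pairs are placeholders.

```python
# Self-hosted BGE reranker sketch via BAAI's FlagEmbedding package.
from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

# compute_score accepts a [query, passage] pair or a list of pairs.
pairs = [
    ["what is a panda?", "The giant panda is a bear species endemic to China."],
    ["what is a panda?", "Pandas is a Python library for data analysis."],
]
scores = reranker.compute_score(pairs)
print(scores)  # higher score = more relevant
```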
4. ColBERTv2 and Jina-ColBERT-v2
ColBERTv2 from Stanford FutureData and Jina-ColBERT-v2 implement late-interaction retrieval and reranking, exposing token-level multi-vector embeddings rather than a single relevance score (Santhanam et al., 2022; Jina AI, 2024). This makes them well-suited to large corpora where end-to-end neural reranking would be too slow.
5. NVIDIA NV-RerankQA-Mistral-4B-v3
NVIDIA's Mistral-based reranker is an example of an LLM-style reranker tuned for retrieval. In NVIDIA's own benchmark across financial, medical, and academic datasets, NV-RerankQA-Mistral-4B-v3 outperformed a strong embedding-only baseline by roughly 14% nDCG (NVIDIA, 2024). It illustrates the LLM-reranker tradeoff: substantially higher accuracy at substantially higher latency and compute.
6. Cross-encoder rerankers from Sentence Transformers (ms-marco)
The cross-encoder/ms-marco-MiniLM-L-12-v2 family is a long-running open baseline. It is small and fast; newer multilingual rerankers unsurprisingly outperform it, but it remains useful for self-hosted English-only stacks where latency budgets are tight.
Common mistakes
- Skipping reranking on "good enough" embeddings. Even strong embedding models like voyage-3 or text-embedding-3-large benefit from a reranker on small top-k. The mistake is treating embedding model upgrades as a substitute for reranking.
- Reranking too few candidates. If the first stage only returns 10 candidates and the gold passage is at rank 50, no reranker can recover it. Set first-stage top-k generously (50-200) before reranking.
- Ignoring score calibration. Treating rerankers purely as orderings discards a useful "no relevant document" signal. Threshold on the top reranker score to power graceful "I don't know" answers.
- Mismatched languages. Using an English-only reranker on a multilingual corpus silently degrades non-English queries. Pick a multilingual checkpoint when in doubt.
- Reranking unchunked or oversized passages. Cross-encoders truncate inputs at fixed token limits; passages longer than the model's max sequence length lose their tail. Chunk first, rerank chunks.
FAQ
Q: Is reranking the same thing as RAG?
No. RAG is the overall pattern of retrieving documents and grounding an LLM on them; reranking is one optional stage inside the retrieval half of RAG. A RAG system can work without a reranker, just usually less precisely.
Q: When should I add a reranker to my pipeline?
Add a reranker as soon as you have a working baseline and start measuring retrieval quality. If your top-3 chunks contain irrelevant passages a meaningful share of the time, a cross-encoder reranker is the highest-leverage next change you can make.
Q: Cross-encoder vs LLM reranker — which should I pick?
Default to a cross-encoder (Cohere Rerank, Voyage rerank-2.5, BGE Reranker v2). Move to an LLM reranker only when the relevance definition itself requires reasoning ("explain causes, not symptoms") and the per-query latency budget allows it.
Q: How many candidates should I send to the reranker?
50 to 200 candidates is the typical range. Below 50, the reranker has too little headroom to improve precision; above 200, latency and cost rise faster than accuracy. Tune empirically against a labelled evaluation set.
Q: Does reranking eliminate hallucinations?
It reduces but does not eliminate them. Reranking improves the relevance of the grounded chunks, which in turn lowers hallucination rates. But the LLM can still hallucinate if it ignores the context, if the context is missing the answer, or if the answer requires synthesis beyond the retrieved chunks.
Q: Can I fine-tune a reranker on my own data?
Yes, and small in-domain datasets (a few thousand labelled query-document pairs with relevance judgments) typically deliver large gains, especially for vertical domains such as legal, medical, or technical documentation. Open-weights rerankers like BGE and Jina expose standard fine-tuning recipes.
Q: How does reranking interact with hybrid search?
They are complementary. Hybrid search is a first-stage technique that combines lexical and dense retrieval to maximize recall; reranking is a second-stage technique that maximizes precision over the hybrid candidate set. Production stacks typically use both.
Q: Do AI search engines like Perplexity and ChatGPT use rerankers?
Public RAG architecture posts and engineering interviews from major AI search providers consistently describe two-stage retrieval with reranking, although the exact models and pipelines are proprietary. The pattern is now standard across enterprise and consumer AI search.
Related Articles
What Is Passage Retrieval?
Passage retrieval extracts the most relevant paragraph from a page to answer a query. Learn how it powers AI Overviews, citations, and AEO.
Agent Knowledge Base Specification: Structure, Refresh, and Versioning
Production specification for AI agent knowledge bases: document model, chunking strategies, metadata enrichment, refresh cadence, version pinning, and rollback.
Grounding vs Fact-Checking: What's the Difference in AI Content Workflows?
Grounding anchors AI answers to trusted sources before generation; fact-checking verifies claims after generation. Learn when each belongs in your AI content workflow.