Source selection for grounding: ranking sources by trust, freshness, and specificity

A grounding source-selection framework scores every candidate source on three dimensions — trust, freshness, and specificity — applies hard exclusions, and combines the dimensions into a single weighted score that the retriever and reranker share. The framework raises evidence quality without inflating retrieval latency or token cost.

TL;DR

Do not pass every retrieved chunk to the LLM. Score each candidate source on trust (provenance, authority), freshness (recency relative to the topic's volatility), and specificity (chunk-to-query semantic match). Apply hard exclusions first, then rank by 0.4 · trust + 0.3 · freshness + 0.3 · specificity. Pass the top K to the model, where K is bounded by your token and latency budget.

What the framework is

Source selection for grounding is the layer between retrieval and generation that decides which retrieved sources actually flow into the LLM's context window. RAG retrieves; source selection chooses. Without it, retrievers return relevant-but-noisy candidates and the model picks the wrong one to ground its answer.

The framework has four parts:

  1. A source registry with metadata for trust and freshness.
  2. Hard exclusions that remove disqualified sources outright.
  3. A weighted score combining trust, freshness, and specificity.
  4. A budget rule that caps how many sources reach the model.

Why source selection matters

RAG hallucinations cluster around five failure modes; three of them — wrong evidence retrieved, conflicting evidence, and outdated evidence — are source-selection failures, not retrieval failures. Selecting better sources fixes those classes without changing the retriever or the model.

Source selection also controls cost. A well-tuned selector lets you run a smaller K (e.g. 3 sources instead of 10) at the same answer quality, which compounds across millions of queries.

The three dimensions

1. Trust

Trust measures how much weight you should give a source independent of the query. It is computed once per source and refreshed on a slow cadence (weeks).

Factors:

  • Provenance. Primary sources (vendor docs, schema.org, peer-reviewed papers, government data, your own canonical pages) > tier-1 industry blogs > general blogs > forums > unattributed text.
  • Author authority. Named author with a verifiable identity > pseudonymous > anonymous.
  • Editorial signal. Date stamps, citations, byline consistency, error-correction history.
  • Network signal. Inbound citations from other trusted sources.

Normalize trust to a 0-1 score per source. Most teams maintain 5-7 trust tiers and map them to fixed scores (1.0, 0.85, 0.7, 0.55, 0.4, 0.25, 0.1).
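
As a minimal sketch, the tier-to-score mapping can live as a plain lookup keyed by the registry's tier label. The tier names and field names below are illustrative assumptions, not a standard taxonomy:

    # Illustrative trust tiers mapped to fixed scores (names are assumptions).
    TRUST_TIERS = {
        "primary": 1.0,          # vendor docs, schema.org, peer-reviewed, government
        "tier1_industry": 0.85,
        "general_blog": 0.7,
        "community": 0.55,
        "forum": 0.4,
        "aggregator": 0.25,
        "unattributed": 0.1,
    }

    def trust_score(source: dict) -> float:
        """Registry-level trust score, cached per source and refreshed slowly."""
        return TRUST_TIERS.get(source.get("trust_tier", "unattributed"), 0.1)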

2. Freshness

Freshness measures how recent the source is, weighted by topic volatility:

  • High-volatility topics (pricing, model versions, leadership, security advisories): source older than 30 days is heavily penalized.
  • Medium-volatility topics (best practices, market data): source older than 12 months is penalized.
  • Low-volatility topics (definitions, historical facts): source can be years old without penalty.

A practical formula:

fresh = exp(-age_days / half_life_days)

where half_life_days is 30 for high volatility, 365 for medium, and 1825 (5 years) for low. The result is a 0-1 score that decays gracefully instead of dropping off a cliff at an arbitrary cutoff.
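
A minimal sketch of the decay, assuming the topic's volatility class is already known for each query (the dictionary keys are assumptions):

    import math

    # Half-lives in days for each volatility class, per the formula above.
    HALF_LIFE_DAYS = {"high": 30, "medium": 365, "low": 1825}

    def freshness_score(age_days: float, volatility: str = "medium") -> float:
        """fresh = exp(-age_days / half_life_days), clamped to non-negative age."""
        half_life = HALF_LIFE_DAYS.get(volatility, 365)
        return math.exp(-max(age_days, 0.0) / half_life)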

3. Specificity

Specificity measures how directly the source's chunk answers the query. Unlike trust and freshness, it is computed at retrieval time, not in the registry.

Factors:

  • Embedding similarity between the chunk and the query (cosine, normalized 0-1).
  • Lexical overlap of canonical entities (boost when entity ids match exactly).
  • Chunk type fit — a definition query should prefer a definition chunk; a comparison query should prefer a table or list chunk.
  • Reranker score from a cross-encoder pass on the top N candidates.

Use the reranker score as the primary specificity signal once you have one; embedding similarity is a fallback.
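
A sketch of that preference order, assuming the retriever attaches a reranker_score (from the cross-encoder pass) and an embedding_similarity already normalized to 0-1; both field names are assumptions:

    def specificity_score(candidate: dict) -> float:
        """Prefer the cross-encoder reranker score; fall back to embedding similarity."""
        if candidate.get("reranker_score") is not None:
            return candidate["reranker_score"]
        return candidate.get("embedding_similarity", 0.0)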

Hard exclusions (apply first)

Before scoring, drop any candidate that hits a hard exclusion. Hard exclusions are absolute, not weighted:

  • Domain blocklist (known low-quality, machine-generated, or fraud sites).
  • Stale beyond half-life × 5 for high-volatility topics.
  • Policy or pricing that conflicts with the canonical source of truth for the same property.
  • PII or licensed content the model is not allowed to surface.
  • Robots/legal flags from the source registry.

Hard exclusions exist because no amount of trust or specificity should let a fraud site or 6-year-old pricing page reach the model.
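
A sketch of the filter, reusing HALF_LIFE_DAYS from the freshness sketch above and assuming candidate metadata carries the flags below (the field names and blocklist entries are illustrative):

    BLOCKLIST = {"spam.example", "fraud.example"}  # placeholder domains

    def passes_hard_exclusions(candidate: dict, volatility: str) -> bool:
        """Absolute filters applied before any scoring."""
        if candidate.get("domain") in BLOCKLIST:
            return False
        if volatility == "high" and candidate.get("age_days", 0) > 5 * HALF_LIFE_DAYS["high"]:
            return False
        if candidate.get("conflicts_with_canonical"):
            return False
        if candidate.get("contains_pii") or candidate.get("license_restricted"):
            return False
        if candidate.get("robots_disallowed") or candidate.get("legal_flag"):
            return False
        return True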

The combined score

A simple, defensible weighting that works for most domains:

score = 0.4 · trust + 0.3 · freshness + 0.3 · specificity

The weights are tunable per domain. Heuristics:

  • Regulated domains (medical, legal, financial): increase trust to 0.55, decrease specificity to 0.2.
  • Volatile domains (security advisories, pricing): increase freshness to 0.45, decrease trust slightly.
  • Knowledge-base domains (technical references, definitions): increase specificity to 0.4, decrease freshness to 0.2.

Keep the weights summing to 1.0 so the score remains 0-1 and comparable across queries.
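
Expressed as code, with the per-domain presets from the heuristics above; where a heuristic only pins one or two weights, the remaining weight is adjusted here so each row still sums to 1.0 (those exact values are assumptions):

    # Weight presets; each row sums to 1.0.
    WEIGHTS = {
        "default":   {"trust": 0.40, "freshness": 0.30, "specificity": 0.30},
        "regulated": {"trust": 0.55, "freshness": 0.25, "specificity": 0.20},
        "volatile":  {"trust": 0.35, "freshness": 0.45, "specificity": 0.20},
        "knowledge": {"trust": 0.40, "freshness": 0.20, "specificity": 0.40},
    }

    def combined_score(trust: float, freshness: float, specificity: float,
                       domain: str = "default") -> float:
        w = WEIGHTS.get(domain, WEIGHTS["default"])
        return w["trust"] * trust + w["freshness"] * freshness + w["specificity"] * specificity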

The budget rule

Never pass all candidates above the score threshold. Cap K explicitly:

  • K = 3 for short answers (definition, FAQ).
  • K = 5-8 for analytical answers (comparison, framework).
  • K = 10+ only for research-mode tasks where latency is acceptable.

The budget rule has three benefits: it caps latency, it caps token cost, and it forces the framework to be discriminating rather than permissive.
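
As configuration, the cap can be a small lookup per answer mode; the mode names are assumptions, and the research-mode value is one arbitrary choice from the "10+" range:

    # Caps per answer mode; "research" is shown as 12 here but really means "10+".
    K_BUDGET = {"short": 3, "analytical": 8, "research": 12}

    def budget_k(mode: str) -> int:
        return K_BUDGET.get(mode, 3)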

Putting it together: the selection loop

For each query:

  1. Retrieve the top N (typically 30-50) candidates with the vector or hybrid retriever.
  2. Apply hard exclusions to the candidate list.
  3. Compute trust from the source registry; cache results.
  4. Compute freshness using the topic-appropriate half-life.
  5. Compute specificity with the reranker (or fallback to embedding similarity).
  6. Combine with the weighted formula.
  7. Sort descending; truncate to K.
  8. Pass to the model with provenance metadata so the answer can cite each source.

Log every selection decision (query id, candidate id, dimension scores, final score, included/excluded). The log is what makes the framework debuggable when a hallucination triage incident happens.
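
Tying the steps together, a sketch of the loop built on the helper functions from the sections above; the logged fields follow the list in the previous paragraph, and the candidate field names remain assumptions:

    import logging

    logger = logging.getLogger("source_selection")

    def select_sources(query_id: str, candidates: list[dict], volatility: str,
                       domain: str = "default", k: int = 3) -> list[dict]:
        """Exclude, score, sort, truncate to K, and log every decision."""
        survivors = [c for c in candidates if passes_hard_exclusions(c, volatility)]
        for c in survivors:
            c["trust"] = trust_score(c)
            c["freshness"] = freshness_score(c.get("age_days", 0.0), volatility)
            c["specificity"] = specificity_score(c)
            c["score"] = combined_score(c["trust"], c["freshness"], c["specificity"], domain)
        ranked = sorted(survivors, key=lambda c: c["score"], reverse=True)
        selected = ranked[:k]
        for c in ranked:
            logger.info(
                "query=%s candidate=%s trust=%.2f fresh=%.2f spec=%.2f score=%.2f included=%s",
                query_id, c.get("id"), c["trust"], c["freshness"],
                c["specificity"], c["score"], c in selected,
            )
        return selected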

Anti-patterns

  • Top-K by similarity only. Misses trust and freshness; common cause of confidently wrong answers.
  • Trust as binary include/exclude. Loses gradient; either too restrictive or too permissive.
  • Freshness cliffs. Hard cutoffs at "older than 1 year = drop" misclassify low-volatility topics.
  • Skipping the reranker. Retrieval similarity alone overweights surface lexical overlap.
  • Unbounded K. Drives cost and latency without proportional quality lift.

Calibration and monitoring

The weights are not set-and-forget. Calibrate quarterly:

  • Sample 50-200 queries per domain. Have a human rate the answer's faithfulness.
  • Vary weights and rerun the rated set. Pick the weight set that maximizes faithful answers at a fixed K (see the sketch after this list).
  • Monitor citation-rate drift on tracked queries; investigate any month-over-month drop > 10%.
  • Re-audit the source registry every quarter; trust scores decay if a source loses its editorial signal.
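
A sketch of that weight sweep as a coarse grid search; evaluate_faithfulness is a hypothetical helper standing in for whatever re-runs the rated query set and returns the share of faithful answers:

    from itertools import product

    def calibrate_weights(rated_queries: list, k: int = 3, step: float = 0.05) -> dict:
        """Grid-search weight triples that sum to 1.0 and keep the best-performing set."""
        grid = [round(i * step, 2) for i in range(int(1 / step) + 1)]
        best, best_rate = None, -1.0
        for trust_w, fresh_w in product(grid, grid):
            spec_w = round(1.0 - trust_w - fresh_w, 2)
            if spec_w < 0:
                continue
            weights = {"trust": trust_w, "freshness": fresh_w, "specificity": spec_w}
            rate = evaluate_faithfulness(rated_queries, weights, k)  # hypothetical helper
            if rate > best_rate:
                best, best_rate = weights, rate
        return best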

FAQ

Q: Why is trust weighted higher than freshness or specificity by default?

In most domains, a stale or slightly off-topic answer from a primary source is less harmful than a confident answer from a low-trust source. Trust is also the dimension users notice most when they audit citations. The default 0.4 / 0.3 / 0.3 split tilts the framework toward verifiable provenance while keeping freshness and specificity meaningful tie-breakers.

Q: How often should freshness scores be recomputed?

At retrieval time. Freshness depends on now, so caching it is risky for fast-moving topics. Trust scores can be cached for days to weeks; freshness should be computed inline.

Q: What is a reasonable K when latency matters?

K = 3 is a strong default for direct-answer use cases. Dropping below 3 starts costing answer quality on multi-fact questions; raising above 8 rarely improves quality but reliably increases token cost and latency.

Q: How do I handle two sources that contradict each other?

If both pass hard exclusions, prefer the one with the higher trust score. If trust ties, prefer the fresher source. If both tie, surface both to the model and instruct it to flag the conflict in the answer rather than picking silently. Conflicts are also a signal for hallucination triage.
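
A sketch of that tie-break order, reusing the per-candidate trust and freshness scores computed in the selection loop above:

    def resolve_conflict(a: dict, b: dict) -> list[dict]:
        """Both passed hard exclusions: higher trust wins, then freshness;
        a full tie returns both so the model can flag the conflict."""
        if a["trust"] != b["trust"]:
            return [a if a["trust"] > b["trust"] else b]
        if a["freshness"] != b["freshness"]:
            return [a if a["freshness"] > b["freshness"] else b]
        return [a, b]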

Q: Does this framework apply to public AI engines like ChatGPT and Perplexity?

Indirectly. Public engines run their own source-selection layers, but the framework still tells you how to make your pages score well as candidates: clear provenance, current dateModified, and chunk-level specificity (one canonical question per heading). Pages that score high on this rubric internally tend to be the same pages public engines cite externally.

Related Articles

  • Citation-ready page anatomy: structure that maximizes extractability (reference). Heading hierarchy, definition blocks, tables vs lists, and source placement that helps AI extract and cite your content.
  • RAG chunking strategies compared: fixed, semantic, and hybrid chunking (comparison). How fixed-size, semantic, and hybrid chunking work, when to use each, and how to evaluate retrieval quality.
  • Hallucination triage: a playbook for fixing incorrect AI answers fast (checklist). A step-by-step playbook to capture queries, classify failure modes, update content and evidence, and re-verify AI citations.
