
Answer quality evaluation for grounded systems: rubric + test set design


A grounded-answer evaluation system scores each response along three core axes — factuality, attribution, and coverage — against a frozen test set with golden references; reliable scores depend on splitting retrieval and generation evaluation, calibrating LLM judges against human annotators, and reporting confidence intervals across runs.

TL;DR

Grounded-answer quality is not a single number. Use a multi-axis rubric (factuality, attribution, coverage, calibration, completeness, conciseness) applied to a frozen test set of representative queries with golden context and answers. Track retrieval and generation separately, calibrate judges against human raters, and report scores with variance bands so changes are attributable to specific pipeline edits.

What "grounded answer quality" means

A grounded answer is one whose factual claims are supported by the retrieved context, not by parametric model memory. Evaluation must therefore answer two distinct questions:

  1. Did retrieval surface the evidence needed to answer the query?
  2. Did generation use that evidence faithfully, completely, and with correct attribution?

These two questions correspond to the well-known RAG triad of context relevance, groundedness/faithfulness, and answer relevance, popularised by TruLens and reused by Ragas, DeepEval, and Galileo.

Rubric: six scoring axes

Each axis is scored on a 0-4 scale with anchored definitions. The six axis scores combine into a weighted Grounded Answer Score (GAS) between 0 and 100; a minimal aggregation sketch follows the axis definitions.

1. Factuality (weight 25)

Every load-bearing claim in the answer is verifiable against the retrieved context or an authoritative external source listed in the test set. Score 4 = all claims supported; 3 = one minor unsupported detail; 2 = one material unsupported claim; 1 = multiple unsupported claims; 0 = core claim fabricated.

2. Attribution (weight 20)

Citations resolve to the specific passage that supports the cited claim. Empty, mismatched, or over-broad citations lose points. Citation accuracy in ungated RAG averages only 65-70%, so attribution must be scored explicitly, not assumed.

3. Coverage (weight 20)

The answer addresses every part of a multi-part query and includes the highest-priority facts marked in the golden reference.

4. Calibration / Refusal (weight 15)

When evidence is missing, the system refuses or flags uncertainty rather than guessing. False refusals are penalised symmetrically with hallucinated answers.

5. Completeness vs Conciseness (weight 10)

The answer covers required facts without padding. Verbose answers that recite irrelevant retrieved text indicate poor generation conditioning.

6. Format compliance (weight 10)

Answer respects the expected structure: short direct answer first, optional supporting detail, and machine-readable citation format.
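
With all six axes defined, the aggregation is mechanical. Below is a minimal sketch in Python; the axis names and weights come from the rubric above, and normalising each 0-4 score to [0, 1] before weighting is an assumption that makes the weights (which sum to 100) land on the stated 0-100 scale.

```python
# Minimal GAS aggregation sketch. Axis names and weights are from the rubric;
# dividing each 0-4 score by 4 is an assumption that yields the 0-100 range.

AXIS_WEIGHTS = {
    "factuality": 25,
    "attribution": 20,
    "coverage": 20,
    "calibration": 15,
    "completeness_conciseness": 10,
    "format_compliance": 10,
}  # sums to 100


def grounded_answer_score(axis_scores: dict[str, int]) -> float:
    """Combine six per-axis 0-4 scores into a 0-100 weighted GAS."""
    if set(axis_scores) != set(AXIS_WEIGHTS):
        raise ValueError("scores must cover exactly the six rubric axes")
    return sum(AXIS_WEIGHTS[a] * (s / 4) for a, s in axis_scores.items())


# Example: one attribution slip and slight verbosity, everything else perfect.
print(grounded_answer_score({
    "factuality": 4, "attribution": 3, "coverage": 4,
    "calibration": 4, "completeness_conciseness": 3, "format_compliance": 4,
}))  # 92.5
```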

Failure-mode taxonomy

Tag every failed answer with at least one code so trends are diagnosable: HALLUCINATION_FACT, UNSUPPORTED_QUALIFIER, MISATTRIBUTION, OVER_BROAD_CITATION, MISSING_REFUSAL, OVER_REFUSAL, INCOMPLETE_ANSWER, CONTEXT_IGNORED, CONFLICTING_SOURCES_UNRESOLVED, FORMAT_DRIFT. Frameworks like GroUSE warn that single-score judges miss several of these; tagging makes them visible.
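
To keep the tags machine-countable across runs, they can be pinned in code. A minimal sketch, with tag names copied verbatim from the taxonomy above; the Counter-based aggregation is an assumption about how the nightly heatmap consumes them.

```python
from collections import Counter
from enum import Enum


class FailureTag(Enum):
    """Failure-mode codes; every failed answer gets at least one."""
    HALLUCINATION_FACT = "HALLUCINATION_FACT"
    UNSUPPORTED_QUALIFIER = "UNSUPPORTED_QUALIFIER"
    MISATTRIBUTION = "MISATTRIBUTION"
    OVER_BROAD_CITATION = "OVER_BROAD_CITATION"
    MISSING_REFUSAL = "MISSING_REFUSAL"
    OVER_REFUSAL = "OVER_REFUSAL"
    INCOMPLETE_ANSWER = "INCOMPLETE_ANSWER"
    CONTEXT_IGNORED = "CONTEXT_IGNORED"
    CONFLICTING_SOURCES_UNRESOLVED = "CONFLICTING_SOURCES_UNRESOLVED"
    FORMAT_DRIFT = "FORMAT_DRIFT"


def tag_frequencies(tagged_failures: list[list[FailureTag]]) -> Counter:
    """Aggregate per-answer tag lists into run-level frequencies."""
    return Counter(tag for tags in tagged_failures for tag in tags)
```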

Test set design

A test set is the contract that makes scores comparable across runs.

Composition

A production-grade test set has at least 300 queries distributed across full-coverage queries (40%), partial-coverage queries (20%), out-of-corpus queries that must be refused (15%), ambiguous queries (10%), adversarial queries (10%), and long-tail entity queries (5%). Each row contains: query_id, query, intent, golden_context_ids, golden_answer, required_facts[], forbidden_claims[], expected_refusal (bool), and tags[].
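
The row schema translates directly into a typed record. A sketch, with field names copied from the list above; the concrete types are assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)  # frozen mirrors the freeze-the-test-set contract
class TestSetRow:
    query_id: str
    query: str
    intent: str
    golden_context_ids: list[str]
    golden_answer: str
    required_facts: list[str]   # each fact also carries a priority weight
    forbidden_claims: list[str]
    expected_refusal: bool
    tags: list[str] = field(default_factory=list)
```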

Sampling and freezing

Sample queries from real production logs after PII scrubbing, then freeze. Re-sampling every release destroys comparability. Maintain a separate rolling holdout of 50 fresh queries per quarter to detect distribution drift without touching the frozen anchor.

Golden answers

Three independent annotators write the golden answer; a fourth adjudicator merges. Mark each fact with a priority weight (required vs nice_to_have) so scoring can distinguish missing critical facts from missing trivia.

Source freshness

Pin the corpus snapshot used to build golden answers. Re-grade only when the corpus is intentionally refreshed; otherwise stale ground truth will mislabel correct new behaviour.

Scoring workflow

  1. Run the pipeline on the frozen test set with deterministic settings (temperature=0, fixed retriever k, fixed reranker).
  2. Score retrieval with Recall@k, Precision@k, and nDCG against golden_context_ids (see the metrics sketch after this list). Low retrieval recall caps maximum factuality regardless of generation quality.
  3. Score generation with the six-axis rubric using an LLM-as-a-judge calibrated against human annotators.
  4. Aggregate per-axis means and the weighted GAS, plus failure-tag frequencies.
  5. Bootstrap-resample to produce 95% confidence intervals; treat improvements smaller than the CI as noise.
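
A sketch of the retrieval metrics from step 2 and the bootstrap interval from step 5. The binary-relevance reading of nDCG (golden passages score 1, everything else 0) and the function shapes are assumptions.

```python
import math
import random


def recall_at_k(retrieved: list[str], golden: set[str], k: int) -> float:
    """Fraction of golden context ids that appear in the top-k results."""
    return len(set(retrieved[:k]) & golden) / len(golden) if golden else 0.0


def precision_at_k(retrieved: list[str], golden: set[str], k: int) -> float:
    """Fraction of the top-k results that are golden context ids."""
    return len(set(retrieved[:k]) & golden) / k


def ndcg_at_k(retrieved: list[str], golden: set[str], k: int) -> float:
    """Binary-relevance nDCG: golden passages gain 1, all others 0."""
    dcg = sum(1 / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]) if doc in golden)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(golden), k)))
    return dcg / ideal if ideal else 0.0


def bootstrap_ci(scores: list[float], n_resamples: int = 10_000,
                 alpha: float = 0.05) -> tuple[float, float]:
    """95% bootstrap confidence interval over per-query scores (step 5)."""
    means = sorted(
        sum(random.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    return (means[int(n_resamples * alpha / 2)],
            means[int(n_resamples * (1 - alpha / 2))])
```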

LLM-as-a-judge calibration

LLM judges are cost-effective but biased. Calibrate before trusting them:

  • Agreement target: ≥80% accept/reject agreement with a 100-item human-rated subset before promoting a judge to CI.
  • Anti-bias prompt structure: force the judge to extract claims, cite supporting spans, score each axis independently, then emit final JSON (see the sketch after this list). Chain-of-thought judges outperform single-shot rating prompts on subtle errors.
  • Judge-vs-judge consistency: rerun the same items with two judge models; flag items where judges disagree by more than one rubric point for human review.
  • Avoid self-judging: do not use the same model family as judge and generator on the same queries.
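
One way to encode the anti-bias structure from the second bullet: a fixed template that forces the extract-cite-score-emit order, plus strict parsing that rejects malformed verdicts instead of repairing them. The wording and JSON schema here are illustrative assumptions, not a published prompt.

```python
import json

# Illustrative judge template; the step order enforces claim extraction and
# span citation before any scoring, per the anti-bias bullet above.
JUDGE_PROMPT = """\
You are grading a grounded answer against retrieved context.
Work in this order and do not skip steps:
1. Extract every factual claim in the answer as a numbered list.
2. For each claim, quote the exact context span that supports it, or write NONE.
3. Score each axis independently on 0-4: factuality, attribution, coverage,
   calibration, completeness_conciseness, format_compliance.
4. Emit ONLY a JSON object: {{"claims": [...], "scores": {{...}}, "tags": [...]}}

Query: {query}
Retrieved context: {context}
Answer under evaluation: {answer}
"""

AXES = {"factuality", "attribution", "coverage", "calibration",
        "completeness_conciseness", "format_compliance"}


def parse_judge_output(raw: str) -> dict:
    """Reject malformed judge output; a parse failure means re-run."""
    verdict = json.loads(raw)
    if set(verdict["scores"]) != AXES:
        raise ValueError("judge did not score all six axes")
    if not all(0 <= s <= 4 for s in verdict["scores"].values()):
        raise ValueError("axis scores must lie in 0-4")
    return verdict
```

Fill the template with JUDGE_PROMPT.format(query=..., context=..., answer=...); a parse failure should trigger a re-run, never a silent default score.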

GroUSE-style meta-evaluation showed that even GPT-4 judges miss critical failure modes when the rubric has only one axis; multi-axis prompts and unit-test cases improve calibration.

Operating the metric

  • Per-PR CI: run a 50-query smoke subset; gate merges on no-regression in factuality and attribution (see the gate sketch after this list).
  • Nightly: full 300-query run, dashboard with per-axis trend lines, failure-tag heatmap.
  • Quarterly: rolling holdout run + human spot check of 100 sampled answers to detect judge drift.
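
The per-PR gate in the first bullet reduces to a comparison against the stored baseline. A sketch; treating "no-regression" as "no drop larger than the axis's CI half-width" is one reasonable reading, consistent with treating sub-CI changes as noise.

```python
def gate_merge(
    baseline: dict[str, float],      # per-axis means from the frozen baseline run
    candidate: dict[str, float],     # per-axis means from this PR's smoke run
    ci_halfwidth: dict[str, float],  # bootstrap 95% CI half-widths per axis
    gated_axes: tuple[str, ...] = ("factuality", "attribution"),
) -> bool:
    """Block the merge if a gated axis drops by more than its CI half-width."""
    for axis in gated_axes:
        drop = baseline[axis] - candidate[axis]
        if drop > ci_halfwidth[axis]:
            print(f"REGRESSION on {axis}: -{drop:.3f} exceeds CI "
                  f"±{ci_halfwidth[axis]:.3f}")
            return False
    return True
```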

Track three meta-metrics so the rubric itself does not silently rot: judge-human agreement rate on a re-rated 100-item slice, score variance across three repeat runs with temperature=0, and test-set staleness (percentage of queries whose golden answer no longer matches the current corpus).
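
Each meta-metric is a one-liner once the inputs exist. A sketch; exact-match agreement and population variance are assumptions about the precise definitions.

```python
def judge_human_agreement(judge: list[int], human: list[int]) -> float:
    """Exact-match agreement rate on the re-rated 100-item slice."""
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def repeat_run_variance(run_means: list[float]) -> float:
    """Population variance of GAS across repeat runs at temperature=0."""
    mean = sum(run_means) / len(run_means)
    return sum((x - mean) ** 2 for x in run_means) / len(run_means)


def test_set_staleness(total_rows: int, outdated_rows: int) -> float:
    """Share of queries whose golden answer no longer matches the corpus."""
    return outdated_rows / total_rows
```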


FAQ

Q: Why score retrieval and generation separately instead of one end-to-end number?

End-to-end scores hide where failures originate. A 60% answer accuracy could mean 60% retrieval recall with perfect generation, or 100% retrieval with poor faithfulness; the fixes are completely different.

Q: How big should the test set be?

Production teams converge around 300-500 frozen queries plus a 50-query rolling holdout. Below ~200 the per-axis confidence intervals are too wide to detect realistic regressions; above ~1,000 cost dominates without proportional signal.

Q: Can I rely on Ragas faithfulness alone?

No. Ragas faithfulness is a useful component but evaluates only one axis and is sensitive to claim-decomposition quality. Combine with attribution scoring, calibration/refusal scoring, and a small human-rated calibration set; treat Ragas, TruLens, and DeepEval scores as inputs, not the final verdict.

Q: How do I prevent the rubric from drifting over time?

Version the rubric, anchored definitions, and judge prompt as code. Re-calibrate the judge against a fresh human-rated subset each quarter, and require a written changelog entry for any rubric edit.

Q: How do I score answers when sources conflict?

Add an explicit CONFLICTING_SOURCES_UNRESOLVED failure tag. The expected behaviour is to surface the conflict rather than silently picking one. Score calibration high when the system flags the conflict, low when it picks one without acknowledgement.

