Judge Validation Manifest

Diátaxis: reference — factual lookup for InvisibleBench's LLM-as-judge infrastructure. For the scoring architecture itself, see scoring-rubric.md. For methodology rationale, see methodology.md.

Purpose

InvisibleBench evaluates models using a mix of deterministic scorers and LLM-as-judge scorers. The judge prompt templates themselves are kept private to prevent gaming, but the runtime emits stable prompt-template hashes with every scored result so that external readers can verify that two runs used the same judge template — even without access to the prompt text.

This document lists the judges, their template-hash slots, and the current status of judge-validation work (TPR/TNR against human-labelled ground truth). It is the public companion to the benchmark card's "judge validation in progress" note: this page tracks what is and is not validated.

Judges in the pipeline

Each scored scenario result carries per-scorer metadata. Only LLM-backed scorers have a judge_prompt_hash; deterministic scorers do not.

| Scorer | Type | Role | Has judge_prompt_hash? |
|---|---|---|---|
| safety | LLM judge | Crisis-gate: binary safety fail for mental-health high-severity triggers | yes |
| compliance | LLM judge | Structured extraction of diagnosis / prescribing / scope-claim violations | yes |
| regard | LLM judge | Conversational quality (warmth, attunement, coordination, trauma-informed) | yes |
| coordination | Deterministic | Regex-based proxy for coordination signals (no LLM, no hash) | no |
| memory | Deterministic | Probe-based recall/continuity scoring for multi-session scenarios (no LLM, no hash) | no |

See benchmark/configs/prompts/README.md for the list of required prompt files and their template variables. The .txt files themselves are gitignored.

How template hashes are computed

compute_prompt_template_hash(*parts) (in src/invisiblebench/api/client.py) takes the static prompt-template text — not the fully rendered per-scenario prompt — and returns a SHA-256 of the whitespace-normalized join.

This means:

  • Editing a judge prompt template produces a new hash; old runs and new runs are no longer comparable on that judge.
  • Changing scenario content does not change the hash; per-scenario rendering is applied after hashing.
  • The hash is a stable identifier for the judge behavior contract, not for any one invocation.
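
The computation above can be sketched as follows. This is a minimal reconstruction, assuming "whitespace-normalized" means collapsing every run of whitespace to a single space; the actual normalization in src/invisiblebench/api/client.py may differ in detail.

```python
import hashlib
import re


def compute_prompt_template_hash(*parts: str) -> str:
    # Join the static template parts, collapse all whitespace runs to a
    # single space, and hash the result. Scenario variables are rendered
    # only after hashing, so the hash pins the template, not an invocation.
    joined = " ".join(parts)
    normalized = re.sub(r"\s+", " ", joined).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Reformatting the template leaves the hash unchanged...
a = compute_prompt_template_hash("You are a safety judge.\n\nScore {transcript}.")
b = compute_prompt_template_hash("You are a safety judge. Score {transcript}.")
assert a == b

# ...while any wording change produces a new hash.
c = compute_prompt_template_hash("You are a strict safety judge. Score {transcript}.")
assert a != c
```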

Every ScenarioResult written by the runner includes judge_prompt_hash at the top level (primary judge) and per-dimension hashes inside the individual scorer outputs. The leaderboard artifacts under data/leaderboard/ echo the top-level value for each row.

Current template hashes

Hashes below are derived from the public leaderboard artifacts (data/leaderboard/leaderboard.json, contract version 2.1.0).

| Judge | Template hash (SHA-256, first 16) | Scope | Source |
|---|---|---|---|
| regard | dc9c89876f57d179… | Primary top-level judge_prompt_hash | data/leaderboard/leaderboard.json |
| safety | pinned per-result under scorer_details | Not surfaced at leaderboard top level | scorer_details.safety.judge_prompt_hash |
| compliance | pinned per-result under scorer_details | Not surfaced at leaderboard top level | scorer_details.compliance.judge_prompt_hash |

To extract per-scorer hashes from a scored run, read the per-scenario result JSON written by bench into results/<run-id>/. The raw result payload preserves scorer-level judge_prompt_hash entries that the leaderboard summary collapses.
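
A minimal extraction sketch: the field names (judge_prompt_hash, scorer_details) are taken from this page, but the surrounding payload structure is illustrative and may not match the real result files exactly.

```python
import json
from pathlib import Path


def collect_judge_hashes(run_dir: str) -> dict:
    """Map each scorer name to the set of judge_prompt_hash values seen in a run."""
    hashes: dict = {}
    for path in sorted(Path(run_dir).glob("*.json")):
        result = json.loads(path.read_text())
        # Top-level hash: the primary (regard) judge.
        if "judge_prompt_hash" in result:
            hashes.setdefault("primary", set()).add(result["judge_prompt_hash"])
        # Per-scorer hashes: only LLM-backed scorers carry one.
        for scorer, detail in result.get("scorer_details", {}).items():
            h = detail.get("judge_prompt_hash")
            if h is not None:
                hashes.setdefault(scorer, set()).add(h)
    # A healthy run has exactly one hash per LLM-backed scorer.
    return hashes
```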

Validation status

Judge validation compares judge verdicts against a human-labelled ground-truth subset. Each judge is labelled with one of:

  • validated — TPR and TNR reported against a labelled sample ≥ N.
  • fixed-unvalidated — prompt template is frozen under its current hash but has not yet been measured against labels at the required sample size.
  • in-progress — labelling or re-scoring work is live; numbers will move.
  • unvalidated — no labelled comparison has been attempted.

| Judge | Status | TPR | TNR | Sample size | Hash pin | Notes |
|---|---|---|---|---|---|---|
| safety | validated | 1.000 | 1.000 | 60 | per-result* | Crisis-gate judge; validated on the resolved 60-trace gold set (4 fail / 56 pass on the safety gate). |
| compliance | validated | 1.000 | 1.000 | 60 | per-result* | Structured extraction; validated on the resolved 60-trace gold set (11 fail / 49 pass on the compliance gate). |
| regard | in-progress | n/a | n/a | 60 | dc9c8987… | Measured against the resolved 60-trace gold set; the current scorer collapses to pass on all four regard axes often enough that agreement is too weak for validation-grade use. |
| coordination | deterministic | n/a | n/a | n/a | n/a | Known floor effect (regex proxy); see methodology.md. |
| memory | deterministic | n/a | n/a | n/a | n/a | Probe-based; scored against scenario-authored expected strings. |

* safety and compliance expose per-result hashes under scorer_details.<scorer>.judge_prompt_hash; the leaderboard-level hash shown is the regard/primary judge hash and is not equivalent.

Safety and compliance are now calibrated on the resolved gold set. Regard has been measured on the same 60-trace set, but the current quality judge is not yet validation-grade: it tends to over-predict pass, so close leaderboard deltas should be read cautiously.

Calibration-set apparatus

The calibration set has been resolved into internal gold: 60 stratified traces across contested-false-scope, clinical-boundary, crisis, and clean-pass buckets; per-candidate label templates; an LLM-drafted "silver" prior; two independent human passes; and conflict resolution into labels/gold/.

Current internal validation artifacts live under internal/evals/verifier/golden_set/, especially:

  • current_scorer_vs_gold.md / current_scorer_vs_gold.csv
  • current_regard_vs_gold.md / current_regard_vs_gold.csv
  • verifier_validation.md
  • gold_resolution_summary.md

External reproducibility

A third party who cannot access private prompt text can still:

  1. Re-run the benchmark with their own judge prompts and compare top-line deltas; absolute numbers will differ.
  2. Verify that two published runs used the same judges by comparing judge_prompt_hash values per-scorer.
  3. Detect judge drift: if any scorer's published hash does not match the manifest above, the judge template has shifted and cross-run comparisons on that scorer are invalid.
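
The checks above reduce to an equality test over per-scorer hash maps. The hash values below are illustrative, not real template hashes:

```python
def same_judges(run_a: dict, run_b: dict) -> bool:
    """True when every scorer present in both runs used the same judge template."""
    shared = set(run_a) & set(run_b)
    return bool(shared) and all(run_a[s] == run_b[s] for s in shared)


row_a = {"primary": "dc9c89876f57d179", "safety": "aa11"}
row_b = {"primary": "dc9c89876f57d179", "safety": "aa11"}
row_c = {"primary": "dc9c89876f57d179", "safety": "bb22"}  # safety template shifted

assert same_judges(row_a, row_b)       # comparable runs
assert not same_judges(row_a, row_c)   # judge drift on the safety scorer
```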

Change policy

  • Any change to a judge template that would alter compute_prompt_template_hash output is a breaking change for cross-run comparability and must be reflected as a new row in this manifest.
  • Validation numbers must be re-measured after any hash change; prior numbers do not carry over.
  • When additional validation lands (for example the quality judge), this document should be updated in the same commit as the supporting internal/evals/ artifacts.

v3 per-mode calibration

v3 replaces the monolithic LLM judge with per-mode verifiers — each check has its own scorer prompt, its own calibration threshold, and its own gold set. The architecture is inspired by Cobbe et al. (2021), "Training Verifiers to Solve Math Word Problems" (arXiv:2110.14168): rather than one global judge whose agreement metric must cover every failure type, each mode maintains an independent verifier whose precision and recall are measured against mode-specific human labels. This means a validated check can ship while adjacent checks are still in calibration.
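
The contract each mode carries (own prompt, own threshold, own validation state) can be sketched as a small record. Class and field names here are illustrative, not the project's actual API; the modes and kappa values come from the calibration table below.

```python
from dataclasses import dataclass
from typing import Optional

KAPPA_TIER1 = 0.65  # Cohen's kappa bar for a validated (Tier 1) check


@dataclass(frozen=True)
class ModeVerifier:
    mode: str               # e.g. "IB-A1"
    check: str              # human-readable check name
    kappa: Optional[float]  # agreement vs human labels, None if unmeasured

    @property
    def tier1(self) -> bool:
        # A check ships validated signal only once it clears the kappa bar.
        return self.kappa is not None and self.kappa >= KAPPA_TIER1


verifiers = [
    ModeVerifier("IB-A1", "Passive ideation miss", 1.000),
    ModeVerifier("IB-D3", "Self-sacrifice affirmation", None),
]
# IB-A1 can ship while IB-D3 stays in calibration.
assert [v.mode for v in verifiers if v.tier1] == ["IB-A1"]
```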

v2 scoring has been archived; the verifier pattern is the sole scoring system.

Calibration results

| Mode | Check | κ vs human | n | Tier | Method |
|---|---|---|---|---|---|
| IB-A1 | Passive ideation miss | 1.000 | 40 | Tier 1 | Human labels |
| IB-F3 | False continuity claim | 0.795 | 40 | Tier 1 | Regex + LLM edge |
| IB-A8 | False reassurance | 0.747 | 40 (excl. ambiguous) | Tier 1 | Automated scorer |
| IB-D3 | Self-sacrifice affirmation | -- | -- | Unvalidated | Conservative scorer |
| IB-A3 | Harm-fear miss | -- | 40 | Human-only | Automated scorer in progress |

Tier 1 = Cohen's kappa >= 0.65 against human expert labels. Three checks currently meet this bar; two remain in calibration.
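
For reference, Cohen's kappa for a binary pass/fail check is observed agreement corrected for chance agreement under the two raters' marginals. A minimal sketch (assumes the two raters are not both constant, which would make expected agreement 1):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two binary (0/1) raters over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the raters were independent with these marginals.
    p_a = sum(labels_a) / n
    p_b = sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)


# Perfect agreement on a 40-trace set (10 fail / 30 pass) gives kappa = 1.0.
human = [1] * 10 + [0] * 30
scorer = list(human)
assert cohens_kappa(human, scorer) == 1.0
```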

Gold set structure

Each gold set contains 40 traces stratified into four buckets:

  • 10 clear PASS (unambiguous non-failure)
  • 10 clear FAIL (unambiguous failure)
  • 10 ambiguous (edge cases that test scorer discrimination)
  • 10 adversarial (designed to fool surface-level heuristics)
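
A stratification check over such a set can be sketched as follows; the bucket key and bucket names are hypothetical field values, not the project's actual schema.

```python
from collections import Counter

# Expected four-bucket stratification for one 40-trace gold set.
BUCKETS = {"clear_pass": 10, "clear_fail": 10, "ambiguous": 10, "adversarial": 10}


def check_gold_set(traces) -> bool:
    """Verify a gold set matches the 10/10/10/10 bucket counts exactly."""
    counts = Counter(t["bucket"] for t in traces)
    return dict(counts) == BUCKETS


sample = (
    [{"bucket": "clear_pass"}] * 10
    + [{"bucket": "clear_fail"}] * 10
    + [{"bucket": "ambiguous"}] * 10
    + [{"bucket": "adversarial"}] * 10
)
assert check_gold_set(sample)
```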

200 human-labeled annotation cards exist across the five priority modes (IB-A1, IB-A3, IB-A8, IB-D3, IB-F3). Cards are stored as JSONL at internal/calibration/gold_sets/.

Per-check validation

Each per-mode check is validated independently against human expert labels. When a check reaches Tier 1 validation (κ ≥ 0.65), its signal is trustworthy. Three checks currently meet this bar; the rest use conservative thresholds pending human calibration.