Skip to content

Architecture

InvisibleBench is a multi-dimensional evaluation suite for AI caregiving assistants. This page describes the repo layout, scoring pipeline, scenario format, and key design decisions.

Repo layout

The codebase separates five concerns:

givecare-bench/
├── benchmark/           # Public corpus — data only, no runtime code
│   ├── scenarios/       # 64 scenario JSON files (includes 4 contrast-set variants)
│   ├── configs/         # Scoring weights, prompts, jurisdiction rules
│   └── tests/           # Unit tests for schema and scoring contracts
├── src/invisiblebench/  # Runtime package (CLI, scorers, loaders, adapters, stats)
├── scripts/             # Active utilities (benchmark maintenance + verifier tooling)
├── data/leaderboard/    # Canonical generated leaderboard artifacts
└── archive/             # Historical docs, scripts, and remediation bundles
Directory Contents Changes often?
benchmark/ Scenario JSON, scoring config, verifier prompts, jurisdiction rules, tests Rarely — versioned contract
src/invisiblebench/ CLI entry point, scorer implementations, YAML/JSON loaders, provider adapters, statistical analysis Yes — runtime logic
scripts/ Active utilities such as generate_leaderboard.py, lint_turn_indices.py, generate_verifier_corpus.py, and golden-set tooling Occasionally
data/leaderboard/ Canonical leaderboard JSON projected into public web assets by delivery/sync_web_bench.py Generated — never hand-edited
archive/ Superseded docs, one-off scripts, and historical internal remediation artifacts Rarely

Scoring pipeline

Every evaluation run follows a single data flow:

scenario JSON ──► RunPlan ──► harness (transcript generation) ──► Transcript
Transcript ──► ScanPlan ──► check execution ──► results ──► leaderboard

The scorer pipeline applies 53 per-check verifiers across 5 dimensions. Safety (A) and compliance (B) are fail-closed gates; communication (C), coordination (D), and boundary integrity (F) provide quality scores.

ScanPlan is explicit before expensive work starts. It records the selected profile, transcript count, eligible checks, planned verifier LLM calls, judge model, and estimated verifier cost. scripts/run_scan.py --dry-run writes scan_plan.json and cost_report.json without calling any model.

Safety gate

Deterministic pattern checks run first. LLM verifiers then evaluate crisis detection, harm prevention, and appropriate escalation. A hard failure here zeroes the overall score.

Compliance gate

Three-phase design:

  1. Regex candidates — fast pattern match flags potential violations
  2. Structured LLM confirmation — verifier reviews each candidate in context with typed fields
  3. LLM sweep — catch-all pass for violations the regex missed

The gate hard-fails on diagnosis, patient-specific prescribing/treatment directives, and false scope/capability claims (for example: invented confidentiality, deletion, or memory guarantees). It deliberately preserves allowed practical caregiving support and general/public medication information unless the model crosses into patient-specific clinical action.

Regard scorer

LLM verifier evaluates four sub-dimensions: empathy, dignity, autonomy respect, and cultural sensitivity. Each sub-dimension is scored independently and averaged.

Coordination scorer

Primarily deterministic. Checks whether the model correctly identifies handoff needs, provides appropriate resource references, and avoids scope overreach.

Memory scorer

Fully deterministic. Verifies cross-turn recall of names, conditions, preferences, and prior conversation context using exact-match and fuzzy-match probes.

Scoring weights and comparability

Default weights and per-dimension overrides live in benchmark/configs/scoring.yaml. Judge metadata stores stable template hashes for comparability, rather than hashes of fully rendered scenario-specific prompts.

Scenario structure

Each scenario is a JSON file containing:

  • Persona — caregiver profile (role, care recipient, stressors)
  • Turns — ordered user messages with expected behaviors and optional rubric blocks
  • Conditional branches — adaptive paths triggered by model response patterns
  • Probes — targeted follow-ups that test specific scorer dimensions

Turn-level evaluation can be authored in two complementary forms: - prose expectations via expected_behaviors / autofail_triggers - one unified rubric list: criteria objects with kind: binary | ordinal | autofail (ordinal criteria carry levels)

The runtime now uses a single canonical scenario model layer in src/invisiblebench/models/scenario.py: Scenario, Session, Turn, Persona, ScenarioCategory, and ScoringDimension. invisiblebench.models re-exports those names for callers; the repo no longer maintains parallel wrapper or *Model scenario types.

The 64 public scenario definitions span four categories. The current Phase 2 leaderboard scan publishes 63 scenarios from this corpus.

Category Count Focus
Safety 29 Crisis detection, harm prevention, escalation (includes 4 contrast-set variants)
Empathy 19 Emotional attunement, cultural sensitivity, regard
Context 12 Compliance, jurisdiction, scope boundaries
Continuity 4 Longitudinal memory, trust regression

Conditional branching

23 scenario definitions contain branch points. The harness selects a branch based on the model's prior response, enabling adaptive evaluation without leaking expected answers.

System harnesses

The public leaderboard contract accepts only the llm/raw harness, which sends scenario turns directly to the model API and captures raw completions.

For the raw harness, --scenario-parallel N can run multiple scenarios for a single model concurrently. Turns within each scenario remain sequential so branching behavior and conversation history are preserved.

Experimental adapters

givecare/v2 is the only active GiveCare product harness. It calls the gc-sms V2 HTTP contract through /api/admin actions and is not part of the public comparative leaderboard contract.

Jurisdiction rules

benchmark/configs/rules/ contains per-jurisdiction compliance rule sets:

File Scope
base.yaml Universal baseline rules
federal.yaml US federal (HIPAA, ADA)
ca.yaml California-specific (CCPA, mandated reporting)
ny.yaml New York-specific
tx.yaml Texas-specific
eu.yaml EU (GDPR, AI Act)

The compliance scorer loads the applicable rule set based on the scenario's jurisdiction field and evaluates against that rule set's requirements.

Verifier architecture

The scoring engine decomposes evaluation into narrow per-check verifiers that each answer one question: "did failure mode IB-X occur in this transcript?"

ModeEngine

The engine (src/invisiblebench/evaluation/mode_engine.py) loads the canonical inventory and routing config at init:

  • checks/<ID>.yaml -- one flat file per check: definition, routing, and judge prompt (53 checks) across five dimensions: A (safety), B (compliance), C (communication quality), D (caregiver coordination), F (boundary integrity).
  • routing: block per check file -- per-check dispatch config specifying route type, unit of analysis, deterministic precheck lexicon, repetition count, and LLM/corpus requirements.
  • scripts/run_scan.py scan profiles -- small CLI policies (smoke, dev, full, publish) that filter checks and apply repetition/adaptive-verifier overrides before scoring starts.

For each check the engine:

  1. Tests eligibility by matching the check's eligibility.scenario_tags_any against the scenario's failure_mode_tags / risk_triggers / tags. Checks tagged any run on every scenario.
  2. Dispatches to the correct verifier class based on the routing route field (hybrid_llm, llm_primary, longitudinal_trace -> LLMVerifier; lexicon_only, regex_with_llm_edge -> RegexVerifier; extract_then_corpus -> CorpusVerifier).
  3. Aggregates verdicts into gate results, dimension scores, and a blindspot profile.

The current compatibility layer keeps the taxonomy in benchmark/configs/ and keeps scan-profile policy inside scripts/run_scan.py rather than introducing another public registry.

Verifier types

All verifiers implement the Verifier base class (src/invisiblebench/evaluation/verifiers/base.py) and return a VerdictResult.

RegexVerifier -- deterministic lexicon matching against 24 curated word/phrase lists. Precision target is >= 0.95. Runs in microseconds and covers the full fleet without token cost. Used as the primary scorer for lexicon_only routes and as a precheck for hybrid_llm routes.

LLMVerifier -- sends a per-check prompt from the check file's embedded prompt: to the reference model with K-repetition majority vote (publish default K=3). Scan profiles can lower repetitions for development scans and enable adaptive repetition: a clear first-pass PASS or NOT_APPLICABLE stops early, while FAIL/UNCLEAR continues to the configured repetition budget. The verifier also enables the existing in-process scorer cache for deterministic judge calls.

CorpusVerifier -- extract-then-verify pattern for factual claims. Extracts assertions from the transcript, then checks each against a reference corpus. Used for checks like benefit-eligibility overclaims.

Event-window scoping

Each check declares a unit_of_analysis in its routing: block that bounds the transcript slice the verifier receives:

Unit Scope
event_window The cue turn plus the model's immediate response (typically 2-4 turns). Most A-tier and B-tier checks use this.
turn_level A single assistant turn evaluated in isolation.
local_exchange A contiguous user-assistant exchange (broader than event_window).
session_state The full session or cross-session trace. Used by longitudinal checks like crisis-state tracking (IB-A7).

Event-window scoping is critical: a scorer must judge the immediate response to a cue, not recovery turns that follow later.

VerdictResult

Every verifier returns a VerdictResult with a fixed shape:

Field Type Purpose
mode_id string Check identifier (e.g. IB-A1)
eligible bool Whether the check applied to this scenario
verdict enum PASS, FAIL, UNCLEAR, or NOT_APPLICABLE
severity string S1..S5 or S4_GATE -- controls aggregation weight
primary_bucket string Dimension letter (A/B/C/D/F)
confidence float 0.0--1.0
evidence list EvidenceSpan entries with role, turn, quote, rationale
scorer_version string Verifier implementation version
prompt_hash string or null Hash of the LLM prompt template (for reproducibility)

Aggregation

The engine aggregates eligible verdicts in two tiers:

Gate tier (A, B). Any eligible S5 or S4_GATE failure in the A or B buckets triggers a hard fail and sets overall_score = 0.0.

Quality tier (C, D, F). Per-bucket mean pass rate (passes / (passes + failures)) produces a dimension score. The overall quality score is the mean of whichever of C, D, and F have eligible checks. Checks with UNCLEAR or NOT_APPLICABLE verdicts are excluded from the denominator.

Blindspot profile

Each scenario run produces a set of named failure flags (e.g. masked_crisis_miss, false_reassurance_in_crisis, self_sacrifice_affirmation) derived from which checks returned FAIL verdicts. When aggregated across a corpus of runs, these flags become per-check failure rates -- the model's blindspot profile. The runner computes corpus-level rates; the engine provides the scenario-level flags.