Evidence Development Program

Research & Validation

JRS is in operational validation: an ongoing effort to examine whether structured pre-finalization review can identify Decision Reconstruction Risk that existing review does not systematically catch, and whether it can support decision defensibility. Decision Reconstruction Risk is the condition in which a record cannot explain why a consequential decision was made. This page registers the program's studies and methodology and reports their honest current status. The goal is not to prove JRS works; it is to observe decision reconstructability, reviewer behavior, framework ambiguity, recognition patterns, and how evidence develops.

This Week's Observation

Scope: the current automated observation is a cross-vendor reproducibility check. Three independent AI models from different vendors each judge the same set of constructed (synthetic) records, and the run reports how often they agree. It measures agreement only: it is not a measure of accuracy, not validation, and not evidence about real workplace records.

No observation published yet. One appears here automatically once a study run completes.

Questions Emerging From the Data

We publish questions, not conclusions. Each is open and under investigation. Follow a question rather than a report.

Full registry: Questions Under Investigation →

Validation Maturity

Current Stage

Operational Validation

Early. Collecting reviewer observations; no validated findings yet.

Evidence Base

Forming

Live aggregate counts at results.html.

Condition Maturity

Experimental

All five conditions. See the Codebook.

Readiness Scores

Not Established

Require sufficient multi-reviewer and benchmark data.

Study Registry

Study 001 · AI Reproducibility (cross-vendor, synthetic)

Active

Three independent AI models from different vendors each judge the same set of constructed (synthetic) records, and the nightly run reports how often the models agree. It involves no human reviewers and no ground-truth labels, so it cannot speak to accuracy or to real records. Independent vendors agreeing is a stronger signal than one model repeating itself, but agreement is still not accuracy and not validation.
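As a rough illustration of what "how often the models agree" could mean, the sketch below computes a unanimous-agreement rate and pairwise agreement rates over a shared set of records. It is not the program's actual pipeline: the model names, the labels ("flag", "clear"), and the function name `agreement_rates` are all assumptions made for the example.

```python
from itertools import combinations


def agreement_rates(judgments):
    """Return (unanimous_rate, pairwise_rates) for per-model label lists.

    judgments maps a model name to its labels, one per record, with all
    models judging the same records in the same order.
    """
    models = list(judgments)
    if len(models) < 2:
        raise ValueError("need at least two models to measure agreement")
    n = len(judgments[models[0]])
    if n == 0 or any(len(judgments[m]) != n for m in models):
        raise ValueError("every model must judge the same non-empty record set")

    # A record counts as unanimous when every model gave it the same label.
    unanimous = sum(
        1 for i in range(n) if len({judgments[m][i] for m in models}) == 1
    ) / n

    # Share of records on which each pair of models gave the same label.
    pairwise = {
        (a, b): sum(x == y for x, y in zip(judgments[a], judgments[b])) / n
        for a, b in combinations(models, 2)
    }
    return unanimous, pairwise


# Hypothetical labels for five synthetic records (illustrative only).
example = {
    "model_a": ["flag", "clear", "flag", "clear", "flag"],
    "model_b": ["flag", "clear", "clear", "clear", "flag"],
    "model_c": ["flag", "clear", "flag", "clear", "clear"],
}
unanimous, pairwise = agreement_rates(example)
print(f"unanimous: {unanimous:.2f}")
for pair, rate in pairwise.items():
    print(pair, f"{rate:.2f}")
```

Nothing in this calculation refers to a correct answer, which is why a high agreement rate says nothing about accuracy.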

Study 002 · Ground-Truth Benchmark

Planned

Compare review outputs against expert benchmark mappings to measure accuracy (distinct from agreement). Requires a benchmark dataset and execution engine.
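The distinction between accuracy and agreement can be made concrete with a small sketch. The benchmark labels, reviewer outputs, and the helper name `match_rate` below are hypothetical, since no benchmark dataset exists yet; the point is only that two outputs can agree perfectly with each other while both diverge from the benchmark.

```python
def match_rate(labels_a, labels_b):
    """Fraction of positions at which two equal-length label lists match."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


# Hypothetical expert benchmark mapping for four records (illustrative only).
benchmark = ["flag", "flag", "clear", "clear"]

# Two hypothetical review outputs that happen to be identical.
output_1 = ["clear", "flag", "clear", "flag"]
output_2 = ["clear", "flag", "clear", "flag"]

agreement = match_rate(output_1, output_2)    # outputs compared with each other
accuracy_1 = match_rate(output_1, benchmark)  # output compared with the benchmark
accuracy_2 = match_rate(output_2, benchmark)

print(f"agreement: {agreement:.2f}")
print(f"accuracy:  {accuracy_1:.2f}, {accuracy_2:.2f}")
```

Here the two outputs agree on every record yet each matches the benchmark on only half of them.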

Study 003 · Condition Performance

Design complete

Track per-condition agreement and dispute rates from challenge and reviewer data as volume grows.

Study 004 · Framework Ambiguity

Design complete

Estimate reviewer variance and record ambiguity once multiple reviewers assess the same records.

Study 005 · Continuous Replication

Planned

Re-run reviews on a schedule to track drift and stability across framework versions. Requires scheduled backend compute.

Study 006 · Participant Recognition

Collecting

Tracks recognition patterns from the One-Minute Challenge and Extended Review. Data are accruing now and will be reported once the sample is adequate.

Study 007 · Learning Effect

Planned

Compare first vs. subsequent attempts. Requires repeat-participant identification across sessions.

Study 008 · Professional Reviewer

Collecting

Role and profession are captured at participation; patterns by reviewer type will be reported as participation grows.

Study 009 · Organizational-Psychology Readiness

Building dataset

Assembling reliability, difficulty, agreement, and behavioral datasets for independent organizational-psychology review.

Study 010 · Criterion Validity (real-outcome)

Collecting

De-identified public determinations are paired with their documented real-world outcomes (upheld, overturned, challenged) to test whether JRS reads correspond to results when a record is contested. Cases are accruing across HR, public-records, and related domains in small batches. No results are reported until the sample is adequate.
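One simple way such pairings could be summarized is a cross-tabulation of JRS read against documented outcome. The sketch below is an assumption about presentation, not the study's method: the read labels ("flagged", "not_flagged"), the helper names, and the five cases are invented for illustration and are not study data.

```python
from collections import Counter


def cross_tabulate(cases):
    """Count (jrs_read, outcome) pairs across cases."""
    return Counter(cases)


def outcome_share(table, read, outcome):
    """Share of cases with the given read that ended in the given outcome."""
    total = sum(count for (r, _), count in table.items() if r == read)
    if total == 0:
        raise ValueError(f"no cases with read {read!r}")
    return table[(read, outcome)] / total


# Hypothetical (jrs_read, outcome) pairs; not real cases.
cases = [
    ("flagged", "overturned"),
    ("flagged", "upheld"),
    ("not_flagged", "upheld"),
    ("not_flagged", "upheld"),
    ("flagged", "overturned"),
]
table = cross_tabulate(cases)
for (read, outcome), count in sorted(table.items()):
    print(f"{read:12s} {outcome:11s} {count}")

flagged_overturned = outcome_share(table, "flagged", "overturned")
```

With a sample this small the shares are meaningless, which matches the study's rule of reporting nothing until the sample is adequate.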

Study 011 · AI-Assisted Records Detection

Collecting

Tests whether JRS distinguishes AI-generated records whose conclusions are grounded in their source from records that read convincingly but are not grounded, judged blind by independent reviewers against a held-out key. The stimuli are constructed, with known ground truth; reviewer reads are accruing. The study concerns detection of a known AI documentation risk, not accuracy or validation.

Status legend. Collecting = real data accruing from live participation. Design complete = methodology defined; awaiting sufficient data. Planned = requires infrastructure not yet in place (e.g., model API access or scheduled compute). No study reports validated findings; this is a validation-phase program.

What Would Count as Evidence (and What Would Falsify a Claim)

A JRS claim of usefulness would be supported only if independent reviewers, applying the five conditions to records they did not author, identify deficiencies that standard review misses, and agree with one another above chance. It would be weakened or falsified if reviewers cannot apply the conditions consistently, if flagged records are no less reviewable than unflagged ones under expert assessment, or if agreement is no better than chance. The current automated reproducibility check (independent models, synthetic records) does not bear on any of this; it only measures whether independent models agree on constructed records, not whether they are correct. Accuracy requires a ground-truth benchmark (Study 002); reliability requires multiple human reviewers (Study 004). Neither exists yet, which is why nothing here is presented as validated.
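"Agreement above chance" needs a chance-corrected statistic rather than a raw agreement rate. The program does not name one; the sketch below uses Fleiss' kappa purely as an assumed example of such a statistic, with hypothetical labels and three hypothetical reviewers per record. A value near 0 means agreement is about what chance would produce; a value of 1 means perfect agreement.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for records each rated by the same number of reviewers.

    ratings is a list with one entry per record; each entry is the list of
    labels that record received.
    """
    if not ratings:
        raise ValueError("need at least one record")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every record needs the same number (>= 2) of ratings")

    n_records = len(ratings)
    counts = [Counter(r) for r in ratings]
    categories = sorted({label for r in ratings for label in r})

    # Observed agreement: mean share of agreeing reviewer pairs per record.
    per_record = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    observed = sum(per_record) / n_records

    # Expected agreement from the overall proportion of each label.
    proportions = [
        sum(c[k] for c in counts) / (n_records * n_raters) for k in categories
    ]
    expected = sum(p * p for p in proportions)
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one label is ever used")

    return (observed - expected) / (1 - expected)


# Hypothetical reads from three reviewers on four records (illustrative only).
reads = [
    ["flag", "flag", "flag"],
    ["clear", "clear", "clear"],
    ["flag", "flag", "clear"],
    ["clear", "clear", "flag"],
]
kappa = fleiss_kappa(reads)
print(f"Fleiss' kappa: {kappa:.3f}")
```

In this toy data the raw per-record agreement averages about 0.67, but half of that is what the label proportions alone would produce, so the chance-corrected value is considerably lower.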

Origin & Approach

JRS originated from civil rights investigative and documentation-review experience, with a cognitive-behavioral and AI-governance lens. Read the origin and what JRS is and is not →

Program Layers

Findings & Discussion

Latest finding · discuss

Questions Under Investigation

Follow the questions

JRS Codebook

Measurement instrument · v1.0

Claim control: reproducibility, accuracy, and validation are distinct and are not treated as equivalent anywhere in this program. Figures shown are observational, not statistical findings or validated data.