Evidence Development Program
Research & Validation
JRS is in operational validation: an ongoing effort to examine whether structured pre-finalization review can identify Decision Reconstruction Risk, the condition in which a record cannot explain why a consequential decision was made. Existing review does not systematically catch this risk, and the program asks whether catching it supports decision defensibility. This page registers the program's studies and methodology and reports their honest current status. The goal is not to prove JRS works; it is to observe decision reconstructability, reviewer behavior, framework ambiguity, recognition patterns, and how evidence develops.
This Week's Observation
Scope: the current automated observation is a cross-vendor reproducibility check. Three independent AI models from different vendors each judge the same set of constructed (synthetic) records, and the run reports how often they agree. It is not accuracy, not validation, and not evidence about real workplace records.
No observation published yet. One appears here automatically once a study run completes.
Questions Emerging From the Data
We publish questions, not conclusions. Each is open and under investigation. Follow a question rather than a report.
Validation Maturity
Current Stage
Operational Validation
Early. Collecting reviewer observations; no validated findings yet.
Condition Maturity
Experimental
Readiness Scores
Not Established
Readiness scores require sufficient multi-reviewer and benchmark data.
Current Findings
Study Registry
Study 001 · AI Reproducibility (cross-vendor, synthetic)
Active
Three independent AI models from different vendors each judge the same set of constructed (synthetic) records, and the nightly run reports how often the models agree. It involves no human reviewers and no ground-truth labels, so it cannot speak to accuracy or to real records. Independent vendors agreeing is a stronger signal than one model repeating itself, but agreement is still not accuracy and not validation.
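As a minimal sketch of what "how often the models agree" could mean in practice, the Python below computes two simple figures: the share of records on which all models give the same label, and the share of matching labels for each pair of models. The function name, the model names, and the `flag`/`pass` labels are assumptions made for illustration; the page does not specify how the nightly run computes its agreement figure, and the labels shown are made up, not study data.

```python
from itertools import combinations


def agreement_summary(judgments):
    """Summarize label agreement across models.

    `judgments` maps a (hypothetical) model name to its list of labels,
    one label per record, in the same record order for every model.
    """
    models = sorted(judgments)
    n = len(judgments[models[0]])
    if n == 0 or any(len(judgments[m]) != n for m in models):
        raise ValueError("every model must judge the same non-empty record set")

    # Share of records on which all models gave the same label.
    unanimous = sum(
        len({judgments[m][i] for m in models}) == 1 for i in range(n)
    ) / n

    # Share of matching labels for each pair of models.
    pairwise = {
        (a, b): sum(x == y for x, y in zip(judgments[a], judgments[b])) / n
        for a, b in combinations(models, 2)
    }
    return {"unanimous": unanimous, "pairwise": pairwise}


# Made-up labels for four constructed records, for illustration only.
example = {
    "model_a": ["flag", "pass", "flag", "pass"],
    "model_b": ["flag", "pass", "pass", "pass"],
    "model_c": ["flag", "pass", "flag", "flag"],
}
summary = agreement_summary(example)
print(summary["unanimous"])  # 0.5
```

Neither figure uses ground-truth labels, which is why a run of this kind can describe reproducibility but not accuracy.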
Study 002 · Ground-Truth Benchmark
Planned
Compare review outputs against expert benchmark mappings to measure accuracy (distinct from agreement). Requires a benchmark dataset and execution engine.
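The difference between accuracy and agreement can be shown with a small sketch. In the Python below, two review runs agree with each other on every record yet match a benchmark on only half of them. The function names, the `flag`/`pass` labels, and the benchmark values are hypothetical and invented for illustration; no benchmark dataset exists yet for this study.

```python
def match_rate(labels_a, labels_b):
    """Share of positions where two equal-length label lists match."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("label lists must be non-empty and the same length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


# Hypothetical expert benchmark and two hypothetical review runs.
benchmark = ["flag", "pass", "flag", "pass"]
run_a = ["flag", "flag", "flag", "flag"]
run_b = ["flag", "flag", "flag", "flag"]

agreement = match_rate(run_a, run_b)     # run vs. run: no benchmark involved
accuracy = match_rate(run_a, benchmark)  # run vs. benchmark

print(agreement)  # 1.0
print(accuracy)   # 0.5
```

The same comparison function yields both numbers; what changes is whether the second argument is another run or the benchmark.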
Study 003 · Condition Performance
Design complete
Track per-condition agreement and dispute rates from challenge and reviewer data as volume grows.
Study 004 · Framework Ambiguity
Design complete
Estimate reviewer variance and record ambiguity once multiple reviewers assess the same records.
Study 005 · Continuous Replication
Planned
Re-run reviews on a schedule to track drift and stability across framework versions. Requires scheduled backend compute.
Study 006 · Participant Recognition
Collecting
Recognition patterns from the One-Minute Challenge and Extended Review. Data accruing now; reported once the sample is adequate.
Study 007 · Learning Effect
Planned
Compare first vs. subsequent attempts. Requires repeat-participant identification across sessions.
Study 008 · Professional Reviewer
Collecting
Role and profession are captured at participation; patterns by reviewer type will be reported as numbers grow.
Study 009 · Organizational-Psychology Readiness
Building dataset
Assembling reliability, difficulty, agreement, and behavioral datasets for independent organizational-psychology review.
Study 010 · Criterion Validity (real-outcome)
Collecting
De-identified public determinations are paired with their documented real-world outcomes (upheld, overturned, challenged) to test whether JRS reads correspond to results when a record is contested. Cases are accruing across HR, public-records, and related domains in small batches. No results are reported until the sample is adequate.
Study 011 · AI-Assisted Records Detection
Collecting
Tests whether JRS distinguishes AI-generated records whose conclusions are grounded in their source from records that read convincingly but are not, judged blind by independent reviewers against a held-out key. The stimuli are constructed with known ground truth; reviewer reads are accruing. This study concerns detection of a known AI documentation risk, not accuracy or validation.
Status legend. Collecting = real data accruing from live participation. Design complete = methodology defined; awaiting sufficient data. Planned = requires infrastructure not yet in place (e.g., model API access or scheduled compute). No study reports validated findings; this is a validation-phase program.
What Would Count as Evidence (and What Would Falsify a Claim)
A JRS claim of usefulness would be supported only if independent reviewers, applying the five conditions to records they did not author, identify deficiencies that standard review misses, and agree with one another above chance. It would be weakened or falsified if reviewers cannot apply the conditions consistently, if flagged records are no less reviewable than unflagged ones under expert assessment, or if agreement is no better than chance. The current automated reproducibility check (independent models, synthetic records) does not bear on any of this; it only measures whether independent models agree on constructed records, not whether they are correct. Accuracy requires a ground-truth benchmark (Study 002); reliability requires multiple human reviewers (Study 004). Neither exists yet, which is why nothing here is presented as validated.
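One common way to check whether two reviewers "agree above chance" is Cohen's kappa, which compares observed agreement with the agreement expected if each reviewer assigned labels independently at their own base rates. The sketch below is an illustration only: the page does not say which statistic the program will use, and the function name, reviewer reads, and `flag`/`pass` labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(reads_a, reads_b):
    """Cohen's kappa for two reviewers labelling the same records."""
    n = len(reads_a)
    if n == 0 or n != len(reads_b):
        raise ValueError("reads must be non-empty and the same length")

    # Observed share of records where the two reviewers match.
    observed = sum(a == b for a, b in zip(reads_a, reads_b)) / n

    # Agreement expected by chance, from each reviewer's label frequencies.
    count_a, count_b = Counter(reads_a), Counter(reads_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when both reviewers use one label")
    return (observed - expected) / (1 - expected)


# Made-up reads, for illustration only.
kappa_mixed = cohens_kappa(
    ["flag", "flag", "pass", "pass"],
    ["flag", "pass", "pass", "pass"],
)
kappa_skewed = cohens_kappa(
    ["pass", "pass", "pass", "flag"],
    ["pass", "pass", "pass", "pass"],
)
print(kappa_mixed)   # 0.5
print(kappa_skewed)  # 0.0
```

Both made-up pairs match on three of four records, but the second pair scores zero because one reviewer used a single label throughout, so that level of matching is exactly what chance would produce. This is why raw agreement alone would not meet the "above chance" test described above.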
Origin & Approach
JRS originated from civil rights investigative and documentation-review experience, with a cognitive-behavioral and AI-governance lens.
Program Layers
Claim control: reproducibility, accuracy, and validation are distinct and are not treated as equivalent anywhere in this program. Figures shown are observational, not statistical findings or validated data.