In review
medRxiv · JAMIA Open · 286,510 encounters · Pre-registered OSF 2026-01-30
Falsification testing reveals sepsis prediction benchmarks measure hospital billing codes, not biological sepsis
Key finding — Across 286,510 ICU encounters from 209 hospitals, commonly cited
sepsis prediction benchmarks correlate more strongly with hospital coding practice than with any
biological marker of sepsis. The pre-registered falsification protocol replicates on a held-out
cohort. The implication: the "accuracy" numbers reported on the leading benchmarks are, in part,
measurements of administrative artifacts.
Background. Published sepsis AI models routinely report AUC > 0.90 on ICU
benchmarks. We pre-registered a falsification protocol to test whether these benchmarks measure
biological sepsis or administrative artifacts.
Methods. 286,510 ICU encounters from MIMIC-IV and eICU-CRD. Labels were
regenerated under three independent sepsis definitions: Sepsis-3 clinical criteria, biomarker
thresholds, and billing codes. Model performance was compared across the three label regimes.
Results. Models trained on billing-code labels retain AUC > 0.85 even
when key biological features are removed, while models trained on Sepsis-3 clinical criteria
degrade to AUC 0.68 under the same ablation. The gap replicates on eICU. Performance on the
billing benchmark is largely explained by care-process features observed after sepsis onset.
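The label-regime ablation described above can be sketched as follows. This is a minimal illustration on synthetic data: the feature layout, column roles, label-generating process, and effect sizes are assumptions for demonstration, not the study's data, features, or models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 6))
bio_cols = [0, 1, 2]    # stand-ins for biological features (hypothetical)
proc_cols = [3, 4, 5]   # stand-ins for care-process features (hypothetical)

# Two synthetic label regimes: one driven by a "biological" column,
# one driven by a "care-process" column.
regimes = {
    "sepsis3": (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int),
    "billing": (X[:, 3] + 0.5 * rng.normal(size=n) > 0).astype(int),
}

results = {}
for name, y in regimes.items():
    for ablated in (False, True):
        # Ablation removes the biological columns, leaving care-process columns.
        cols = proc_cols if ablated else list(range(X.shape[1]))
        tr, te = slice(0, n // 2), slice(n // 2, n)  # simple holdout split
        clf = LogisticRegression().fit(X[tr, cols], y[tr])
        auc = roc_auc_score(y[te], clf.predict_proba(X[te, cols])[:, 1])
        results[(name, "ablated" if ablated else "full")] = auc
```

Under this toy setup, the billing-style regime retains high AUC after ablation (its signal lives in the care-process columns), while the Sepsis-3-style regime collapses toward chance — the qualitative pattern the abstract reports.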
Conclusion. The leading sepsis benchmarks measure hospital coding practice
more than biological sepsis. Model evaluation should separate these signals before clinical
claims are made.
In review
npj Digital Medicine · 136,864 encounters · Pre-registered OSF 2026-02-12
Community-hospital sepsis workload: a 136,864-encounter study of institution-type bias in ICU AI
Key finding — Sepsis AI trained on academic medical-center data underperforms by
11–19 AUC points when deployed to community hospitals, and the gap does not close
with standard fine-tuning. Institution type is a structural confound in most public sepsis datasets.
Background. Public ICU datasets are dominated by academic medical centers (AMCs),
but the majority of US ICU admissions occur in community hospitals.
Methods. 136,864 encounters from eICU-CRD stratified by hospital type.
We trained sepsis models on AMC-only cohorts and evaluated on community cohorts, and vice versa.
Results. Cross-institution-type transfer loses 11–19 AUC points.
The gap persists after standard fine-tuning and is partially attributable to differences in
nursing documentation cadence, lab-test ordering patterns, and antibiotic-administration timing
— all of which are inputs to common sepsis models.
Conclusion. Institution type should be treated as a first-class covariate in ICU AI evaluation.
FDA submissions relying on AMC-trained models should include community-hospital validation
cohorts as standard practice.
In review
Research Square · NEJM submitted · 201,905 encounters · Pre-registered OSF 2026-03-10
ICU mortality miscalibration: published estimates underestimate elderly risk by 66–168% across three conditions
Key finding — Across MIMIC-IV and eICU-CRD (201,905 total encounters, 209 hospitals),
published ICU mortality estimates for elderly patients understate observed mortality by
66.3% (AF), 131.7% (diabetes), and 168.4% (MI). The miscalibration replicates
across two independent datasets and six condition × age strata.
Background. Clinical risk stratification tools rely on published ICU mortality
estimates that are often decades old, cohort-specific, or derived from meta-analyses with
limited subgroup resolution.
Methods. Pre-registered observational study of 201,905 ICU encounters from
MIMIC-IV and eICU-CRD. Six primary strata: diabetes, MI, AF, seizure, and PE, each crossed
with age > 70, plus COPD × age < 50. Observed mortality was computed and compared to
consensus published estimates.
Results. Five of six strata show divergences exceeding 60%, cross-validated
on the independent dataset. The seizure × elderly stratum shows a +510% divergence,
the largest literature gap in the series. COPD × young adults shows the inverse, −63.6%,
suggesting that published estimates for this stratum are inflated by selection bias.
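As a point of notation, the divergence percentages above are relative gaps between observed and published mortality. A minimal sketch of the arithmetic, using hypothetical rates rather than the study's values:

```python
def pct_divergence(observed, published):
    """Percent divergence of observed mortality from the published estimate:
    100 * (observed - published) / published. Positive means the published
    figure understates risk; negative means it overstates risk."""
    return 100.0 * (observed - published) / published

# Hypothetical rates for illustration only (not from the study):
# a published estimate of 10% against an observed rate of 16.6%
# gives a divergence of about +66%.
understated = pct_divergence(observed=0.166, published=0.10)

# The inverse case: observed 5% against published 10% gives about -50%,
# i.e. the published estimate is inflated.
overstated = pct_divergence(observed=0.05, published=0.10)
```

On this convention, a +168.4% divergence means observed mortality is roughly 2.7× the published estimate.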
Conclusion. Bedside risk stratification using literature-derived mortality
estimates is systematically miscalibrated for elderly ICU patients. Clinical AI trained on
or calibrated to these estimates inherits the same miscalibration.
Why only three papers? Because every paper on this page was either written by
Adam or explicitly approved by him before publication. We do not aggregate the field's output.
This is a library of
what we stand behind. The larger pipeline of engine-generated
findings — the ones not yet written up as peer-reviewed manuscripts — lives on the
Findings dashboard.