In review
medRxiv · JAMIA Open · 286,510 encounters · Pre-registered OSF 2026-01-30
Falsification testing reveals sepsis prediction benchmarks measure hospital billing codes, not biological sepsis
Key finding — Across 286,510 ICU encounters from 209 hospitals, commonly cited
sepsis prediction benchmarks correlate more strongly with hospital coding practice than with any
biological marker of sepsis. The pre-registered falsification protocol replicates on a held-out
cohort. The implication: the "accuracy" numbers reported on the leading benchmarks are, in part,
measurements of administrative artifacts.
Background. Published sepsis AI models routinely report AUC > 0.90 on ICU
benchmarks. We pre-registered a falsification protocol to test whether these benchmarks measure
biological sepsis or administrative artifacts.
Methods. 286,510 ICU encounters from MIMIC-IV and eICU-CRD. Labels were
regenerated under three independent sepsis definitions: Sepsis-3 clinical criteria, biomarker
thresholds, and billing codes. Model performance was compared across the three label regimes.
Results. Models trained on billing-code labels retain AUC > 0.85 even
when key biological features are removed, while models trained on Sepsis-3 clinical criteria
degrade to AUC 0.68 under the same ablation. The gap replicates on eICU. Performance on the
billing benchmark is largely explained by care-process features observed after sepsis onset.
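The label-regime ablation described above can be sketched as follows. This is a minimal illustration on synthetic data: the feature layout, column roles, label-generating process, and effect sizes are assumptions for demonstration, not the study's data, features, or models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 6))
bio_cols = [0, 1, 2]    # stand-ins for biological features (hypothetical)
proc_cols = [3, 4, 5]   # stand-ins for care-process features (hypothetical)

# Two synthetic label regimes: one driven by a "biological" column,
# one driven by a "care-process" column.
regimes = {
    "sepsis3": (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int),
    "billing": (X[:, 3] + 0.5 * rng.normal(size=n) > 0).astype(int),
}

results = {}
for name, y in regimes.items():
    for ablated in (False, True):
        # Ablation removes the biological columns, leaving care-process columns.
        cols = proc_cols if ablated else list(range(X.shape[1]))
        tr, te = slice(0, n // 2), slice(n // 2, n)  # simple holdout split
        clf = LogisticRegression().fit(X[tr, cols], y[tr])
        auc = roc_auc_score(y[te], clf.predict_proba(X[te, cols])[:, 1])
        results[(name, "ablated" if ablated else "full")] = auc
```

Under this toy setup, the billing-style regime retains high AUC after ablation (its signal lives in the care-process columns), while the Sepsis-3-style regime collapses toward chance — the qualitative pattern the abstract reports.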
Conclusion. The leading sepsis benchmarks measure hospital coding practice
more than biological sepsis. Model evaluation should separate these signals before clinical
claims are made.
In review
npj Digital Medicine · 136,864 encounters · Pre-registered OSF 2026-02-12
Community-hospital sepsis workload: a 136,864-encounter study of institution-type bias in ICU AI
Key finding — Sepsis AI trained on academic medical-center data underperforms by
11–19 AUC points when deployed to community hospitals, and the gap does not close
with standard fine-tuning. Institution type is a structural confound in most public sepsis datasets.
Background. Public ICU datasets are dominated by academic medical centers (AMCs),
but the majority of US ICU admissions occur in community hospitals.
Methods. 136,864 encounters from eICU-CRD stratified by hospital type.
We trained sepsis models on AMC-only cohorts and evaluated on community cohorts, and vice versa.
Results. Cross-institution-type transfer loses 11–19 AUC points.
The gap persists after standard fine-tuning and is partially attributable to differences in
nursing documentation cadence, lab-test ordering patterns, and antibiotic-administration timing
— all of which are inputs to common sepsis models.
Conclusion. Institution type should be treated as a first-class covariate in ICU AI evaluation.
FDA submissions relying on AMC-trained models should include community-hospital validation
cohorts as standard practice.
In review
Research Square · NEJM submitted · 201,905 encounters · Pre-registered OSF 2026-03-10
ICU mortality miscalibration: published estimates underestimate elderly risk by 66–168% across three conditions
Key finding — Across MIMIC-IV and eICU-CRD (201,905 total encounters, 209 hospitals),
published ICU mortality estimates for elderly patients understate observed mortality by
66.3% (AF), 131.7% (diabetes), and 168.4% (MI). The miscalibration replicates
across two independent datasets and six condition × age strata.
Background. Clinical risk stratification tools rely on published ICU mortality
estimates that are often decades old, cohort-specific, or derived from meta-analyses with
limited subgroup resolution.
Methods. Pre-registered observational study of 201,905 ICU encounters from
MIMIC-IV and eICU-CRD. Six primary strata: diabetes, MI, AF, seizure, and PE, each crossed
with age > 70, plus COPD × age < 50. Observed mortality was computed and compared to
consensus published estimates.
Results. Five of six strata show divergences exceeding 60%, cross-validated
on the independent dataset. The seizure × elderly stratum shows a +510% divergence,
the largest literature gap in the series. COPD × young adults shows the inverse, −63.6%,
suggesting that published estimates for this stratum are inflated by selection bias.
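As a point of notation, the divergence percentages above are relative gaps between observed and published mortality. A minimal sketch of the arithmetic, using hypothetical rates rather than the study's values:

```python
def pct_divergence(observed, published):
    """Percent divergence of observed mortality from the published estimate:
    100 * (observed - published) / published. Positive means the published
    figure understates risk; negative means it overstates risk."""
    return 100.0 * (observed - published) / published

# Hypothetical rates for illustration only (not from the study):
# a published estimate of 10% against an observed rate of 16.6%
# gives a divergence of about +66%.
understated = pct_divergence(observed=0.166, published=0.10)

# The inverse case: observed 5% against published 10% gives about -50%,
# i.e. the published estimate is inflated.
overstated = pct_divergence(observed=0.05, published=0.10)
```

On this convention, a +168.4% divergence means observed mortality is roughly 2.7× the published estimate.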
Conclusion. Bedside risk stratification using literature-derived mortality
estimates is systematically miscalibrated for elderly ICU patients. Clinical AI trained on
or calibrated to these estimates inherits the same miscalibration.
Why only three papers? Because every paper on this page was either written by
Adam or explicitly approved by him before publication. We do not aggregate the field's output.
This is a library of
what we stand behind. The larger pipeline of engine-generated
findings — the ones not yet written up as peer-reviewed manuscripts — lives on the
Findings dashboard.