Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents

Matthew Turk

Published: 23 May 2026, Last Modified: 23 May 2026. Venue: ACM CAIS 2026: RLEval Workshop Poster. License: CC BY 4.0

Keywords: Counterfactual evaluation, causal sensitivity, agent evaluation, RL reward signals, LLM-as-a-judge, clinical AI

TL;DR: We introduce a counterfactual evaluation metric for clinical LLMs and agents that measures whether recommendations appropriately change when patient facts change, revealing major capability differences hidden by standard coverage-based benchmarks.

Abstract: Two clinical AI systems can score nearly identically on coverage-based rubrics yet behave radically differently when their patient inputs change: one updates its recommendations to match the new clinical signal, the other produces the same output regardless. Standard evaluation cannot tell them apart. We introduce the Causal Sensitivity Score (CSS), a pre-registered interventional metric that mutates oncology tumor-board cases along five clinically meaningful dimensions (biomarker flips, prior-treatment failures, biomarker strips, surgery-status changes, stage perturbations) and scores in {0, 0.5, 1.0} whether each model’s recommendations update in the pre-registered correct direction. Benchmarked against the published Consensus Match Score (CMS), a coverage-based weighted recall, six frontier models from three labs in single-shot inference on 224 cases rank in nearly opposite orders on the two metrics: all six change rank, the CMS-worst model becomes CSS-best, and one model that is upper-mid on CMS is dead last on CSS. We further surface a universal safety blind spot under our pre-registered scoring rule: every frontier model fails on surgery-status interventions (≤ 17.2% CSS on Family D), a finding CMS does not expose. The metric transfers directly to tool-using agents: a ReAct-style experiment shows tool use lifts CSS for five of six models (+2.5 to +20.3pp), yet the lowest-CSS model retrieves the same chart sections as the others and still does not update its recommendations, suggestive of a structural-responsiveness deficit visible only under counterfactual evaluation. Cross-judge replication and three-rater medical-professional validation confirm the aggregate findings. Interventional pre-registered metrics like CSS complement coverage-based evaluation for clinical AI agents: they capture responsiveness signal coverage cannot, and offer a candidate dense reward for future agentic RL.
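As a rough illustration of how per-intervention scores in {0, 0.5, 1.0} could be rolled up into per-family and overall CSS values, the sketch below averages the scores. The abstract does not state the aggregation rule, so the simple mean, the function name `causal_sensitivity`, the family labels, and the toy records are all assumptions made for illustration; they are not the authors' implementation or data.

```python
from collections import defaultdict
from statistics import mean

# Score levels named in the abstract.
ALLOWED_SCORES = {0.0, 0.5, 1.0}


def causal_sensitivity(records):
    """Aggregate (family, score) pairs into per-family and overall means.

    Assumption: CSS is reported as the mean score; the source does not
    specify the aggregation rule.
    """
    by_family = defaultdict(list)
    for family, score in records:
        if score not in ALLOWED_SCORES:
            raise ValueError(f"score {score!r} is not in {sorted(ALLOWED_SCORES)}")
        by_family[family].append(score)
    per_family = {family: mean(scores) for family, scores in by_family.items()}
    overall = mean(s for scores in by_family.values() for s in scores)
    return per_family, overall


# Hypothetical toy records, not results from the paper.
toy_records = [
    ("biomarker_flip", 1.0),
    ("biomarker_flip", 0.5),
    ("surgery_status", 0.0),
    ("surgery_status", 0.5),
]

per_family, overall = causal_sensitivity(toy_records)
print(per_family, overall)
```

Reporting the per-family breakdown alongside the overall value matters here because a family-specific failure, such as the surgery-status result described in the abstract, can be masked by a single aggregate number.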

Submission Number: 4
