DEDUCTIVE CONSTRAINT SATISFACTION VS. PREVALENCE PRIORS: BENCHMARKING LLM LOGIC IN CLINICAL DIAGNOSTICS
Track: tiny / short paper (up to 4 pages)
Keywords: Large Language Models, Logical Reasoning, Constraint Satisfaction, Prevalence Anchoring, Clinical Diagnostics
TL;DR: Frontier LLMs show a critical flaw in clinical diagnostics: when explicit logical constraints point to a rare target, they abandon strict deductive reasoning in favor of prevalence-driven statistical guessing.
Abstract: Large language models are increasingly evaluated for complex decision-making, yet their ability to maintain logical invariance when deductive constraints conflict with statistical priors remains poorly characterized. We introduce DiagRare, a structured deductive benchmark of 696 clinical vignettes grounded in an expert-curated 58-disease biomedical ontology. By explicitly controlling present (modus ponens) and absent (modus tollens) evidence, DiagRare isolates logical constraint satisfaction from associative pattern matching. Evaluating five frontier LLMs reveals Top-1 accuracy ranging from 31.6% to 100% and exposes three phenomena. First, the weakest model exhibits prevalence-driven mode collapse, funneling 278 of its 476 errors into a single common diagnosis across cases whose true diagnoses span 45 unrelated diseases (diagnostic entropy collapses to 3.33 of a possible 5.86 bits). Second, mid-tier reasoning models exhibit an inductive downgrade: Claude Sonnet 4.6 achieves 99.7% Top-3 accuracy on rare diseases but only 86.3% Top-1, indicating that prevalence priors corrupt the final ranking of correctly deduced candidates. Third, we establish a deductive asymmetry: increasing the number of explicit negative constraints monotonically improves Claude's accuracy from 84.5% to 97.0%, whereas increasing the number of positive findings degrades it, demonstrating that modus tollens binds statistical priors more effectively than modus ponens.
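The metrics named in the abstract can be illustrated with a minimal Python sketch. It assumes that diagnostic entropy is the Shannon entropy (in bits) of a model's Top-1 predictions and that the 5.86-bit figure is the uniform ceiling over the 58-disease ontology (log2 58 ≈ 5.86). The abstract does not state either definition, so both are assumptions. The function names and the toy labels are hypothetical and are not taken from the DiagRare code or data.

```python
import math
from collections import Counter


def prediction_entropy_bits(predictions):
    """Shannon entropy (bits) of the empirical distribution of predicted labels.

    Assumption: this mirrors the "diagnostic entropy" in the abstract; the
    paper's exact definition is not given there.
    """
    counts = Counter(predictions)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def top_k_accuracy(ranked_predictions, targets, k):
    """Fraction of cases whose target appears among the first k ranked candidates."""
    hits = sum(target in ranked[:k] for ranked, target in zip(ranked_predictions, targets))
    return hits / len(targets)


# Uniform ceiling for a 58-label ontology (assumed source of the 5.86-bit figure).
MAX_ENTROPY_BITS = math.log2(58)

# Toy data with hypothetical labels: the target is often ranked second,
# so Top-3 accuracy exceeds Top-1 accuracy.
toy_ranked = [
    ["common_a", "rare_x", "rare_y"],
    ["rare_x", "common_a", "rare_y"],
    ["common_a", "rare_y", "rare_x"],
    ["rare_y", "common_a", "rare_x"],
]
toy_targets = ["rare_x", "rare_x", "rare_y", "rare_y"]

toy_top1 = top_k_accuracy(toy_ranked, toy_targets, k=1)
toy_top3 = top_k_accuracy(toy_ranked, toy_targets, k=3)
toy_entropy = prediction_entropy_bits([ranked[0] for ranked in toy_ranked])
```

In the toy data the correct rare label is always among the top three candidates but is displaced from first place by a common label in half the cases, which is the pattern the abstract calls an inductive downgrade. An entropy well below `MAX_ENTROPY_BITS` means the Top-1 predictions concentrate on a few labels.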
Presenter: ~Dharini_Raghavan1
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 163