On Arbitrary Predictions From Equally Valid Models

Published: 24 Dec 2025, Last Modified: 24 Dec 2025
Venue: MURE Workshop (Oral)
License: CC BY 4.0
Track: Main track
Published Or Accepted: false
Keywords: predictive multiplicity, model multiplicity, Rashomon Effect, Rashomon Set
TL;DR: We analyze predictive multiplicity at the patient level across medical prediction tasks, showing how equally valid models produce arbitrary predictions and how this can be mitigated.
Abstract: Model multiplicity describes the existence of multiple models that fit the data equally well but can produce different predictions on individual samples, a phenomenon known as predictive multiplicity. In medicine, such models can yield conflicting predictions for the same patient, a risk that is poorly understood and insufficiently addressed. In this study, we empirically analyze predictive multiplicity across multiple medical tasks and model architectures, and present practical strategies to mitigate it. Our analysis reveals that (1) standard validation metrics fail to identify a uniquely optimal model, and (2) functionally equivalent models vary in their patient-level predictions, so that the outcome under any single model is arbitrary and potentially harmful. However, predictive multiplicity does not affect all samples equally, and this unevenness can be exploited to reduce it. In contrast to previous research, we find that (3) high model capacity decreases predictive multiplicity by improving accuracy. Finally, (4) ensembles with an abstention strategy further improve both expected per-sample accuracy and stability. These findings show that predictive multiplicity is not a theoretical curiosity but a pervasive and practically significant issue in medical AI. We argue that accounting for multiplicity should be a core component of model evaluation and deployment in safety-critical domains.
Submission Number: 10
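
Illustration: the abstract does not specify how predictive multiplicity is measured or how the abstention rule is defined, so the following is only a minimal sketch under stated assumptions. It assumes binary labels, quantifies multiplicity as the fraction of samples on which equally valid models disagree (often called ambiguity in the multiplicity literature), and abstains when the ensemble's agreement falls below a threshold. The function names, the `min_agreement` parameter, and the toy predictions are hypothetical and are not taken from the paper.

```python
import numpy as np


def ambiguity(preds):
    """Fraction of samples on which at least two models disagree.

    preds: array of shape (n_models, n_samples) holding binary labels (0/1).
    """
    preds = np.asarray(preds)
    # A sample is ambiguous when not all models predict the same label.
    return float(np.mean(preds.min(axis=0) != preds.max(axis=0)))


def ensemble_with_abstention(preds, min_agreement=1.0):
    """Majority vote over models; returns -1 (abstain) for samples whose
    agreement fraction is below min_agreement (a hypothetical threshold)."""
    preds = np.asarray(preds)
    pos_frac = preds.mean(axis=0)               # share of models voting 1
    majority = (pos_frac >= 0.5).astype(int)    # majority label per sample
    agreement = np.maximum(pos_frac, 1.0 - pos_frac)
    return np.where(agreement >= min_agreement, majority, -1)


# Toy predictions from three equally accurate (hypothetical) models
# on four patients; rows are models, columns are patients.
preds = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 1, 1],
])

print(ambiguity(preds))                      # 0.5
print(ensemble_with_abstention(preds))       # [ 1  0 -1 -1]
print(ensemble_with_abstention(preds, 0.6))  # [1 0 1 0]
```

With full agreement required, the ensemble predicts only for the two patients on whom all models agree and abstains on the rest; lowering the threshold trades stability for coverage.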