Abstract: We investigate the role of uncertainty quantification in aiding medical decision-making. Existing evaluation metrics fail to capture the practical utility of joint human-AI decision-making systems. To address this, we introduce a novel framework to assess such systems and use it to benchmark a diverse set of confidence and uncertainty estimation methods. Our results show that certainty measures enable joint human-AI systems to outperform both standalone humans and AIs, and that for a given system there exists an optimal balance in the number of cases to refer to humans, beyond which the system’s performance degrades.