The Confidence Paradox: Unveiling the Latent Discriminative Power of Diffusion Large Language Models in Mathematical Reasoning

ACL ARR 2026 January Submission 3117 Authors

04 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: diffusion large language models, masked diffusion LM, uncertainty estimation, calibration, AUROC, ECE, mathematical reasoning, selective prediction, temperature scaling
Abstract: Diffusion large language models (DLLMs) have emerged as a promising alternative to autoregressive (AR) generation, uniquely offering token-level probabilities under bidirectional context. However, the semantics of their native uncertainty estimates remain underexplored. In this work, we uncover a calibration paradox inherent to the bidirectional generation mechanism of state-of-the-art DLLMs like LLADA-8B. Concretely, we demonstrate that diffusion confidence is structurally distinct from AR likelihood: on mathematical reasoning benchmarks, it is highly miscalibrated (31.2% ECE) yet possesses superior discriminative power (0.826 AUROC), significantly outperforming comparable AR baselines in single-pass settings (0.611 AUROC). We diagnose that this paradox arises because diffusion confidence functions less as a probability of correctness and more as a proxy for structural consistency enabled by the model’s bidirectional access to the entire solution path. We further show that lightweight post-hoc calibration can reconcile this gap, reducing ECE by over 60% while preserving the strong ranking signal. Our findings suggest that DLLMs offer a unique, cost-efficient uncertainty signal for reasoning tasks that complements expensive AR approaches.
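The abstract's central metrics (ECE for calibration, AUROC for discrimination) and its post-hoc fix (temperature scaling, a monotone remapping of confidences that cannot change ranking) can be illustrated with a short sketch. This is not the authors' code; all function names and the synthetic data are illustrative assumptions, showing only why temperature scaling can cut ECE while leaving AUROC untouched.

```python
# Illustrative sketch (not the paper's implementation): ECE, AUROC, and
# grid-search temperature scaling over per-sample confidences.
import numpy as np

def ece(conf, correct, n_bins=10):
    """Expected Calibration Error: weighted gap between mean confidence
    and accuracy within equal-width confidence bins."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.clip((conf * n_bins).astype(int), 0, n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            err += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return err

def auroc(conf, correct):
    """AUROC via the Mann-Whitney U formulation: the probability that a
    correct sample receives higher confidence than an incorrect one."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    pos, neg = conf[correct], conf[~correct]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()  # ties count as 0.5
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def calibrate(conf, correct, grid=np.linspace(0.25, 4.0, 64)):
    """Temperature scaling on the confidence logit: pick T minimizing ECE
    on held-out data. Sigmoid(z/T) is strictly monotone in z, so the
    confidence ranking (and hence AUROC) is preserved for any T."""
    eps = 1e-6
    c = np.clip(np.asarray(conf, dtype=float), eps, 1 - eps)
    z = np.log(c / (1 - c))  # logit of each confidence
    best_t = min(grid, key=lambda t: ece(1 / (1 + np.exp(-z / t)), correct))
    return 1 / (1 + np.exp(-z / best_t)), best_t
```

Because temperature scaling only rescales the confidence logit, `auroc(calibrate(conf, y)[0], y)` equals `auroc(conf, y)` exactly, which mirrors the abstract's claim that calibration reduces ECE while "preserving the strong ranking signal".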
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: calibration/uncertainty, robustness, probing, hardness of samples, adversarial attacks/examples/training
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 3117