Keywords: human disagreement, rating distribution, model calibration, source of miscalibration
Abstract: Model calibration measures how well a model's predicted probabilities align with its empirical accuracy. Recent studies show that deep learning models are often overconfident, and prior work primarily attributes this phenomenon to architectural and optimization-related factors.
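To make the notion of calibration concrete, the following is a minimal sketch of one common way to quantify it, the binned expected calibration error (ECE). The abstract does not say which calibration metric is used, so the choice of ECE, the function name `expected_calibration_error`, and the equal-width binning with `n_bins=10` are illustrative assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - confidence| over bins.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct:     1 if the prediction was right, 0 otherwise.
    Assumption: equal-width bins [lo, hi), with 1.0 placed in the last bin.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# An overconfident model: 90% confidence, but only 3 of 4 predictions correct.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

Under these assumptions, a model whose confidence exceeds its accuracy receives a positive ECE, while a model whose confidence matches its accuracy in every bin receives zero.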
In this work, we argue that miscalibration also originates from the supervision target itself. To construct ground-truth labels, standard supervised learning pipelines commonly aggregate annotations from multiple annotators into a single label or simplify fine-grained judgments. We show that these operations discard uncertainty inherent in human judgments and lead to model miscalibration.
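The difference between collapsing and preserving annotation uncertainty can be sketched as follows. This is an illustration under stated assumptions, not the paper's pipeline: majority voting as the aggregation rule, cross-entropy as the training loss, and the helper names `majority_label`, `rating_distribution`, and `cross_entropy` are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def majority_label(annotations, num_classes):
    """Collapse annotations into a one-hot target (assumed: majority vote).

    Ties are broken in favour of the lowest class index.
    """
    counts = Counter(annotations)
    best = max(range(num_classes), key=lambda k: (counts[k], -k))
    return [1.0 if k == best else 0.0 for k in range(num_classes)]


def rating_distribution(annotations, num_classes):
    """Preserve annotations as an empirical distribution over classes."""
    counts = Counter(annotations)
    return [counts[k] / len(annotations) for k in range(num_classes)]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


# Five annotators disagree 3-to-2 on a binary item.
annotations = [0, 0, 0, 1, 1]
hard = majority_label(annotations, num_classes=2)       # disagreement discarded
soft = rating_distribution(annotations, num_classes=2)  # disagreement retained

confident = [0.99, 0.01]  # near-certain prediction
hedged = [0.60, 0.40]     # prediction matching the annotator split

# The collapsed target rewards the confident prediction; the preserved
# distribution rewards the prediction that mirrors human disagreement.
hard_prefers_confident = cross_entropy(hard, confident) < cross_entropy(hard, hedged)
soft_prefers_hedged = cross_entropy(soft, hedged) < cross_entropy(soft, confident)
```

In this toy case the one-hot target pushes the model toward near-certainty on an item where annotators split 3 to 2, which is one way the discarded uncertainty could translate into overconfidence.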
We investigate the impact of preserving versus collapsing annotation uncertainty during training. Our results show that preserving annotation uncertainty substantially improves model calibration, achieves stronger predictive performance, and yields predictions that better reflect human judgments. These findings suggest that calibration depends not only on model architecture and optimization, but also on how uncertainty in human judgments is represented in the training signal. Notably, our temperature scaling experiments show that preserving annotation uncertainty during training largely eliminates the need for post-hoc calibration.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 59