AtC: Aggregate-then-Calibrate for Human-centered Assessment

Published: 26 Jan 2026, Last Modified: 01 Mar 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: human-centered assessment, judgment aggregation, calibration, misspecification, human-AI complementarity, human-AI collaboration
TL;DR: AtC aggregates human comparisons into a consensus $\hat{\pi}$ and isotonic-calibrates any model's scores via $\hat{\pi}$, delivering decision-ready assessments with guarantees on efficiency, robustness, and optimality.
Abstract: Human-centered assessment tasks, which are essential for systematic decision-making, rely heavily on human judgment and typically lack verifiable ground truth. Existing approaches face a dilemma: methods using only human judgments suffer from heterogeneous expertise and inconsistent rating scales, while methods using only model-generated scores must learn from imperfect proxies or incomplete features. We propose Aggregate-then-Calibrate (AtC), a two-stage framework that combines these complementary sources. Stage 1 aggregates heterogeneous comparative judgments into a consensus ranking $\hat{\pi}$ using a rank-aggregation model that accounts for annotator reliability. Stage 2 calibrates any predictive model's scores by an isotonic projection onto the order $\hat{\pi}$, enforcing ordinal consistency while preserving as much of the model's quantitative information as possible. Theoretically, we show: (1) modeling annotator heterogeneity yields strictly more efficient consensus estimation than assuming homogeneous annotators; (2) isotonic calibration enjoys risk bounds even when the consensus ranking is misspecified; and (3) AtC asymptotically outperforms model-only assessment. Across semi-synthetic and real-world datasets, AtC consistently improves accuracy and robustness over human-only or model-only assessments. Our results bridge judgment aggregation with model-free calibration, providing a principled recipe for human-centered assessment when ground truth is costly, scarce, or unverifiable.
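The Stage-2 projection described above can be sketched in a few lines: given a consensus ordering $\hat{\pi}$ and arbitrary model scores, the calibrated scores are the least-squares fit that is monotone along $\hat{\pi}$, computable with the standard pool-adjacent-violators algorithm (PAVA). This is a minimal illustrative sketch, not the authors' implementation; the `pava` and `calibrate_scores` names and the worst-to-best ranking convention are assumptions for the example.

```python
def pava(y):
    """Pool Adjacent Violators: nondecreasing least-squares fit to y."""
    sums, counts = [], []  # block partial sums and sizes
    for v in y:
        s, c = float(v), 1
        # merge backwards while the previous block's mean exceeds this one's
        while sums and sums[-1] / counts[-1] > s / c:
            s += sums.pop()
            c += counts.pop()
        sums.append(s)
        counts.append(c)
    fitted = []
    for s, c in zip(sums, counts):
        fitted.extend([s / c] * c)
    return fitted


def calibrate_scores(scores, ranking):
    """Project model scores onto the monotone cone induced by the
    consensus ranking (item indices listed from worst to best)."""
    ordered = [scores[i] for i in ranking]   # scores in consensus order
    fitted = pava(ordered)                   # enforce ordinal consistency
    calibrated = [0.0] * len(scores)
    for pos, i in enumerate(ranking):        # map back to item positions
        calibrated[i] = fitted[pos]
    return calibrated
```

When the model's scores already respect $\hat{\pi}$, the projection is the identity; when they violate it, only the violating items are pooled to a common value, which is how the method preserves as much of the model's quantitative information as the ordinal constraint allows.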
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 12500