Keywords: Uncertainty estimation, epistemic/aleatoric decomposition, calibration, adversarial NLI, instruction-tuned LLMs
TL;DR: URC2 is an uncertainty-routed Human–LLM relabeling pipeline that disentangles aleatoric vs. epistemic uncertainty to improve calibration (–30% ECE) and reliability on ANLI without sacrificing accuracy.
Abstract: Adversarial NLI (ANLI) reveals distribution-shift failures that static benchmarks miss, motivating evaluation and curation that are explicitly uncertainty-aware. We present URC2 (Uncertainty-Routed Curation & Calibration), a three-stage pipeline that improves dataset quality and model reliability. URC2 decomposes per-example predictive uncertainty into aleatoric (data/label ambiguity) and epistemic (model uncertainty, measured via mutual information) using a three-teacher ensemble (DeBERTa-v3-large, RoBERTa-large, XLM-R-large). A two-lane relabeling workflow then routes cases: a Human lane relabels, removes, or down-weights aleatoric-heavy examples, while an LLM lane adjudicates epistemic-heavy examples using instruction-tuned self-consistency checks. Curated labels and per-example weights drive a lightweight retraining and recalibration loop for each teacher, yielding an updated ensemble. On ANLI, URC2 reduces development-set expected calibration error by 30% (to 0.146) and lowers corpus-level uncertainty without degrading accuracy. Unlike prior curation pipelines that treat uncertainty monolithically, URC2 exploits the distinction between aleatoric and epistemic uncertainty to enable uncertainty-aware control, identifying recurring failure modes and mitigating them via targeted reweighting and data augmentation. URC2 provides a practical, reproducible recipe for building more trustworthy NLI systems under adversarial shift.
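The aleatoric/epistemic split described in the abstract follows the standard ensemble-based decomposition: total predictive entropy minus expected per-member entropy equals the mutual information between prediction and model, which serves as the epistemic term. A minimal sketch of that decomposition (the function name and the example teacher outputs are illustrative, not taken from the paper):

```python
import numpy as np

def decompose_uncertainty(probs):
    """Split ensemble predictive uncertainty into aleatoric and epistemic parts.

    probs: array of shape (n_teachers, n_classes), each row a teacher's
           softmax distribution over NLI labels.
    Returns (total, aleatoric, epistemic) in nats, where
      total     = H[ mean_t p_t ]            (predictive entropy)
      aleatoric = mean_t H[ p_t ]            (expected data uncertainty)
      epistemic = total - aleatoric          (mutual information)
    """
    eps = 1e-12  # guard against log(0)
    mean_p = probs.mean(axis=0)
    total = -np.sum(mean_p * np.log(mean_p + eps))
    aleatoric = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    epistemic = total - aleatoric
    return total, aleatoric, epistemic

# Illustrative case: three confident but disagreeing teachers.
# Disagreement with individual confidence yields a high epistemic share,
# the kind of example the pipeline would route to the LLM adjudication lane.
probs = np.array([[0.90, 0.05, 0.05],
                  [0.05, 0.90, 0.05],
                  [0.05, 0.05, 0.90]])
total, alea, epi = decompose_uncertainty(probs)
```

Conversely, three teachers that all output a flat distribution would give high aleatoric but near-zero epistemic uncertainty, routing the example to the Human lane instead.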
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15355