Choosing Training-Time Calibration Objectives for Frozen Foundation-Model Features: A Linear-Probing Benchmark
Keywords: calibration, linear probing, foundation models, expected calibration error, temperature scaling, training-time calibration objectives, CLIP, DINOv2
TL;DR: On 15 frozen-feature settings (CLIP, DINOv2, CNNs), calibration-aware linear probing reveals a clean representation-family split: the best training-time objective depends on the backbone, and no single loss wins universally.
Abstract: Calibration objectives for deep classifiers have historically been designed under end-to-end training. Foundation models, however, are increasingly used through frozen-feature adaptation, where full fine-tuning for recalibration is often infeasible. Post-hoc temperature scaling is cheap but limited to a scalar transform of the logits. We ask whether calibration-aware linear probing, which relearns only the head under a calibration objective, can occupy the middle ground. Across 15 dataset–model settings spanning CLIP, DINOv2, same-domain CNNs, and cross-domain CNN transfer, the answer is a clean representation-family split rather than a universal winning loss. CLIP gains, when present, come from a direct confidence–accuracy penalty. DINOv2 leaves little reliable headroom beyond temperature scaling. Same-domain CNNs favor confidence- and margin-sensitive reweighting, including a diagnostic V-family introduced here. Calibration-aware probing therefore serves both as a lightweight recalibration tool and as a diagnostic that exposes how frozen representations encode confidence. Objective choice is part of evaluating uncertainty on frozen foundation-model features, not a minor implementation detail.
Paper Type: Long (8 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 143
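For readers unfamiliar with the two baseline quantities named in the keywords and abstract, the following is a minimal NumPy sketch of expected calibration error (ECE) and post-hoc temperature scaling. It is an illustration under stated assumptions, not the authors' implementation: the function names, the 15 equal-width confidence bins, and the grid search over temperatures (used here in place of a gradient-based optimizer) are all choices made for this example.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits divided by a scalar temperature."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def expected_calibration_error(probs, labels, n_bins=15):
    """ECE with equal-width confidence bins (n_bins=15 is an assumed default)."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin ids fall in 0..n_bins-1 with right-closed bins.
    bin_ids = np.digitize(conf, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Bin weight times the |accuracy - mean confidence| gap in that bin.
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)


def negative_log_likelihood(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))


def fit_temperature(logits, labels, grid=None):
    """Pick the scalar temperature that minimizes held-out NLL.

    A log-spaced grid search is an assumption of this sketch; it only
    illustrates that temperature scaling fits a single scalar.
    """
    if grid is None:
        grid = np.exp(np.linspace(np.log(0.05), np.log(20.0), 200))
    losses = [negative_log_likelihood(softmax(logits, t), labels) for t in grid]
    return float(grid[int(np.argmin(losses))])
```

In use, `fit_temperature` would be called on held-out logits from the frozen-feature linear head, and `expected_calibration_error` compared before and after dividing the logits by the fitted temperature. Because the temperature is one scalar shared by all classes and inputs, it cannot change the predicted class, which is the limitation the abstract refers to.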