Keywords: contamination-aware evaluation, temporal screening, benchmark leakage, NF4 quantization, calibration, temperature scaling, uncertainty quantification, selective prediction, reproducibility, foundation models
TL;DR: TimeAlign is a contamination-aware, resource-efficient evaluation framework that detects leaked items with 5-shingle Jaccard similarity and post-T0 screening. It applies lightweight decontamination and temperature scaling to yield calibrated results.
Abstract: Evaluating foundation models under tight computational limits often hides contamination that inflates reported performance. We present TimeAlign, a contamination-aware evaluation framework built for resource-constrained settings. TimeAlign combines temporal screening, $5$-shingle Jaccard decontamination, and quantization-aware calibration to ensure validity with minimal compute. The detector reaches precision $P = 1.0$, recall $R = 0.96$, and inter-annotator agreement $\kappa \approx 0.94$. Screening against $30{,}700$ post-$T_0$ documents removes $33.3\%$ of overlapping items across MMLU, MMLU-Pro, and ARC. A case study shows that contamination can inflate accuracy by $74.5$ percentage points: a model scoring $99.5\%$ on contaminated data drops to $25.0\%$ after decontamination.
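As an illustration of the $5$-shingle Jaccard check, a minimal sketch follows. It assumes whitespace tokenization on lowercased text and a hypothetical flagging threshold of $0.5$; the abstract specifies neither, and the function names are illustrative rather than part of TimeAlign.

```python
def shingles(text, k=5):
    """Return the set of k-token shingles (word k-grams) of a text."""
    tokens = text.lower().split()  # assumption: whitespace tokenization
    if len(tokens) < k:
        # Texts shorter than k tokens yield a single short shingle (or none).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two shingle sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def is_contaminated(item, corpus, threshold=0.5):
    """Flag an item if any corpus document reaches the similarity threshold.

    The threshold value is a placeholder, not a value taken from the paper.
    """
    item_shingles = shingles(item)
    return any(jaccard(item_shingles, shingles(doc)) >= threshold for doc in corpus)
```

In this sketch, a benchmark item would be compared against each post-$T_0$ document and dropped when flagged; a production pipeline would likely use hashing or MinHash to avoid the pairwise scan.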
On clean benchmarks, Llama-3.1-8B (FP16) attains MMLU accuracy $A = 67.5\%$, and its NF4-quantized variant loses only $\Delta A \approx 1.7$ points. Temperature scaling with a scalar $T \in [2.2, 2.5]$ halves the Smooth-ECE and achieves a normalized risk-coverage score of $n\text{AURC} < 0.22$. A 720-item evaluation finishes within 8 hours on a single 24 GB RTX 4090, with less than $2\%$ overhead.
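Temperature scaling divides the logits by a single scalar $T$ chosen to minimize negative log-likelihood on held-out data. The sketch below fits $T$ by grid search using only the standard library; the grid range, step, and function names are assumptions for illustration, and the abstract does not state how TimeAlign fits $T$.

```python
import math


def softmax(logits, T=1.0):
    """Numerically stable softmax of logits divided by temperature T."""
    scaled = [z / T for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(all_logits, labels, T):
    """Mean negative log-likelihood of the true labels at temperature T."""
    losses = [
        -math.log(softmax(logits, T)[label] + 1e-12)
        for logits, label in zip(all_logits, labels)
    ]
    return sum(losses) / len(losses)


def fit_temperature(all_logits, labels, lo=0.5, hi=5.0, steps=91):
    """Pick the grid value of T that minimizes held-out NLL.

    The grid bounds and resolution are illustrative assumptions.
    """
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda T: nll(all_logits, labels, T))
```

Because dividing by a positive scalar preserves the ordering of the logits, this adjustment changes confidence estimates but not predicted labels, so accuracy is unaffected.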
TimeAlign demonstrates that rigorous, contamination-free evaluation is achievable even under limited computational resources. It shows that efficiency and validity can coexist when guided by temporal screening and supported by uncertainty calibration and quantization.
Submission Number: 227