Keywords: contamination-aware evaluation, temporal screening, benchmark leakage, NF4 quantization, calibration, temperature scaling, uncertainty quantification, selective prediction, reproducibility, foundation models
TL;DR: TimeAlign is a contamination-aware, resource-efficient evaluation framework that detects leaked benchmark items with 5-shingle Jaccard similarity and post-T0 screening, and applies lightweight decontamination and temperature scaling to yield calibrated results.
Abstract: Evaluating foundation models under limited memory and compute budgets demands careful methodology as well as efficiency. We present TimeAlign, a contamination-aware framework that combines temporal screening, lightweight decontamination, and uncertainty quantification. The system uses an automated five-shingle Jaccard detector $(\kappa \approx 0.94)$ together with post-$T_0$ scanning over $30{,}700$ news documents, which enables robust identification of leaked benchmark items. Experiments with Llama-3.1-8B and Qwen2.5-7B on MMLU, MMLU-Pro, and ARC show that contamination can artificially inflate scores by as much as $74.5\%$. In one case study, a fine-tuned contract QA model that appeared to achieve $99.5\%$ accuracy dropped to $25.0\%$ when near-duplicate items were removed, underscoring the severity of leakage. To support practical use, TimeAlign integrates quantization-aware calibration. NF4 quantization alone causes negligible degradation, while a simple temperature scaling step lowers Smooth-ECE by $54\%$. We further introduce normalized risk-coverage curves that make selective prediction behavior comparable across benchmarks. The framework provides reproducible artifacts, including per-item predictions, contamination reports, and evaluation manifests, which ensure transparency and continual benchmarking. TimeAlign therefore establishes a low-overhead yet rigorous solution for contamination-aware evaluation of foundation models in resource-constrained environments.
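To make the detection step concrete, the following is a minimal sketch of word-level 5-shingle Jaccard matching as described in the abstract. The function names, the word-level tokenization, and the 0.8 flagging threshold are illustrative assumptions, not details taken from the paper or its released code.

```python
# Minimal sketch (assumed implementation) of 5-shingle Jaccard contamination
# detection: a benchmark item is flagged as leaked if its word-level 5-shingle
# set overlaps any corpus document above a similarity threshold.

def shingles(text: str, k: int = 5) -> set[tuple[str, ...]]:
    """Return the set of word-level k-shingles of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| (0.0 when both sets are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def flag_leaked(item_text: str, corpus_docs: list[str], threshold: float = 0.8) -> bool:
    """Flag a benchmark item whose 5-shingle overlap with any document meets the threshold.

    The 0.8 threshold is a placeholder; the paper's actual operating point is not
    specified here.
    """
    item_shingles = shingles(item_text)
    return any(jaccard(item_shingles, shingles(doc)) >= threshold for doc in corpus_docs)
```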
Submission Number: 30