Keywords: Large Language Models, Data Contamination, Memorization, Benchmark Evaluation, Trustworthy AI, Model Assessment, Data-Centric Evaluation
TL;DR: A data-centric framework to detect and adjust for contamination and memorization in LLM evaluation, ensuring more trustworthy benchmarking.
Abstract: Large Language Models (LLMs) continue to achieve remarkable results, yet their evaluation is increasingly undermined by data-centric challenges such as contamination, memorization, and benchmark bias, which threaten the reliability of reported performance. To address these issues, we propose DC-Guard (Data-Centric Guard), a unified framework for trustworthy evaluation of LLMs. The framework introduces three novel components: the Memorization Consistency Index (MCI) to probe hidden memorization, the Benchmark Ecology Score (BES) to quantify representativeness relative to real-world corpora, and the Contamination-Resilient Metric Adjustment (CRMA) to correct evaluation scores for contamination risk. Together, these elements provide contamination-aware, bias-adjusted, and reproducible assessments. Beyond presenting this methodology, we discuss open challenges in maintaining robust evaluations under evolving data sources and shifting usage contexts. DC-Guard offers principled guardrails for fair and transparent benchmarking of large-scale language models.
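The abstract does not spell out how CRMA corrects scores; a minimal sketch of the general idea, assuming a simple linear discount of a raw benchmark score by an estimated contamination risk (a hypothetical stand-in, not the paper's actual formulation), might look like:

```python
# Illustrative sketch only: the CRMA formula is not given in the abstract,
# so the linear discount below is an assumed placeholder, not DC-Guard's method.

def contamination_adjusted_score(raw_score: float, contamination_risk: float) -> float:
    """Discount a benchmark score by an estimated contamination risk in [0, 1]."""
    if not 0.0 <= contamination_risk <= 1.0:
        raise ValueError("contamination_risk must lie in [0, 1]")
    # The higher the estimated probability that test items leaked into the
    # training data, the less credit the raw score receives.
    return raw_score * (1.0 - contamination_risk)


if __name__ == "__main__":
    # Example: a model scoring 0.82 with an estimated 25% contamination risk.
    print(contamination_adjusted_score(0.82, 0.25))  # 0.615
```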
Submission Number: 62