Keywords: large language models, model evaluation, malicious actors
TL;DR: We analyze current contamination detection methods and find a significant vulnerability in their assumptions that can be easily exploited by malicious actors.
Abstract: The benchmark performance of large language models (LLMs) has a high impact on their popularity and is thus of great importance to many model providers. However, the reliability of such benchmark scores as a measure of model quality is compromised if the model is contaminated with benchmark data. While recent contamination detection methods attempt to address this issue, they overlook the possibility of deliberate contamination by malicious model providers aiming to evade detection. We propose a categorization of model providers based on their (de)contamination practices and argue that malicious contamination is of crucial importance as it casts doubt on the reliability of public benchmarks. To study this issue more rigorously, we analyze current contamination detection methods based on their assumptions. This analysis reveals a significant vulnerability in existing approaches: they do not account for rephrased benchmark data used during training by malicious actors. We demonstrate how exploiting this gap can result in significantly inflated benchmark scores while completely evading current detection methods.
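To make the threat model concrete, the following is a minimal, hypothetical sketch (not the authors' code) of how a malicious provider could inject *rephrased* benchmark items into a fine-tuning set; the `paraphrase` function here is a trivial template stub standing in for an LLM-based rewriter, and all file names and field names are assumptions for illustration only.

```python
# Hypothetical illustration: contaminating training data with rephrased
# benchmark items, which exact-match contamination detectors would not flag.
import json
import random

# Toy stand-in for a benchmark; a real benchmark would be loaded from disk.
BENCHMARK = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who wrote 'Pride and Prejudice'?", "answer": "Jane Austen"},
]


def paraphrase(text: str) -> str:
    """Placeholder paraphraser: in practice an LLM would rewrite the text so its
    surface form differs from the benchmark while the meaning is preserved."""
    templates = [
        "Please answer the following: {q}",
        "Here is a question for you. {q}",
        "Answer concisely: {q}",
    ]
    return random.choice(templates).format(q=text)


def build_contaminated_split(benchmark, path="contaminated_train.jsonl"):
    """Write rephrased benchmark items as ordinary-looking fine-tuning examples."""
    with open(path, "w") as f:
        for item in benchmark:
            record = {
                "prompt": paraphrase(item["question"]),
                "completion": item["answer"],
            }
            f.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    build_contaminated_split(BENCHMARK)
```

Fine-tuning on such a file would expose the model to the benchmark's answers without ever reproducing the original question strings, which is exactly the gap in detection assumptions the abstract describes.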
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9617