Benchmark Probing: Investigating Data Leakage in Large Language Models

Published: 28 Oct 2023, Last Modified: 13 Mar 2024. NeurIPS 2023 BUGS Workshop Poster.
Keywords: Data Contamination, Benchmark Probing, Large Language Model, Natural Language Processing
TL;DR: We propose a novel investigation protocol to detect potential leakage of benchmark test sets into model training data, suitable for both open-source and closed-source models.
Abstract: Large language models (LLMs) have consistently demonstrated exceptional performance across a wide range of natural language processing tasks. However, concerns have been raised about whether LLMs rely on benchmark data during their training phase, potentially leading to inflated scores on these benchmarks. This phenomenon, known as data contamination, presents a significant challenge within the context of LLMs. In this paper, we present a novel investigation protocol named $\textbf{T}$est$\textbf{s}$et $\textbf{S}$lot Guessing ($\textbf{TS-Guessing}$), applied to the knowledge-required benchmarks MMLU and TruthfulQA, designed to estimate the contamination of emerging commercial LLMs. We divide this protocol into two subtasks: (i) the $\textit{Question-based}$ setting: guessing the missing portion of a long and complex question in the test set; and (ii) the $\textit{Question-Multichoice}$ setting: guessing the missing option given both a complicated question and its remaining options. We find that commercial LLMs can surprisingly fill in the absent data, and that they show a remarkable increase given additional metadata (from 22.28\% to 42.19\% for Claude-instant-1 and from 17.53\% to 29.49\% for GPT-4).
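To make the protocol concrete, the following is a minimal sketch of the $\textit{Question-Multichoice}$ setting: one answer option is replaced with a mask token, the model is asked to reconstruct it, and verbatim reproduction is taken as a contamination signal. The function and prompt wording here are illustrative assumptions, not the paper's exact implementation; a real probe would send the prompt to an LLM API.

```python
# Hypothetical sketch of the TS-Guessing "Question-Multichoice" setting.
# One multiple-choice option is masked; reproducing it verbatim suggests
# the model may have seen the benchmark item during training.

def build_probe_prompt(question: str, options: list[str], masked_idx: int) -> str:
    """Replace one option with [MASK] and ask the model to guess it."""
    shown = [opt if i != masked_idx else "[MASK]" for i, opt in enumerate(options)]
    lettered = [f"{chr(65 + i)}. {opt}" for i, opt in enumerate(shown)]
    return (
        "Fill in the [MASK] option exactly as it appears in the benchmark.\n"
        f"Question: {question}\n" + "\n".join(lettered)
    )

def is_exact_match(prediction: str, gold: str) -> bool:
    """Contamination signal: case- and whitespace-insensitive verbatim match."""
    return prediction.strip().lower() == gold.strip().lower()

if __name__ == "__main__":
    prompt = build_probe_prompt(
        "What is the capital of France?",
        ["Paris", "London", "Berlin", "Madrid"],
        masked_idx=0,
    )
    print(prompt)
    # An LLM's reply would then be scored with is_exact_match(reply, "Paris").
```

In the paper's evaluation, the rate of such exact matches (optionally boosted by benchmark metadata in the prompt) serves as the contamination estimate reported above.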
Submission Number: 21