DyCodeEval: Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination
TL;DR: We identify the limitations of static benchmarking for Code LLMs and propose a dynamic benchmarking approach.
Abstract: The rapid advancement of code large language models (Code LLMs) underscores the critical need for effective and transparent benchmarking methods. However, current benchmarking predominantly relies on publicly available, human-created datasets. The widespread use of these static benchmark datasets makes the evaluation process particularly susceptible to data contamination—an unavoidable consequence of the extensive data collection processes employed during LLM training. Existing methods for addressing data contamination typically face significant limitations, including reliance on substantial human effort and difficulty in managing class imbalances. To overcome these challenges, we propose DyCodeEval, a novel benchmarking suite specifically designed to evaluate Code LLMs under realistic contamination scenarios. Given an initial seed programming problem, DyCodeEval utilizes multiple agents to systematically extract and modify contextual information without changing the core logic, generating semantically equivalent variations. We introduce a dynamic data generation method and conduct extensive empirical studies on two seed datasets involving 18 Code LLMs. The results demonstrate that DyCodeEval effectively assesses the reasoning capabilities of Code LLMs under contamination conditions while producing diverse problem variants, thereby ensuring robust and consistent benchmarking outcomes.
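To make the abstract's pipeline concrete, here is a minimal, hypothetical sketch of the dynamic-rewriting idea: LLM agents rewrite a seed problem's surface context while preserving its core logic, and the model under test is then scored on the rewritten variants. All function names, prompts, and signatures below (e.g., `LLMComplete`, `run_reference_tests`) are illustrative assumptions, not the DyCodeEval implementation.

```python
from typing import Callable, List

# Hypothetical signature: a callable that sends a prompt to an LLM agent and
# returns its text completion; any real backend could be plugged in here.
LLMComplete = Callable[[str], str]

REWRITE_PROMPT = (
    "Rewrite the programming problem below so that its scenario, entity names, "
    "and wording change, but the required input/output behavior and core logic "
    "remain identical.\n\nProblem:\n{problem}\n\nRewritten problem:"
)

def generate_variants(seed_problem: str,
                      agents: List[LLMComplete],
                      n_per_agent: int = 3) -> List[str]:
    """Ask each rewriting agent for several context-level rewrites of the seed."""
    variants = []
    for agent in agents:
        for _ in range(n_per_agent):
            variants.append(agent(REWRITE_PROMPT.format(problem=seed_problem)))
    return variants

def evaluate_model(model_under_test: LLMComplete,
                   variants: List[str],
                   run_reference_tests: Callable[[str], bool]) -> float:
    """Score a Code LLM on the rewritten problems.

    Because each variant is intended to be semantically equivalent to the seed,
    the seed's reference tests (assumed to be wrapped in `run_reference_tests`,
    which takes generated code and returns pass/fail) can still judge correctness.
    """
    if not variants:
        return 0.0
    passed = sum(run_reference_tests(model_under_test(v)) for v in variants)
    return passed / len(variants)
```

In this sketch, a model that merely memorized the seed problem's public solution would be expected to score lower on the rewritten variants than on the original, which is the contamination signal the benchmark aims to expose.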
Lay Summary: Large language models (LLMs) are increasingly used to write code and solve programming tasks. But evaluating whether these models truly understand code is challenging. Most benchmarks rely on public test problems, which may already appear in the models’ training data—like giving students an exam they've already seen.
To address this, we developed a new "dynamic" way to evaluate code LLMs more fairly. Starting from a few original problems, we automatically generate new but logically equivalent versions using LLM agents. This helps test whether the model can reason about the problem rather than just recall memorized answers.
Our approach provides a more robust and diverse benchmark for assessing code LLMs, helping researchers and developers better understand what these models can and cannot do in realistic settings.
Link To Code: https://codekaleidoscope.github.io/dycodeeval.html
Primary Area: Deep Learning->Large Language Models
Keywords: benchmarking, code generation, large language model, trustworthy ML
Submission Number: 14922