Keywords: Code Large Language Model, Code Reasoning, Dynamic Benchmark
Abstract: In this paper, we tackle a critical challenge in model evaluation: how to keep code benchmarks useful when models may have already seen them during training. We introduce a dynamic benchmarking framework to address this challenge. Given a code understanding or reasoning benchmark, our framework dynamically transforms each input program with various semantic-preserving mutations to build a syntactically new yet semantically identical benchmark. We evaluated 10 popular language models on our dynamic benchmarks. Our evaluation reveals several interesting or surprising findings: (1) all models perform significantly worse than before, (2) the ranking between some models shifts dramatically, and (3) dynamic benchmarks resist the data contamination problem.
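To make the idea of a semantic-preserving mutation concrete, the sketch below shows one common transformation of this kind: consistently renaming function parameters and local variables so the program's syntax changes while its behavior does not. It uses Python's standard `ast` module; the class name `RenameLocals` and the overall structure are illustrative assumptions, not the framework's actual implementation.

```python
# Illustrative sketch (not the paper's implementation): one semantic-preserving
# mutation that renames parameters and locally bound variables to fresh,
# meaning-free identifiers, producing a syntactically different but
# behaviorally identical program.
import ast


class RenameLocals(ast.NodeTransformer):
    """Consistently rename locally bound identifiers within a function."""

    def __init__(self):
        self.mapping = {}

    def _fresh(self, original):
        # Reuse the same fresh name for repeated occurrences of an identifier.
        return self.mapping.setdefault(original, f"v{len(self.mapping)}")

    def visit_FunctionDef(self, node):
        # Rename the parameters first, then rewrite the body with the same map.
        for arg in node.args.args:
            arg.arg = self._fresh(arg.arg)
        self.generic_visit(node)
        return node

    def visit_Name(self, node):
        if isinstance(node.ctx, ast.Store):
            # A binding occurrence (assignment, loop target) gets a fresh name.
            node.id = self._fresh(node.id)
        elif node.id in self.mapping:
            # A use of an already-renamed local is rewritten; globals like
            # `len` are left untouched.
            node.id = self.mapping[node.id]
        return node


source = """
def average(numbers):
    total = 0
    for x in numbers:
        total += x
    return total / len(numbers)
"""

tree = RenameLocals().visit(ast.parse(source))
print(ast.unparse(tree))  # Same behavior, different surface syntax.
```

A framework in this spirit would compose several such mutations (e.g., renaming, statement reordering where safe, loop restructuring) and re-verify input-output equivalence before releasing the mutated benchmark.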
Submission Number: 38