CodeMirage: Stress-Testing AI-Generated Code Detectors Against Production-Level LLMs

ICLR 2026 Conference Submission 13827 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Language Models, AI-generated Code Detection, Adversarial Perturbation
Abstract: Large language models (LLMs) are increasingly integrated into software development, generating substantial volumes of source code. While they enhance productivity, their misuse raises serious concerns, including plagiarism, license violations, and the propagation of insecure code. Robust detection of AI-generated code is therefore essential and requires benchmarks that faithfully reflect real-world conditions. Existing benchmarks, however, are limited in scope: they cover only a few programming languages and rely on less capable models. In this paper, we introduce ***CodeMirage***, a comprehensive benchmark that addresses these gaps through three key contributions: (1) coverage of ten widely used programming languages, (2) inclusion of both original and perturbed code from ten state-of-the-art, production-level LLMs, and (3) six progressively challenging tasks across four evaluation configurations. Using ***CodeMirage***, we evaluate ten representative detectors spanning four methodological paradigms under realistic settings, with performance reported across three complementary metrics. Our analysis yields eight key findings that reveal the strengths and limitations of current detectors and highlight critical challenges for future research. We believe ***CodeMirage*** provides a rigorous and practical testbed to drive the development of more robust and generalizable AI-generated code detectors.
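The abstract does not specify the detector interface or the three metrics; as a rough illustration only, the sketch below shows how a detector might be scored on original versus perturbed code in a setup of this kind. The function name `score_code`, the placeholder heuristic, the toy samples, and the choice of AUROC are all assumptions for demonstration, not details taken from the paper.

```python
# Illustrative sketch only: CodeMirage's actual tasks, detectors, and metrics are
# defined in the paper. Everything named here is a stand-in for demonstration.
from sklearn.metrics import roc_auc_score


def score_code(snippet: str) -> float:
    """Hypothetical detector: returns a score interpreted as P(snippet is AI-generated).

    A real detector (e.g., a fine-tuned classifier or a zero-shot statistic)
    would replace this placeholder heuristic, which exists only so the
    example runs end to end.
    """
    return min(1.0, len(snippet) / 200.0)


# Toy benchmark slice: (code, label) pairs, label 1 = AI-generated, 0 = human-written.
human_code = [("def add(a, b):\n    return a + b", 0)]
ai_code = [(
    "def add_numbers(first_number, second_number):\n"
    "    result = first_number + second_number\n"
    "    return result",
    1,
)]
# Perturbed variant: the same AI-generated code after a perturbation such as
# identifier renaming, used to probe detector robustness.
ai_code_perturbed = [("def f(x, y):\n    z = x + y\n    return z", 1)]


def evaluate(samples):
    """Compute AUROC of the detector scores against the ground-truth labels."""
    labels = [label for _, label in samples]
    scores = [score_code(code) for code, _ in samples]
    return roc_auc_score(labels, scores)


print("AUROC, original code: ", evaluate(human_code + ai_code))
print("AUROC, perturbed code:", evaluate(human_code + ai_code_perturbed))
```

A drop in the perturbed-code score relative to the original-code score would indicate the kind of robustness gap a benchmark like this is designed to expose.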
Primary Area: datasets and benchmarks
Submission Number: 13827