Gotta Catch'em All!: Multi-Generator and Multi-Lingual Benchmark for Detecting LLM-Generated Code Snippets

ACL ARR 2024 April Submission 718 Authors (anonymous)

16 Apr 2024 (modified: 13 May 2024) · ACL ARR 2024 April Submission · CC BY 4.0
Abstract: The advent of commercial services based on large language models (LLMs), such as ChatGPT and Copilot, has significantly enhanced software development productivity. Despite their widespread adoption and benefits, concerns have been raised regarding security vulnerabilities in LLM-generated code snippets, potential copyright and licensing infringements, and academic cheating. Recognizing the importance of detecting LLM-generated code snippets, we introduce the first benchmark, DeCo (Detecting Code generated by LLMs), aimed at addressing these challenges. DeCo comprises a dataset of 246K samples across four programming languages: C, C++, Java, and Python, generated by two commercial LLMs, ChatGPT and Gemini-Pro, and two open-source, code-specialized LLMs, WizardCoder and DeepSeek-Coder. We formulate two key tasks based on DeCo: (1) binary detection to discern whether a given code snippet was written by a human or an LLM, and (2) multi-class detection to identify the specific generator among humans and the four LLMs. We conduct extensive experiments evaluating 13 detection methods on the DeCo dataset.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: dataset creation, benchmarking, language resources, multilingual corpora, source code datasets, automatic evaluation of datasets, evaluation
Contribution Types: Data resources, Data analysis
Languages Studied: C, C++, Java, Python
Submission Number: 718