TL;DR: This paper introduces a novel benchmark to measure the correctness and security of LLM-generated code for backend applications.
Abstract: Automatic program generation has long been a fundamental challenge in computer science. Recent benchmarks have shown that large language models (LLMs) can effectively generate code at the function level, make code edits, and solve algorithmic coding tasks. However, to achieve full automation, LLMs should be able to generate production-quality, self-contained application modules. To evaluate the capabilities of LLMs in solving this challenge, we introduce BaxBench, a novel evaluation benchmark consisting of 392 tasks for the generation of backend applications. We focus on backends for three critical reasons: (i) they are practically relevant, building the core components of most modern web and cloud software, (ii) they are difficult to get right, requiring multiple functions and files to achieve the desired functionality, and (iii) they are security-critical, as they are exposed to untrusted third parties, making secure solutions that prevent deployment-time attacks an imperative. BaxBench validates the functionality of the generated applications with comprehensive test cases, and assesses their security exposure by executing end-to-end exploits. Our experiments reveal key limitations of current LLMs in both functionality and security: (i) even the best model, OpenAI o1, achieves a mere 62% on code correctness; (ii) on average, we could successfully execute security exploits on around half of the correct programs generated by each LLM; and (iii) in less popular backend frameworks, models further struggle to generate correct and secure applications. Progress on BaxBench signifies important steps towards autonomous and secure software development with LLMs.
Lay Summary: A longstanding goal of computer science is the automation of software generation. On current coding benchmarks, which often focus on short, algorithmic coding tasks, large language models (e.g., ChatGPT) appear to be making decisive progress in this direction.
In our paper, we examine this progress from a critical angle, constructing a benchmark that tests the models' ability to generate correct and secure stand-alone software backend modules. This is a key task in modern modular software development: correctness ensures that users do not encounter issues while using the software, and security ensures that the service and its users are protected from malicious actors.
To this end, we define 28 coding scenarios that ask for the implementation of modular tasks, such as a login, a calculator, or an email unsubscription module. We then task the models with implementing these 28 scenarios in 14 different backend web development frameworks, such as Python Django, JavaScript Nest, or Go Fiber. We test the correctness and security of the models' solutions by sending concrete inputs to the programmed modules.
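To illustrate the kind of black-box checks described above, the sketch below shows how a login scenario could be probed over HTTP with a functional test and a simple injection exploit. This is a minimal illustration, not the actual BaxBench harness: the port, endpoint paths, and payloads are assumptions made for the example.

# Hypothetical sketch of an end-to-end functionality and security check.
# Assumes the generated backend runs locally and exposes /register and
# /login endpoints; these names and the port are illustrative only.
import requests

BASE_URL = "http://localhost:5000"  # assumed address of the generated backend

def test_functionality() -> bool:
    """Functional test: a registered user can log in with the correct password."""
    requests.post(f"{BASE_URL}/register",
                  json={"username": "alice", "password": "s3cret"})
    resp = requests.post(f"{BASE_URL}/login",
                         json={"username": "alice", "password": "s3cret"})
    return resp.status_code == 200

def test_security() -> bool:
    """Security check: a classic SQL-injection payload must not grant access."""
    resp = requests.post(f"{BASE_URL}/login",
                         json={"username": "alice", "password": "' OR '1'='1"})
    return resp.status_code != 200  # secure only if the exploit is rejected

if __name__ == "__main__":
    print("functional:", test_functionality())
    print("secure:", test_security())

A solution counts as correct only if all functional tests pass, and as secure only if none of the exploit attempts succeed against it.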
Evaluating 11 LLMs on our benchmark, we find that none of them perform at a satisfactory level, with the models generating incorrect or insecure solutions more than 60% of the time. In our experiments, we observe that spending more time and resources on generating solutions shows promise in mitigating these issues, and we give concrete guidance to developers on enhancing the secure and correct coding capabilities of their models.
Link To Code: https://github.com/logic-star-ai/baxbench
Primary Area: Social Aspects->Security
Keywords: large language model, large language models, LLM, code generation, code security, security, benchmark
Submission Number: 16211