SWE-InfraBench: Evaluating Language Models on Cloud Infrastructure Code

Published: 24 Sept 2025, Last Modified: 24 Sept 2025
Venue: NeurIPS 2025 LLM Evaluation Workshop (Poster)
License: CC BY 4.0
Keywords: Infrastructure-as-Code, Cloud Computing, Code Generation, Dataset, Benchmarks, LLM
TL;DR: We introduce a benchmark of real AWS CDK projects to test LLMs on Infrastructure-as-Code editing, and find that even the best models solve only about a third of the tasks, though performance improves with multi-turn agentic approaches.
Abstract: Infrastructure-as-code (IaC) is critical for cloud reliability and scalability, yet LLM capabilities in this domain remain underexplored. Existing benchmarks focus on declarative tools like Terraform and on full-code generation. We introduce SWE-InfraBench, a dataset of realistic incremental edits to AWS CDK repositories drawn from real-world codebases. Each task requires modifying existing IaC based on natural language instructions, with correctness verified by passing tests. Results show current LLMs struggle: the best model (Sonnet 3.7) solves 34% of tasks, while reasoning models like DeepSeek R1 reach only 24%.
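For illustration only (not a task taken from the benchmark): a SWE-InfraBench-style edit might ask a model to modify an existing AWS CDK stack, for example enabling versioning and server-side encryption on an S3 bucket, with the result checked by the repository's tests. The stack and construct names below are hypothetical; the code is a minimal TypeScript sketch using standard aws-cdk-lib v2 APIs.

import { Stack, StackProps } from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';

// Hypothetical stack from an existing CDK repository. An instruction might read:
// "enable versioning and server-side encryption on the data bucket".
export class DataStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    new s3.Bucket(this, 'DataBucket', {
      versioned: true,                            // edit required by the instruction
      encryption: s3.BucketEncryption.S3_MANAGED, // edit required by the instruction
    });
  }
}

Correctness of such an edit is commonly checked with CDK assertion tests (e.g. Template.fromStack from aws-cdk-lib/assertions), which is in line with the benchmark's test-based verification, though the paper's exact test harness is not specified here.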
Submission Number: 108