SWE-InfraBench: Evaluating Language Models on Cloud Infrastructure Code

Published: 24 Sept 2025, Last Modified: 24 Sept 2025
Venue: NeurIPS 2025 LLM Evaluation Workshop (Poster)
License: CC BY 4.0
Keywords: Infrastructure-as-Code, Cloud Computing, Code Generation, Dataset, Benchmarks, LLM
TL;DR: We introduce a benchmark of real AWS CDK projects to test LLMs on Infrastructure-as-Code editing, and find that even the best models solve only about a third of the tasks, though performance improves with multi-turn agentic approaches.
Abstract: Infrastructure-as-code (IaC) is critical for cloud reliability and scalability, yet LLM capabilities in this domain remain underexplored. Existing benchmarks focus on declarative tools like Terraform and on full-code generation. We introduce SWE-InfraBench, a dataset of realistic incremental edits to AWS CDK repositories drawn from real-world codebases. Each task requires modifying existing IaC based on natural language instructions, with correctness verified by passing tests. Results show current LLMs struggle: the best model (Sonnet 3.7) solves 34% of tasks, while reasoning models like DeepSeek R1 reach only 24%.
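For illustration only (not a task taken from the benchmark): a SWE-InfraBench-style edit might ask a model to modify an existing AWS CDK stack, for example enabling versioning and server-side encryption on an S3 bucket, with the result checked by the repository's tests. The stack and construct names below are hypothetical; the code is a minimal TypeScript sketch using standard aws-cdk-lib v2 APIs.

import { Stack, StackProps } from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';

// Hypothetical stack from an existing CDK repository. An instruction might read:
// "enable versioning and server-side encryption on the data bucket".
export class DataStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    new s3.Bucket(this, 'DataBucket', {
      versioned: true,                            // edit required by the instruction
      encryption: s3.BucketEncryption.S3_MANAGED, // edit required by the instruction
    });
  }
}

Correctness of such an edit is commonly checked with CDK assertion tests (e.g. Template.fromStack from aws-cdk-lib/assertions), which is in line with the benchmark's test-based verification, though the paper's exact test harness is not specified here.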
Submission Number: 108