ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Oral · CC BY-NC-ND 4.0
TL;DR: Benchmark for IT automation tasks
Abstract: Realizing the vision of using AI agents to automate critical IT tasks depends on the ability to measure and understand the effectiveness of proposed solutions. We introduce ITBench, a framework that offers a systematic methodology for benchmarking AI agents on real-world IT automation tasks. Our initial release targets three key areas: Site Reliability Engineering (SRE), Compliance and Security Operations (CISO), and Financial Operations (FinOps). The design enables AI researchers to understand the challenges and opportunities of AI agents for IT automation through push-button workflows and interpretable metrics. ITBench includes an initial set of 102 real-world scenarios, which can be easily extended by community contributions. Our results show that agents powered by state-of-the-art models resolve only 11.4% of SRE scenarios, 25.2% of CISO scenarios, and 25.8% of FinOps scenarios (excluding anomaly detection). For FinOps-specific anomaly-detection (AD) scenarios, AI agents achieve an F1 score of 0.35. We expect ITBench to be a key enabler of AI-driven IT automation that is correct, safe, and fast. ITBench, along with a leaderboard and sample agent implementations, is available at https://github.com/ibm/itbench.
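The abstract reports the anomaly-detection result as an F1 score, the harmonic mean of precision and recall. A minimal sketch of how that metric is computed (the true-positive/false-positive/false-negative counts below are illustrative, not taken from the paper):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall over detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts chosen so the score matches the reported 0.35:
# 7 correct detections, 13 spurious alerts, 13 missed anomalies.
print(round(f1_score(7, 13, 13), 2))  # 0.35
```

An F1 of 0.35 therefore indicates that a substantial fraction of anomalies are either missed or falsely flagged, consistent with the paper's conclusion that current agents have considerable room for improvement.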
Lay Summary: Imagine using smart computer programs, called AI agents, to automatically handle complex IT tasks: keeping IT systems running smoothly (Site Reliability Engineering - SRE), ensuring security and compliance (Compliance and Security Operations - CISO), or managing technology spending (Financial Operations - FinOps). For this to become a reality, we need a reliable way to check whether these AI agents are actually good at these jobs. Our work, ITBench, addresses this challenge by providing the first comprehensive testing framework for IT automation agents. Think of it as a standardized assessment that evaluates AI performance across 102 real-world IT scenarios. The benchmark enables researchers to objectively compare different AI systems and measure their capabilities with precision. Initial tests using ITBench revealed that even the most advanced AI agents still have a long way to go: they successfully solved only about 11.4% of the site reliability problems, 25.2% of the security problems, and 25.8% of the financial management problems. By providing this testing framework, ITBench aims to help researchers and developers build better AI agents that can correctly, safely, and quickly automate IT tasks, ultimately making technology more reliable and efficient.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/ibm/itbench
Primary Area: Applications
Keywords: Benchmark, GenAI, Agents, IT Automation
Submission Number: 8021