# ITBench **[Paper](./it_bench_arxiv.pdf) | [Scenarios](#scenarios) | [Agents](#agents) | [How to Cite]()| [Contributors](#contributors) | [Contacts](#contacts)** # 📢 Announcements ## Latest Updates - **[February 28, 2025]** Limited Access Beta 🏆: Invite-only access to the ITBench hosted scenario environments. ITBench handles scenario deployment, agent evaluation, and leaderboard updates. To request access, e-mail us [here](agent-bench-automation@ibm.com). - **[February 7, 2025]** Initial release! 🎉 Includes research paper, self-hosted environment setup tooling, sample scenarios, and baseline agents. ## Coming Soon - **[April 2025]** Public Launch 🚀 Complete ITBench platform access opens to all. ## Overview The goal of ITBench is to measure the performance of AI agents across a wide variety of complex and real-life IT automation tasks targetting three key personas: - Site Reliability Engineering (SRE) - focusing on availability and resiliency - Financial Operations (FinOps) - focusing on enforcing cost efficiencies and optimizing return on investment - Compliance and Security Operations (CISO) - focusing on ensuring compliance and security of IT implementations ![sample_tasks](./images/sample_it_tasks.png) Through push-button workflows and interpretable metrics, it helps AI researchers and developers explore both the challenges and potential of IT automation. ITBench centers on two core principles: 1. Real-world representation of IT environments and incident scenarios that happen in such environments 2. Open, extensible framework with comprehensive IT coverage ITBench enables researchers and developers to replicate real-world incidents in Kubernetes environments (scenarios) and develop AI agents to address them. As of February 2025, we are open-sourcing: 1. Push-button deployment tooling for environment setup 2. Framework for recreating: * 6 SRE scenarios * 1 FinOps scenario * 4 categories of CISO scenarios 3. Two reference AI agents: * CISO (Chief Information Security Officer) Agent * SRE (Site Reliability Engineering) Agent ## Scenarios ITBench incorporates a collection of problems that we call scenarios. For example, one of the SRE scenarios in ITBench is to resolve a “High error rate on service order-management” in a Kubernetes environment. Another scenario that is relevant for the CISO persona involves assessing the compliance posture for a “new control rule detected for RHEL 9.” Each of the ITBench scenarios are deployed in an operational environment in which problem(s) occur. The scenarios can be found [here](https://github.com/IBM/ITBench-Scenarios). ## Agents Two baseline agents (SRE-FinOps and CISO) are being open-sourced with the ITBench. We use the open-source CrewAI framework to create and manage agents. The agents can be configured to use various LLMs either through watsonx, Azure, or vLLM. Each agent is initialized with a prompt that describes its goal, the context, the tasks, and the expected output format. In-context learning examples are included to guide the agent and demonstrate tool usage. Agents use natural language to access tools to interact with the environment for information gathering. ### CAA Agent Source code repository [here](https://github.com/IBM/itbench-ciso-caa-agent). ### SRE Agent Source code repository [here](https://github.com/IBM/itbench-sre-agent). ### How to Cite ``` @misc{jha2025itbench, title={ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks}, author={Jha, Saurabh and Arora, Rohan and Watanabe, Yuji and others}, year={2025}, url={https://github.com/IBM/itbench-sample-scenarios/blob/main/it_bench_arxiv.pdf} } ``` ## Contributors - Saurabh Jha - Rohan Arora - Yuji Watanabe - Takumi Yanagawa - Yinfang Chen (UIUC - University of Illinois at Urbana-Champaign) - Jackson Clark (UIUC - University of Illinois at Urbana-Champaign) - Bhavya Bhavya - Mudit Verma - Harshit Kumar - Hirokuni Kitahara - Noah Zheutlin - Saki Takano - Divya Pathak - Felix George - Xinbo Wu (UIUC - University of Illinois at Urbana-Champaign) - Bekir O Turkkan - Gerard Vanloo - Michael Nidd - Ting Dai - Oishik Chatterjee - Pranjal Gupta - Suranjana Samanta - Pooja Aggarwal - Rong Lee - Pavankumar Murali - Jae-wook Ahn - Debanjana Kar - Ameet Rahane - Carlos Fonseca - Amit Paradkar - Yu Deng - Pratibha Moogi - Prateeti Mohapatra - Naoki Abe - Chandrasekhar Narayanaswami - Tianyin Xu (UIUC - University of Illinois at Urbana-Champaign) - Lav R. Varshney (UIUC - University of Illinois at Urbana-Champaign) - Ruchi Mahindru - Anca Sailer - Laura Shwartz - Daby Sow - Nicholas C. M. Fuller - Ruchir Puri ## Contacts - agent-bench-automation@ibm.com - Saurabh Jha (saurabh.jha@ibm.com) - Yuji Wantabe (muew@jp.ibm.com) - Ruchi Mahindru (rmahindr@us.ibm.com) - Anca Sailer (ancas@us.ibm.com)