HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks

Published: 01 Mar 2026, Last Modified: 24 Apr 2026, Venue: ICLR 2026 AIWILD, Readers: Everyone, License: CC BY 4.0
Keywords: Healthcare Administration, Computer Use Agents, LLMs, AI
Abstract: Healthcare administration accounts for over \$1 trillion in annual spending, making it a promising target for LLM-based computer-use agents (CUAs). While clinical applications of LLMs have received significant attention, no benchmark exists for evaluating CUAs on end-to-end administrative workflows. To address this gap, we introduce **HealthAdminBench**, a benchmark comprising four realistic GUI environments—an EHR, two payer portals, and a fax system—and 135 expert-defined tasks spanning three administrative task types: Prior Authorization, Appeals and Denials Management, and Durable Medical Equipment (DME) Order Processing. Each task is decomposed into fine-grained, verifiable subtasks, yielding 1,698 evaluation points. We evaluate seven agent configurations under multiple prompting and observation settings and find that, despite strong subtask performance, end-to-end reliability remains low: the best-performing agent (Claude Opus 4.6 CUA) achieves only 36.3\% task success, while GPT-5.4 CUA attains the highest subtask success rate (82.8\%). These results reveal a substantial gap between current agent capabilities and the demands of real-world administrative workflows. **HealthAdminBench** provides a rigorous foundation for evaluating progress toward safe and reliable automation of healthcare administrative workflows. We release the benchmark, environments, and leaderboard at https://healthadminbench.stanford.edu.
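The gap between subtask and task success reported in the abstract can be illustrated with a small sketch. It assumes that a task counts as successful only when every one of its subtasks passes; the abstract does not state the exact aggregation rule, so this is an assumption, and the names `TaskResult`, `subtask_success_rate`, and `task_success_rate` are hypothetical, as is the toy data.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one task: a pass/fail flag per verifiable subtask (hypothetical schema)."""
    task_id: str
    subtask_passed: list[bool]


def subtask_success_rate(results: list[TaskResult]) -> float:
    """Fraction of all subtasks, pooled across tasks, that passed."""
    total = sum(len(r.subtask_passed) for r in results)
    passed = sum(sum(r.subtask_passed) for r in results)
    return passed / total if total else 0.0


def task_success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks whose subtasks ALL passed (assumed aggregation rule)."""
    if not results:
        return 0.0
    solved = sum(bool(r.subtask_passed) and all(r.subtask_passed) for r in results)
    return solved / len(results)


# Toy data, not taken from the benchmark: three tasks with four subtasks each.
toy = [
    TaskResult("prior-auth-1", [True, True, True, True]),
    TaskResult("appeal-1", [True, True, True, False]),
    TaskResult("dme-1", [True, True, False, True]),
]

print(f"subtask success: {subtask_success_rate(toy):.1%}")  # 83.3%
print(f"task success:    {task_success_rate(toy):.1%}")     # 33.3%
```

Under this all-or-nothing assumption, a single failed step voids an otherwise correct task, so a high subtask rate can coexist with a low task rate, and the agent with the best subtask rate need not be the one with the best task rate.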
Submission Number: 141