Keywords: AI agents, Benchmarks, Evaluations, Tool use, Economically valuable AI, Investment banking
TL;DR: We introduce a realistic benchmark of end-to-end investment banking workflows and show that today’s frontier AI agents, even with access to industry tools and data rooms, still fail to reliably complete these high-stakes tasks.
Abstract: AI agents are expected to revolutionize professional work, but a basic question remains open: how well can today’s frontier models complete end-to-end analytical workflows in economically high-value settings? We examine this question through the lens of investment banking, evaluating the performance of AI agents on tasks routinely performed by junior bankers. To ensure ecological validity, we collaborated with 175 investment bankers to develop an evaluation suite that replicates core features of their professional environment. Agents are assigned VP (Vice President)- and MD (Managing Director)-level requests; granted access to realistic data rooms and industry-standard tools (e.g., FactSet and SEC EDGAR); and required to produce multi-file deliverables, including financial models, slide decks, reports, and email summaries. Completing individual tasks required as much as 8 hours of banker time, highlighting the nontrivial labor investment and the economic stakes for agents seeking to perform them autonomously. Benchmarking eight frontier models, we find that current AI systems struggle to reliably complete these workflows: even the best-performing model in our study (Claude Opus 4.5) achieves only 33.8% success. Our error analysis identifies key obstacles (such as maintaining internal consistency across deliverables and their client readiness) and routes to economic value when deploying agentic AI in high-stakes professional domains.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 144