Keywords: AI agents, Benchmarks, Evaluations, Tool use, Economically valuable AI, Investment banking
TL;DR: We introduce a realistic benchmark of end-to-end investment banking workflows and show that today’s frontier AI agents, even with access to industry tools and data rooms, still fail to reliably complete these high-stakes tasks.
Abstract: AI agents are expected to revolutionize professional work, but a basic question remains open: how well can today’s frontier models complete end-to-end analytical workflows in economically high-value settings? We examine this question through the lens of investment banking, evaluating the performance of AI agents on tasks routinely performed by junior bankers. To ensure ecological validity, we collaborated with 175 investment bankers to develop an evaluation suite that replicates core features of their professional environment. Agents are assigned VP (Vice President)- and MD (Managing Director)-level requests; granted access to realistic data rooms and industry-standard tools (e.g., FactSet and SEC EDGAR); and required to produce multi-file deliverables, including financial models, slide decks, reports, and email summaries. Completing individual tasks required as much as 8 hours of banker time, highlighting the nontrivial labor investment and the economic stakes for agents seeking to perform them autonomously. Benchmarking eight frontier models, we find that current AI systems struggle to reliably complete these workflows: even the best-performing model in our study (Claude Opus 4.5) achieves only 33.8% success. Our error analysis identifies key obstacles (such as maintaining internal consistency across deliverables and their client readiness) and routes to economic value when deploying agentic AI in high-stakes professional domains.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 144