From Tasks to Teams: A Risk-First Evaluation Framework for Multi-Agent LLM Systems in Finance

Published: 01 Jul 2025, Last Modified: 11 Jul 2025 · ICML 2025 R2-FM Workshop Oral · CC BY 4.0
Keywords: multi-agent LLMs, safety-aware evaluation, risk auditing, financial A
Abstract: Current financial benchmarks reward large language models (LLMs) for task accuracy and portfolio return, yet remain blind to the risks that emerge once several agents cooperate, share tools, and act on real money. We present M-SAEA, a Multi-agent, Safety-Aware Evaluation Agent that audits an entire team of LLM agents without fine-tuning. M-SAEA issues ten zero-shot probes spanning four layers (model, workflow, interaction, and system) and returns a continuous [0, 100] risk vector plus a natural-language rationale. Across three high-impact task clusters (finance management, webshop automation, transactional services) and six popular models, M-SAEA (i) detects most unsafe trajectories while raising false alarms on only a small number of safe ones; (ii) exposes latent hazards (temporal staleness, cross-agent race conditions, API-stress fragility) that leaderboard metrics never flag; and (iii) produces actionable, fine-grained scores that let practitioners trade off latency against safety before deployment. By turning safety into a measurable, model-agnostic quantity, M-SAEA shifts the evaluation focus from tasks to teams and provides a ready-to-use template for risk-first assessment of agentic AI in finance and beyond.
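To make the evaluation loop concrete, here is a minimal sketch of the auditing interface the abstract describes: zero-shot probes grouped into the four layers, each yielding a score on a continuous [0, 100] scale plus a rationale. All names (`Probe`, `RiskReport`, `run_probe`, `audit_team`) are illustrative assumptions, not the authors' actual API; the probe logic is stubbed out where a real evaluator would prompt an LLM judge.

```python
# Hypothetical sketch of an M-SAEA-style audit, based only on the abstract:
# ten zero-shot probes across four layers, no fine-tuning, clamped [0, 100]
# scores with natural-language rationales.
from dataclasses import dataclass, field

LAYERS = ("model", "workflow", "interaction", "system")

@dataclass
class Probe:
    name: str   # e.g. "temporal_staleness"
    layer: str  # one of LAYERS

@dataclass
class RiskReport:
    scores: dict[str, float] = field(default_factory=dict)   # probe -> [0, 100]
    rationale: dict[str, str] = field(default_factory=dict)  # probe -> explanation

def run_probe(probe: Probe, trajectory: list[str]) -> tuple[float, str]:
    """Placeholder for a zero-shot LLM judgment over a team trajectory.
    A real evaluator would prompt a judge model and parse its score;
    a dummy value is returned here so the sketch stays runnable."""
    return 0.0, f"No {probe.name} risk detected across {len(trajectory)} steps."

def audit_team(probes: list[Probe], trajectory: list[str]) -> RiskReport:
    """Run every probe over the shared agent-team trajectory."""
    report = RiskReport()
    for probe in probes:
        assert probe.layer in LAYERS, f"unknown layer: {probe.layer}"
        score, why = run_probe(probe, trajectory)
        report.scores[probe.name] = max(0.0, min(100.0, score))  # clamp to [0, 100]
        report.rationale[probe.name] = why
    return report

if __name__ == "__main__":
    probes = [Probe("temporal_staleness", "model"),
              Probe("cross_agent_race", "interaction"),
              Probe("api_stress_fragility", "system")]
    print(audit_team(probes, ["agent_a: fetch quote", "agent_b: place order"]))
```

A per-probe vector of this shape is what lets practitioners weigh individual hazards (say, race conditions versus API fragility) against latency before deployment, rather than relying on a single pass/fail verdict.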
Submission Number: 180