Beyond Score: A Multi-Agent System to Discover Capability and Behavioral Weaknesses in LLMs

ICLR 2026 Conference Submission 20198 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: LLM evaluation, multi-agent
Abstract: A key task for researchers working on large language models (LLMs) is to compare the benchmark results and behaviors of different models in order to identify model weaknesses and guide further improvement. However, as LLMs are applied in a widening range of scenarios and the number of benchmarks continues to grow, accurately identifying weaknesses becomes harder. In addition, with the emergence of reasoning LLMs, researchers must also analyze the chain-of-thought (CoT) behaviors of models to gain insight, which makes analyzing model capabilities directly from benchmark results more onerous and less reliable. To address these issues, we propose Agent4Weakness, a framework that uses multi-agent collaboration to generate LLM evaluation reports tailored to user requirements. Specifically, Agent4Weakness employs multiple mainstream LLMs for evaluation and comparison and incorporates professional statistical tools to provide richer statistical insights. Agent4Weakness also features a dedicated agent that extracts relevant information from the results according to user requirements, so that the final analysis matches user needs. We show that reports generated by Agent4Weakness improve on the baseline by $2.6$ out of $10$ across four dimensions and are highly consistent with human evaluations, indicating the high quality of the reports. Furthermore, addressing the weaknesses identified in the reports yields a performance gain of $3.7$, demonstrating the practical value of Agent4Weakness.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 20198