Keywords: LLM evaluation, multi-agent
Abstract: The rapid evolution of reasoning-intensive Large Language Models renders traditional metrics insufficient, as they mask fine-grained failures and implicit pathologies.
Existing weakness discovery methods typically rely on rigid pipelines, yielding superficial insights that lack the diagnostic depth required for effective model improvement. To address this, we introduce Agent4Weakness, a multi-agent framework designed to replicate the rigorous workflow of human expert analysts. By integrating a Domain-Aware Memory for contextual reasoning grounded in professional evaluation knowledge and a Tool Abstraction mechanism for decoupled data analysis, Agent4Weakness transforms raw evaluation traces into grounded, actionable reports. We validate our framework through an extensive study involving $104$ models across $27$ benchmarks. Experimental results demonstrate that Agent4Weakness produces diagnostic reports significantly superior to those of competitive baselines.
Crucially, leveraging these insights for prompt guidance yields an average $3.7$-point performance gain and establishes a closed-loop optimization paradigm.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation methodologies, LLM agents
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 6168