Keywords: LLM evaluation, multi-agent
Abstract: The rapid evolution of reasoning-intensive Large Language Models renders traditional metrics insufficient, as they mask fine-grained failures and implicit pathologies.
Existing weakness discovery methods typically rely on rigid pipelines, yielding superficial insights that lack the diagnostic depth required for effective model improvement. To address this, we introduce Agent4Weakness, a multi-agent framework designed to replicate the rigorous workflow of human expert analysts. By integrating a Domain-Aware Memory for contextual reasoning grounded in professional evaluation knowledge and a Tool Abstraction mechanism for decoupled data analysis, Agent4Weakness transforms raw evaluation traces into grounded, actionable reports. We validate our framework through an extensive study involving $104$ models across $27$ benchmarks. Experimental results demonstrate that Agent4Weakness produces diagnostic reports significantly superior to those of competitive baselines.
Crucially, leveraging these insights for prompt guidance yields an average $3.7$-point performance gain and establishes a closed-loop optimization paradigm.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation methodologies, LLM agents
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 6168