Keywords: LLM evaluation, multi-agent
Abstract: A key task for researchers working on large language models (LLMs) is to compare the results and behavioral performance of different models, thereby identifying model weaknesses and enabling further model improvements.
However, as LLMs are applied in an increasing range of scenarios and the number of benchmarks continues to grow, the difficulty of accurately identifying weaknesses increases.
Additionally, with the emergence of Reasoning LLMs, researchers must analyze models' chain-of-thought (CoT) behaviors to gain insights, which makes directly analyzing model capabilities from benchmark evaluation results even more onerous and unreliable.
To address these issues, we propose Agent4Weakness, a framework that uses multi-agent collaboration to generate LLM evaluation reports tailored to user requirements.
Specifically, Agent4Weakness employs multiple mainstream LLMs for evaluation and comparison, incorporating professional statistical tools to provide richer statistical insights.
In addition, Agent4Weakness features a dedicated agent that extracts relevant information from the results according to user requirements, ensuring the final analysis is tailored to user needs.
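To make the described workflow concrete, the following is a minimal illustrative sketch of a multi-agent report-generation pipeline in this spirit; the agent roles, prompts, and the `call_llm` stub are assumptions for exposition, not the authors' implementation.

```python
# Illustrative sketch only: agent roles, prompts, and call_llm are hypothetical.
from dataclasses import dataclass
from statistics import mean, stdev


def call_llm(prompt: str) -> str:
    """Placeholder for a call to any chat-completion API (assumed interface)."""
    return f"[LLM response to: {prompt[:60]}...]"


@dataclass
class BenchmarkResult:
    model: str
    benchmark: str
    score: float


class StatisticsAgent:
    """Summarizes raw benchmark scores with simple statistics before analysis."""

    def run(self, results: list[BenchmarkResult]) -> dict[str, dict[str, float]]:
        by_model: dict[str, list[float]] = {}
        for r in results:
            by_model.setdefault(r.model, []).append(r.score)
        return {
            m: {"mean": mean(s), "std": stdev(s) if len(s) > 1 else 0.0}
            for m, s in by_model.items()
        }


class RequirementAgent:
    """Filters the statistics according to the user's stated requirement."""

    def run(self, stats: dict, requirement: str) -> str:
        prompt = (
            f"User requirement: {requirement}\n"
            f"Per-model statistics: {stats}\n"
            "Extract only the information relevant to the requirement."
        )
        return call_llm(prompt)


class ReportAgent:
    """Drafts the final weakness-analysis report from the filtered evidence."""

    def run(self, evidence: str) -> str:
        return call_llm(f"Write a weakness-analysis report based on:\n{evidence}")


def generate_report(results: list[BenchmarkResult], requirement: str) -> str:
    stats = StatisticsAgent().run(results)
    evidence = RequirementAgent().run(stats, requirement)
    return ReportAgent().run(evidence)


if __name__ == "__main__":
    demo = [
        BenchmarkResult("model-a", "math", 61.2),
        BenchmarkResult("model-a", "code", 48.5),
        BenchmarkResult("model-b", "math", 55.0),
        BenchmarkResult("model-b", "code", 57.3),
    ]
    print(generate_report(demo, "Where is model-a weakest relative to model-b?"))
```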
We show that reports generated by Agent4Weakness improve on the baseline by $2.6$ points (on a $10$-point scale) across four dimensions and are highly consistent with human evaluations, demonstrating the high quality of the reports.
Furthermore, addressing the weaknesses identified in Agent4Weakness's reports yields a $3.7$-point performance gain, demonstrating the framework's significant practical value.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 20198