Keywords: large language model, automatic evaluation, evaluation methodologies
TL;DR: We propose a novel evaluation paradigm, Agent-as-Interviewer, and develop a knowledge-wise dynamic evaluation framework, JudgeAgent, based on this paradigm to dynamically evaluate LLMs and provide valuable suggestions for optimizing target LLMs.
Abstract: Current evaluation paradigms for large language models (LLMs) suffer from overestimated or biased evaluations and mismatched question difficulty, leading to incomplete assessments of LLMs' knowledge and capability boundaries, which hinders the effective application and optimization of LLMs.
To address these challenges, we propose Agent-as-Interviewer, a dynamic evaluation paradigm that employs LLM agents to conduct multi-turn interactions for evaluation.
Unlike current benchmarking or dynamic interaction paradigms, Agent-as-Interviewer utilizes agents that invoke knowledge tools to draw on broader and deeper knowledge during dynamic multi-turn question generation, enabling a more complete evaluation of an LLM's knowledge boundaries.
It also leverages agents to plan query strategies that adjust question difficulty, improving difficulty control so that questions match the actual capabilities of target LLMs.
Based on this paradigm, we develop JudgeAgent, a knowledge-wise dynamic evaluation framework that employs knowledge-driven synthesis as the agent's tool and difficulty scoring as strategy guidance, ultimately providing valuable suggestions that help target models optimize themselves.
Extensive experiments validate the effectiveness of JudgeAgent's suggestions, demonstrating that Agent-as-Interviewer can accurately identify the knowledge and capability boundaries of target models.
The source code is available at https://anonymous.4open.science/r/JudgeAgent.
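To make the paradigm concrete, below is a minimal, illustrative sketch of an Agent-as-Interviewer-style loop: an interviewer agent retrieves knowledge, synthesizes a question, queries the target model, scores the answer, and adjusts difficulty across turns. All function names and the scoring logic here are hypothetical placeholders, not the actual JudgeAgent API; see the linked repository for the real implementation.

```python
# Hypothetical sketch of a multi-turn interviewer loop.
# Placeholder functions stand in for the knowledge tool, question
# synthesis, target-model call, and difficulty scoring.

from dataclasses import dataclass, field


@dataclass
class InterviewState:
    topic: str
    difficulty: int = 1                      # current difficulty level
    transcript: list = field(default_factory=list)


def retrieve_knowledge(topic: str, difficulty: int) -> str:
    """Placeholder knowledge tool: would fetch broader/deeper facts on the topic."""
    return f"background facts about {topic} at level {difficulty}"


def synthesize_question(knowledge: str, difficulty: int) -> str:
    """Placeholder knowledge-driven synthesis of a question at the given difficulty."""
    return f"[difficulty {difficulty}] Question grounded in: {knowledge}"


def ask_target(question: str) -> str:
    """Placeholder call to the target LLM under evaluation."""
    return "target model's answer"


def score_answer(question: str, answer: str) -> float:
    """Placeholder scoring of the answer, used as strategy guidance."""
    return 0.5


def adjust_difficulty(difficulty: int, score: float) -> int:
    """Raise difficulty when the target answers well, lower it otherwise."""
    return difficulty + 1 if score > 0.7 else max(1, difficulty - 1)


def interview(topic: str, turns: int = 3) -> InterviewState:
    """Run a multi-turn interview, recording (question, answer, score) per turn."""
    state = InterviewState(topic=topic)
    for _ in range(turns):
        knowledge = retrieve_knowledge(state.topic, state.difficulty)
        question = synthesize_question(knowledge, state.difficulty)
        answer = ask_target(question)
        score = score_answer(question, answer)
        state.transcript.append((question, answer, score))
        state.difficulty = adjust_difficulty(state.difficulty, score)
    return state


if __name__ == "__main__":
    result = interview("photosynthesis")
    for question, answer, score in result.transcript:
        print(question, "->", score)
```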
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16891