DynaEval: A Dynamic Interaction-based Evaluation Framework for Assessing LLMs in Real-world Scenarios

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: large language model, evaluation, game theory, code generation, machine translation, multi-agent system
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: This paper proposes a dynamic interaction-based evaluation framework that utilizes properties of dynamic games in game theory to overcome important challenges that widely exist in the evaluation of LLMs in real-world scenarios.
Abstract: Large language models (LLMs) have shown significant advancements in diverse real-world applications, underscoring the necessity for comprehensive evaluation methodologies. Existing research on LLM evaluation usually concentrates on supervised signal-based benchmarks for domain-specific tasks, which utilize static labeled datasets to evaluate the abilities of LLMs. However, these methods often fall short when evaluating LLMs in dynamic real-world scenarios, which can be viewed as goal-driven multi-agent scenarios. In these scenarios, agents have to repeatedly obtain feedback and improve their outputs through cooperative or adversarial interactions in order to gradually reach their goals. To address this problem, inspired by game theory, we propose a novel dynamic interaction-based LLM evaluation framework (DynaEval) for evaluating the abilities of LLMs in dynamic real-world scenarios. Specifically, we first standardize the definition of the interaction process in dynamic real-world scenarios. Next, we prove that interaction processes in evaluation tasks are equivalent to a class of dynamic games in game theory, which is beneficial to the fairness and stability of evaluation. Building on these properties, we design the message pool and LLM-based referee components of DynaEval, which leverage dynamic games to ensure fairness and stability throughout the interaction and evaluation process. Moreover, we propose the synchronous interaction algorithm, which accommodates the various forms of interaction found in real-world tasks. Finally, we demonstrate the effectiveness of DynaEval through extensive experiments across four interaction-based evaluation tasks stemming from real-world scenarios. Our source code is available at https://anonymous.4open.science/r/DynaEval-112F.
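
To make the abstract's description concrete, below is a minimal sketch of how a message pool, an LLM-based referee, and a synchronous interaction loop could fit together. All names (Message, MessagePool, synchronous_interaction, and the toy agents) are hypothetical illustrations, not the authors' actual implementation; the real framework is in the linked repository.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Message:
    sender: str
    round: int
    content: str

@dataclass
class MessagePool:
    """Shared buffer: messages posted in round t become visible only from
    round t+1 on, so agents acting in the same round cannot peek at each
    other's current outputs (a fairness property of the dynamic game)."""
    messages: List[Message] = field(default_factory=list)

    def post(self, msg: Message) -> None:
        self.messages.append(msg)

    def visible_to(self, agent: str, current_round: int) -> List[Message]:
        # An agent sees all messages from earlier rounds plus its own history.
        return [m for m in self.messages
                if m.round < current_round or m.sender == agent]

def synchronous_interaction(
    agents: Dict[str, Callable[[List[Message]], str]],
    referee: Callable[[List[Message]], Dict[str, float]],
    num_rounds: int,
) -> Dict[str, float]:
    """Run all agents in lockstep for num_rounds, then score the transcript."""
    pool = MessagePool()
    for t in range(num_rounds):
        # Collect every agent's output before posting any of them, so all
        # agents in round t act on the same information set.
        outputs = {name: agent(pool.visible_to(name, t))
                   for name, agent in agents.items()}
        for name, content in outputs.items():
            pool.post(Message(sender=name, round=t, content=content))
    # The referee (in DynaEval, itself LLM-based) scores the full interaction.
    return referee(pool.messages)

if __name__ == "__main__":
    # Toy stand-ins for LLM agents and an LLM-based referee.
    agents = {
        "llm_a": lambda seen: f"A's move after seeing {len(seen)} messages",
        "llm_b": lambda seen: f"B's move after seeing {len(seen)} messages",
    }
    referee = lambda msgs: {"llm_a": 0.5, "llm_b": 0.5}  # placeholder scores
    print(synchronous_interaction(agents, referee, num_rounds=3))
```

The key design point the sketch illustrates is the synchronous update: deferring visibility of same-round messages is what makes cooperative and adversarial interactions comparable under one loop.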
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5534