An Evaluation System for Large Language Models based on Open-Ended Questions

Published: 01 Jan 2024, Last Modified: 16 Apr 2025, CSCloud 2024, CC BY-SA 4.0
Abstract: We present an evaluation system for large language models (LLMs) based on open-ended questions. The system performs multidimensional evaluation of LLMs using open-ended questions and presents its results as evaluation reports. Current evaluations of large language models often suffer from two prominent limitations: (1) they rely on a single evaluation method, which makes the results less credible; and (2) most are based on datasets of closed-ended questions, treating generative LLMs as discriminative models and thus failing to reflect the high output flexibility characteristic of these models. To address these limitations, we propose an evaluation system for LLMs based on open-ended questions. Experiments on adapted open-source datasets demonstrate the effectiveness of the system. The code is released at https://github.com/JerryMazeyu/GreatLibrarian.