Efficiently Measuring the Cognitive Ability of LLMs: An Adaptive Testing Perspective

24 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: large language model, adaptive testing, model evaluation
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: This paper allows for a more accurate estimation of the LLM's abilities, using fewer questions.
Abstract: Large language models (LLMs), like ChatGPT, have shown human-level cognitive ability. Benchmarks from various fields (e.g., Literature, Biology, and Psychology) are often used to measure an LLM's ability, reporting standard metrics such as accuracy, recall, and F1. However, such methods for evaluating LLMs can be inefficient and inaccurate from a cognitive science perspective. Inspired by Computerized Adaptive Testing (CAT) used in psychometrics, we propose an adaptive testing framework for LLM evaluation. Rather than using a standard test set and simply reporting accuracy, this approach dynamically adjusts the characteristics of the test questions, such as difficulty, based on the model's performance. This allows for a more accurate estimation of the model's abilities using fewer questions. More importantly, it allows LLMs to be compared with humans easily, which is essential for NLP models that aim for human-level ability. Our diagnostic reports show that ChatGPT often behaves like a "careless student", prone to slips and occasional guesses. We conduct a fine-grained diagnosis of six commercial instruction-tuned LLMs and rank them across three aspects: Subject Knowledge, Mathematical Reasoning, and Programming, where GPT4 significantly outperforms the other models and reaches the cognitive ability of middle-level students. Different tests for different models, using efficient adaptive testing: we believe this will become the new norm in large language model evaluation.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9005
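
The abstract describes the core mechanism of Computerized Adaptive Testing: items are selected one at a time according to the current ability estimate, which is then updated from the model's responses. The sketch below only illustrates that loop under the standard two-parameter logistic (2PL) IRT model; the item-bank format, the `ask_model` callable, and all parameter names are assumptions for illustration, and it is not the paper's actual framework (which also accounts for slip and guess behaviour).

```python
import numpy as np

def p_correct(theta, a, b):
    # 2PL IRT: probability that an examinee with ability theta answers
    # an item with discrimination a and difficulty b correctly.
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    # Fisher information of an item at the current ability estimate.
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def estimate_ability(responses, grid=np.linspace(-4.0, 4.0, 401)):
    # Maximum-likelihood ability estimate over a grid, given a list of
    # (a, b, correct) tuples for the items answered so far.
    log_lik = np.zeros_like(grid)
    for a, b, correct in responses:
        p = p_correct(grid, a, b)
        log_lik += np.log(p if correct else 1.0 - p)
    return grid[np.argmax(log_lik)]

def adaptive_test(item_bank, ask_model, max_items=20):
    # Minimal CAT loop: pick the most informative remaining item at the
    # current ability estimate, query the LLM, then re-estimate ability.
    theta, responses, remaining = 0.0, [], list(item_bank)
    for _ in range(min(max_items, len(remaining))):
        idx = max(range(len(remaining)),
                  key=lambda i: item_information(theta, remaining[i]["a"], remaining[i]["b"]))
        item = remaining.pop(idx)
        correct = ask_model(item)  # hypothetical oracle: queries the LLM, returns True/False
        responses.append((item["a"], item["b"], correct))
        theta = estimate_ability(responses)
    return theta
```

A grid-based maximum-likelihood estimate keeps the sketch short; practical CAT systems typically use Bayesian (EAP) ability estimation and add item-exposure control, but the adaptive select-ask-update loop is the same.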