Abstract: Evaluation of LLMs has primarily relied on comparison against "gold" answers, whose curation often takes months or years and is hence difficult to scale. Instead of relying on these supervised approaches that aim to rank LLMs, we propose to assess models by measuring and identifying the significance of their differences. This reduces the difficult supervised problem to an unsupervised task, saving substantial labeling costs. More specifically, we introduce the notion of topic-categorized distinguisher questions that expose key behavioral differences and hence define distances between LLMs. We design a suite of algorithmic techniques for finding these distinguishers and make three major innovations: (i) a new correlation specification on objective functions based on topic trees and the earth-mover distance (EMD) of topics, (ii) a theoretically sound embedding technique that bridges the EMD induced by topics and the $\ell_2$ space used in Bayesian optimization (BO), and (iii) a Siamese-net-based model, leveraging our theoretical results, that effectively interfaces topics with BO in practice. Our experiments show the efficacy of our new algorithms, their power to distinguish LLMs on medical topics, and their application to unsupervised ranking.
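To make the "EMD of topics" ingredient concrete, the following is a minimal sketch (not the paper's implementation) of computing EMD between two topic distributions over a topic tree. It uses the standard tree-Wasserstein identity, under which the 1-Wasserstein distance on a tree metric equals the sum over edges of edge length times the absolute probability-mass imbalance in the subtree below that edge; all node names, edge lengths, and distributions here are hypothetical.

def tree_emd(children, edge_len, mass_a, mass_b, root):
    """EMD between two distributions over the nodes of a topic tree.

    children: dict node -> list of child nodes
    edge_len: dict node -> length of the edge to the node's parent
    mass_a, mass_b: dict node -> probability mass (missing nodes get 0)
    """
    total = 0.0

    def subtree_diff(node):
        # Net mass difference (a - b) in the subtree rooted at `node`.
        nonlocal total
        diff = mass_a.get(node, 0.0) - mass_b.get(node, 0.0)
        for child in children.get(node, []):
            diff += subtree_diff(child)
        if node != root:
            # Any mass imbalance below this edge must be transported across it.
            total += edge_len[node] * abs(diff)
        return diff

    subtree_diff(root)
    return total

# Tiny hypothetical topic tree: a "medicine" root with two subtopics.
children = {"medicine": ["cardiology", "oncology"]}
edge_len = {"cardiology": 1.0, "oncology": 1.0}
p = {"cardiology": 0.7, "oncology": 0.3}  # topic profile of one model's distinguishers
q = {"cardiology": 0.4, "oncology": 0.6}  # topic profile of another model's
print(tree_emd(children, edge_len, p, q, root="medicine"))  # 0.6

The identity avoids solving a transport linear program: on a tree, the optimal plan simply routes each subtree's mass imbalance across the single edge connecting it to the rest of the tree.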
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: graph-based methods, optimization methods, active learning, automatic evaluation
Contribution Types: Approaches to low-resource settings, Theory
Languages Studied: English
Submission Number: 275