Abstract: Evaluation of LLMs has primarily relied on comparison against "gold" answers, whose curation often takes months or years and is hence difficult to scale. Instead of relying on these supervised approaches that aim to rank LLMs, we propose to assess models by measuring and identifying the significance of their differences. This reduces the difficult supervised problem to an unsupervised task, saving substantial labeling costs. More specifically, we introduce the notion of topic-categorized distinguisher questions that expose key behavioral differences and hence define distances between LLMs. We design a suite of algorithmic techniques for finding these distinguishers and make three major innovations: (i) a new correlation specification on objective functions based on topic trees and the earth-mover distance (EMD) of topics, (ii) a theoretically sound embedding technique that bridges the EMD induced by topics and the $\ell_2$ space used in Bayesian optimization (BO), and (iii) a Siamese-net-based model, leveraging our theoretical results, that effectively interfaces topics with BO in practice. Our experiments show the efficacy of our new algorithms, their power to distinguish LLMs on medical topics, and their application to unsupervised ranking.
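To make the "EMD of topics" ingredient concrete, the following is a minimal sketch (not the paper's implementation) of computing EMD between two topic distributions over a topic tree. It uses the standard tree-Wasserstein identity, under which the 1-Wasserstein distance on a tree metric equals the sum over edges of edge length times the absolute probability-mass imbalance in the subtree below that edge; all node names, edge lengths, and distributions here are hypothetical.

def tree_emd(children, edge_len, mass_a, mass_b, root):
    """EMD between two distributions over the nodes of a topic tree.

    children: dict node -> list of child nodes
    edge_len: dict node -> length of the edge to the node's parent
    mass_a, mass_b: dict node -> probability mass (missing nodes get 0)
    """
    total = 0.0

    def subtree_diff(node):
        # Net mass difference (a - b) in the subtree rooted at `node`.
        nonlocal total
        diff = mass_a.get(node, 0.0) - mass_b.get(node, 0.0)
        for child in children.get(node, []):
            diff += subtree_diff(child)
        if node != root:
            # Any mass imbalance below this edge must be transported across it.
            total += edge_len[node] * abs(diff)
        return diff

    subtree_diff(root)
    return total

# Tiny hypothetical topic tree: a "medicine" root with two subtopics.
children = {"medicine": ["cardiology", "oncology"]}
edge_len = {"cardiology": 1.0, "oncology": 1.0}
p = {"cardiology": 0.7, "oncology": 0.3}  # topic profile of one model's distinguishers
q = {"cardiology": 0.4, "oncology": 0.6}  # topic profile of another model's
print(tree_emd(children, edge_len, p, q, root="medicine"))  # 0.6

The identity avoids solving a transport linear program: on a tree, the optimal plan simply routes each subtree's mass imbalance across the single edge connecting it to the rest of the tree.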
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: graph-based methods, optimization methods, active learning, automatic evaluation
Contribution Types: Approaches to low-resource settings, Theory
Languages Studied: English
Submission Number: 275