BehaviorBox: Automated Behavioral Comparison of Language Models

ACL ARR 2025 February Submission 7500 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Language model evaluation is a daunting task: prompts are brittle, corpus-level perplexities are vague, and the choice of benchmarks is endless. Choosing examples that reveal meaningful, generalizable differences between two LMs is crucial to understanding where one model succeeds and another fails. Can this process be done automatically? In this work, we propose a methodology for automated behavioral comparison of language models that uses performance-aware contextual embeddings to find fine-grained features of text where one LM outperforms another. Our method, which we name BehaviorBox, extracts coherent features that also exhibit statistically significant differences in ease of generation between the two LMs. We apply BehaviorBox to compare models that vary in size, model family, and post-training, and report insights into specific contexts that illustrate meaningful differences in performance.
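To make the described pipeline concrete, below is a minimal, hypothetical sketch of the kind of comparison the abstract outlines: score a shared corpus under two LMs, attach each token's log-probability gap to its contextual embedding (a "performance-aware" embedding), group tokens into candidate features, and test each group's gap for statistical significance. Every concrete choice here is an illustrative assumption, not the paper's implementation: the GPT-2 checkpoints, the use of last-layer hidden states as contextual embeddings, and the k-means clustering plus Wilcoxon signed-rank test standing in for the actual feature-discovery and testing procedures.

```python
# Hypothetical sketch of a BehaviorBox-style comparison; all modeling
# choices below are assumptions for illustration, not the paper's method.
# Assumes the two models share a tokenizer (e.g., two sizes of one family)
# so per-token log-probabilities align position by position.
import numpy as np
import torch
from scipy.stats import wilcoxon
from sklearn.cluster import KMeans
from transformers import AutoModelForCausalLM, AutoTokenizer

def score_and_embed(model_name: str, texts: list[str]):
    """Per-token log-probs and last-layer context embeddings for a corpus."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    logps, embs = [], []
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, output_hidden_states=True)
        logp = torch.log_softmax(out.logits[0, :-1], dim=-1)
        idx = torch.arange(ids.shape[1] - 1)
        logps.append(logp[idx, ids[0, 1:]].numpy())        # log p(token | prefix)
        embs.append(out.hidden_states[-1][0, :-1].numpy())  # context of each prediction
    return np.concatenate(logps), np.concatenate(embs, axis=0)

# Toy corpus; a real comparison would use a large, diverse corpus.
texts = ["The cat sat on the mat.", "Gradient descent minimizes the loss."]
lp_a, emb = score_and_embed("gpt2", texts)       # model A also supplies embeddings
lp_b, _ = score_and_embed("gpt2-medium", texts)  # model B shares the GPT-2 tokenizer

gap = lp_a - lp_b                           # > 0 where model A finds the token easier
features = np.hstack([emb, gap[:, None]])   # performance-aware token embedding

# Cluster tokens into candidate features, then test each cluster's gap.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(features)
for k in range(5):
    g = gap[labels == k]
    if len(g) > 10 and np.any(g != 0):      # Wilcoxon needs nonzero paired diffs
        stat, p = wilcoxon(g)               # paired test: is the median gap nonzero?
        print(f"feature {k}: n={len(g)}, mean gap={g.mean():+.3f}, p={p:.3g}")
```

On the toy two-sentence corpus, few clusters will pass the size guard; the per-feature significance tests only become meaningful at corpus scale, and in practice one would also normalize the embedding and gap dimensions before clustering.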
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: free-text/natural language explanations, feature attribution
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 7500