Evaluating Human-Language Model Interaction

Mina Lee; Megha Srivastava; Amelia Hardy; John Thickstun; Esin Durmus; Ashwin Paranjape; Ines Gerard-Ursin; Xiang Lisa Li; Faisal Ladhak; Frieda Rong; Rose E Wang; Minae Kwon; Joon Sung Park; Hancheng Cao; Tony Lee; Rishi Bommasani; Michael S. Bernstein; Percy Liang

Published: 10 Sept 2023, Last Modified: 17 Sept 2024. Accepted by TMLR.

Abstract: Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in that a model produces output without human involvement. To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems and dimensions to consider when designing evaluation metrics. Compared to standard, non-interactive evaluation, HALIE captures (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality (e.g., enjoyment and ownership). We then design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation. With four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21 Labs' Jurassic-1), we find that better non-interactive performance does not always translate to better human-LM interaction. In particular, we highlight three cases where the results from non-interactive and interactive metrics diverge and underscore the importance of human-LM interaction for LM evaluation.

Submission Length: Long submission (more than 12 pages of main content)

Changes Since Last Submission: The changes are colored in blue and made in the following sections:
* 1. Introduction - added a summary of contributions
* 2.1. Solving Tasks Interactively - added rationales for the five tasks
* 2.2. Constructing an Interactive System - added clarification on system logic vs. RL
* 2.3. Evaluating Human-LM Interaction - added a note to clarify the novelty of dimensions
* 2.4. Guideline - added guidelines for applying the framework
* 3.3. Question Answering - added one sentence to distinguish potentially different use cases for non-interactive vs. interactive settings
* 5.1. General Challenges in Human Evaluation - added a paragraph on subjectivity
* 5.3. Limitations Specific to Our Experiment Design - added a paragraph on model selection

Code: https://github.com/stanford-crfm/halie

Assigned Action Editor: ~Yonatan_Bisk1

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Submission Number: 1343
