Uncovering Gaps in How Humans and LLMs Interpret Subjective Language

Published: 22 Jan 2025, Last Modified: 02 Mar 2025
Venue: ICLR 2025 Spotlight
License: CC BY 4.0
Keywords: safety, alignment, constitutional ai, language model failures, misalignment, automated evaluation, automated red-teaming
Abstract: Humans often rely on subjective natural language to direct large language models (LLMs); for example, users might instruct the LLM to write an *enthusiastic* blogpost, while developers might train models to be *helpful* and *harmless* using LLM-based edits. The LLM's *operational semantics* of such subjective phrases---how it adjusts its behavior when each phrase is included in the prompt---thus dictates how aligned it is with human intent. In this work, we uncover instances of *misalignment* between LLMs' actual operational semantics and what humans expect. Our method, TED (thesaurus error detector), first constructs a thesaurus that captures whether two phrases have similar operational semantics according to the LLM. It then elicits failures by unearthing disagreements between this thesaurus and a human-constructed reference. TED routinely produces surprising instances of misalignment; for example, Mistral 7B Instruct produces more *harassing* outputs when it edits text to be *witty*, and Llama 3 8B Instruct produces *dishonest* articles when instructed to make the articles *enthusiastic*. Our results demonstrate that humans can uncover unexpected LLM behavior by scrutinizing relationships between abstract concepts, without supervising outputs directly.
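The abstract compresses TED into two steps: build an LLM-side thesaurus of operational semantics, then flag where it disagrees with a human reference. Below is a minimal Python sketch of that comparison logic only. Everything in it is a hypothetical stand-in: `llm_semantics` replaces the paper's measured operational semantics (which the authors derive from how the model's outputs shift when a phrase is added to the prompt), and the set-containment test is merely an illustrative proxy for whatever similarity judgment the actual method uses.

```python
# Sketch of TED-style disagreement detection (illustrative only; not the
# authors' implementation). We assume operational semantics have already
# been measured and reduced to a set of behavioral attributes per phrase.

from itertools import permutations

# Hypothetical measured semantics: attributes the model's outputs exhibit
# when each phrase is added to the prompt. The "surprising" entries mirror
# the failures reported in the abstract.
llm_semantics = {
    "witty":        {"humorous", "harassing"},   # unexpected side effect
    "enthusiastic": {"energetic", "dishonest"},  # unexpected side effect
    "humorous":     {"humorous"},
    "energetic":    {"energetic"},
    "harassing":    {"harassing"},
    "dishonest":    {"dishonest"},
}

# Human-constructed reference: phrase pairs a person expects to be related.
human_reference = {
    ("witty", "humorous"),
    ("enthusiastic", "energetic"),
}

def llm_entails(a: str, b: str) -> bool:
    """Proxy test: does prompting for `a` also produce `b`-like behavior?"""
    return llm_semantics[b] <= llm_semantics[a]

# Flag ordered pairs where the LLM thesaurus asserts a relationship that
# the human reference does not sanction.
for a, b in permutations(llm_semantics, 2):
    if llm_entails(a, b) and (a, b) not in human_reference:
        print(f"unexpected: prompting for '{a}' also yields '{b}' behavior")
```

Running this prints the two mismatches seeded above (witty/harassing and enthusiastic/dishonest), illustrating how disagreements between the two thesauruses surface candidate failures without anyone inspecting model outputs directly.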
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 5672