How Learning from Human Feedback Influences the Lexical Choices of Large Language Models

ACL ARR 2025 February Submission 4918 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · License: CC BY 4.0
Abstract: Large Language Models (LLMs) are known to overuse certain terms, such as "delve" and "intricate," but the exact reasons for these lexical choices have remained unclear. This study investigates the contribution of Learning from Human Feedback (LHF), under which we subsume Reinforcement Learning from Human Feedback and Direct Preference Optimization. We present a straightforward procedure for detecting potentially LHF-induced lexical preferences of LLMs. We then link LHF to lexical overuse more conclusively than previous work by experimentally emulating the LHF procedure and demonstrating that participants systematically prefer text variants containing certain words. To address the overuse of such words, developers now have a clear starting point: LHF datasets. This lexical overuse may be seen as a form of misalignment, though our study highlights a potential divergence between the lexical expectations of different populations, namely LHF workers versus LLM users. Possible causes of this divergence include demographic differences and/or features of the feedback solicitation task. Our work challenges the view of artificial neural networks as impenetrable black boxes and underscores the importance of both data and procedural transparency in alignment research.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: data ethics, model bias/fairness evaluation, ethical considerations in NLP applications, transparency
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 4918