Abstract: Reinforcement Learning from Human Feedback (RLHF) is an increasingly popular post-training procedure for Large Language Models (LLMs), used to better align outputs with human values and increase output quality. As LLMs are increasingly adopted and refined for various modes of natural language communication, one might expect this audience-driven optimization to make their language converge toward that of human speakers. We therefore investigate, through an information-theoretic lens, how fine-tuning and RLHF change the "naturalness" of language in newer LLMs. Building on the Uniform Information Density (UID) Hypothesis, which posits that humans optimize their language production to transfer information uniformly across a noisy channel, we analyze and compare how information is distributed within model-generated and human-generated text across various domains. Using two primary metrics of information uniformity, surprisal variance and local consistency, we find that RLHF appears to reduce variance in information rates across generations, while fine-tuning decreases uniformity, shifting distributions slightly toward those of human-generated text. However, models still exhibit significantly superhuman uniformity across various domains of text. Our results reveal that while modern LLM training and fine-tuning paradigms have made progress in approximating human-like information distributions, systematic differences persist.
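The abstract names two uniformity metrics without defining them; below is a minimal sketch of how such metrics could be computed from per-token surprisals (assumed already obtained from a language model as negative log-probabilities). The function names and the exact formulations, particularly the squared-difference form of local consistency, are illustrative assumptions and may differ from the paper's definitions.

```python
import numpy as np

def surprisal_variance(surprisals):
    """Variance of per-token surprisals within a text;
    lower values suggest more uniform information density."""
    s = np.asarray(surprisals, dtype=float)
    return float(np.var(s))

def local_consistency(surprisals):
    """Mean squared difference between consecutive token surprisals;
    lower values suggest smoother local information flow (cf. UID).
    Illustrative formulation, not necessarily the paper's exact metric."""
    s = np.asarray(surprisals, dtype=float)
    return float(np.mean(np.diff(s) ** 2))

# Example: hypothetical surprisals (in bits) for the tokens of one sentence,
# e.g. -log2 p(token | context) under some language model.
token_surprisals = [3.1, 2.8, 5.6, 1.2, 4.0, 3.3]
print(surprisal_variance(token_surprisals))
print(local_consistency(token_surprisals))
```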
Paper Type: Long
Research Area: Linguistic theories, Cognitive Modeling and Psycholinguistics
Research Area Keywords: cognitive modeling, language modeling, computational psycholinguistics, uniform information density, information theory
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Keywords: cognitive modeling, language modeling, computational psycholinguistics, uniform information density, information theory
Submission Number: 5874