Abstract: Reinforcement Learning from Human Feedback (RLHF) is an increasingly popular post-training procedure for Large Language Models (LLMs), used to better align outputs with human values and increase output quality. As LLMs are increasingly adopted and refined for various modes of natural language communication, one might expect this audience-driven optimization to make their language converge toward that of human speakers. We therefore investigate, through an information-theoretic lens, how fine-tuning and RLHF change the "naturalness" of language in newer LLMs. Building on the Uniform Information Density (UID) Hypothesis, which posits that humans optimize their language production to transfer information uniformly across a noisy channel, we analyze and compare how information is distributed within model-generated and human-generated text across various domains. Using two primary metrics of information uniformity, surprisal variance and local consistency, we find that RLHF appears to reduce variance in information rates across generations, while fine-tuning decreases uniformity, shifting distributions slightly toward those of human-generated text. However, models still exhibit significantly superhuman uniformity across various domains of text. Our results reveal that while modern LLM training and fine-tuning paradigms have made progress in approximating human-like information distributions, systematic differences persist.
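The abstract names two uniformity metrics without defining them; below is a minimal sketch of how such metrics could be computed from per-token surprisals (assumed already obtained from a language model as negative log-probabilities). The function names and the exact formulations, particularly the squared-difference form of local consistency, are illustrative assumptions and may differ from the paper's definitions.

```python
import numpy as np

def surprisal_variance(surprisals):
    """Variance of per-token surprisals within a text;
    lower values suggest more uniform information density."""
    s = np.asarray(surprisals, dtype=float)
    return float(np.var(s))

def local_consistency(surprisals):
    """Mean squared difference between consecutive token surprisals;
    lower values suggest smoother local information flow (cf. UID).
    Illustrative formulation, not necessarily the paper's exact metric."""
    s = np.asarray(surprisals, dtype=float)
    return float(np.mean(np.diff(s) ** 2))

# Example: hypothetical surprisals (in bits) for the tokens of one sentence,
# e.g. -log2 p(token | context) under some language model.
token_surprisals = [3.1, 2.8, 5.6, 1.2, 4.0, 3.3]
print(surprisal_variance(token_surprisals))
print(local_consistency(token_surprisals))
```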
Paper Type: Long
Research Area: Linguistic theories, Cognitive Modeling and Psycholinguistics
Research Area Keywords: cognitive modeling, language modeling, computational psycholinguistics, uniform information density, information theory
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Keywords: cognitive modeling, language modeling, computational psycholinguistics, uniform information density, information theory
Submission Number: 5874