Testing English News Articles for Lexical Homogenization Due to Widespread Use of Large Language Models

Sarah Fitterer; Dominik Gangl; Jannes Ulbrich

Published: 22 Jun 2025, Last Modified: 17 Jul 2025, ACL-SRW 2025 Poster, CC BY 4.0

Keywords: language, AI, LLM, homogenization, linguistics, lexis

TL;DR: This study compares English news articles from 2018 and 2024 to test whether widespread LLM adoption has led to lexical homogenization, finding no significant drop in lexical diversity but clear signs of increased LLM-style vocabulary.

Abstract: It is widely assumed that Large Language Models (LLMs) are shaping language, with multiple studies noting the growing presence of LLM-generated content and suggesting homogenizing effects. However, it remains unclear if these effects are already evident in recent writing. This study addresses that gap by comparing two datasets of English online news articles -- one from 2018, prior to LLM popularization, and one from 2024, after widespread LLM adoption. We define lexical homogenization as a decrease in lexical diversity, measured by the MATTR, Maas, and MTLD metrics, and introduce the LLM-Style-Word Ratio (SWR) to measure LLM influence. We found higher MTLD and SWR scores, yet negligible changes in Maas and MATTR scores in the 2024 corpus. We conclude that while there is an apparent influence of LLMs on written online English, homogenization effects do not show in the measurements. We therefore propose to apply different metrics to measure lexical homogenization in future studies on the influence of LLM usage on language change.
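
For readers unfamiliar with the metrics named in the abstract, the following is a minimal sketch of two of them, using their standard textbook definitions rather than the authors' implementation. The window size of 50 is an assumed default. The abstract does not define SWR, so `style_word_ratio` and its word list are purely hypothetical illustrations of what such a ratio could look like. MTLD is omitted for brevity.

```python
import math

def mattr(tokens, window=50):
    """Moving-Average Type-Token Ratio: mean TTR over all sliding windows."""
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        # Text shorter than one window: fall back to plain TTR.
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)

def maas(tokens):
    """Maas index a^2 = (log N - log V) / (log N)^2; lower means more diverse."""
    n, v = len(tokens), len(set(tokens))
    if n < 2:
        return 0.0  # log(1) = 0 would cause a division by zero
    return (math.log(n) - math.log(v)) / math.log(n) ** 2

def style_word_ratio(tokens, style_words):
    """Hypothetical SWR (assumption, not the paper's definition):
    share of tokens that appear in an assumed LLM-style word list."""
    if not tokens:
        return 0.0
    return sum(t in style_words for t in tokens) / len(tokens)
```

Note that the two diversity metrics point in opposite directions: a drop in diversity would show up as a lower MATTR but a higher Maas value.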

Archival Status: Archival

ACL Copyright Transfer: pdf

Paper Length: Short Paper (up to 4 pages of content)

Submission Number: 349
