Abstract: The success of Large Language Models (LLMs) in other domains has raised the question of whether LLMs can reliably assess and manipulate the \textit{readability} of text. We approach this question empirically. First, using a published corpus of 4,724 English text excerpts, we find that readability estimates produced "zero-shot" by GPT-4 Turbo exhibit relatively high correlation with human judgments ($r = 0.76$), outperforming estimates derived from traditional readability formulas. Then, in a pre-registered human experiment ($N = 59$), we ask whether Turbo can reliably make text easier or harder to read. We find evidence that it can, though considerable variance in human judgments remains unexplained. We conclude by discussing the limitations of this approach, including concerns about data contamination, as well as the validity of the "readability" construct and its dependence on context, audience, and goal.
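As a rough illustration of the zero-shot setup described in the abstract, the sketch below prompts a chat model for a numeric readability rating, then correlates the model's ratings with human judgments and, for comparison, with a traditional readability formula (Flesch Reading Ease via the textstat package). The prompt wording, the 1-100 scale, the "gpt-4-turbo" model identifier, and all example data are illustrative assumptions, not the authors' actual materials.

```python
# A minimal sketch, NOT the authors' pipeline: prompt wording, the 1-100 scale,
# the "gpt-4-turbo" model name, and all example data are assumptions.
from openai import OpenAI
from scipy.stats import pearsonr
import textstat

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def zero_shot_readability(excerpt: str) -> float:
    """Ask the chat model for a single numeric readability rating."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # assumed identifier for the Turbo model
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "On a scale from 1 (very difficult to read) to 100 "
                "(very easy to read), rate the readability of the text "
                f"below. Respond with a single number only.\n\n{excerpt}"
            ),
        }],
    )
    # A real pipeline would parse the reply more defensively.
    return float(response.choices[0].message.content.strip())


# Toy data standing in for the 4,724-excerpt corpus and its human judgments.
excerpts = [
    "The cat sat on the warm mat and slept.",
    "The committee deliberated at length before reaching a verdict.",
    "Notwithstanding the aforementioned epistemological caveats, the "
    "ramifications remain underdetermined.",
]
human_ratings = [92.0, 61.0, 18.0]  # hypothetical human readability scores

model_ratings = [zero_shot_readability(t) for t in excerpts]
formula_scores = [textstat.flesch_reading_ease(t) for t in excerpts]  # baseline

r_model, _ = pearsonr(model_ratings, human_ratings)
r_formula, _ = pearsonr(formula_scores, human_ratings)
print(f"model vs. humans:   r = {r_model:.2f}")
print(f"formula vs. humans: r = {r_formula:.2f}")
```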
Paper Type: Short
Research Area: NLP Applications
Research Area Keywords: readability, educational applications, large language models
Contribution Types: Data analysis
Languages Studied: English
Submission Number: 120