RepQA: Evaluating Readability of Large Language Models in Patient Education Question Answering

20 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: Readability, Public Health, Patient Education
Abstract: Large Language Models (LLMs) have shown great promise in addressing complex medical and clinical problems. However, while most prior studies focus on improving accuracy and reasoning ability, a significant bottleneck in developing effective healthcare agents lies in the readability of LLM-generated responses, specifically their ability to answer patient education questions clearly for people without a medical background. In this work, we introduce RepQA, a benchmark designed to systematically evaluate the readability of LLMs in patient education question answering (QA). RepQA comprises 533 expert-reviewed QA pairs sourced from 27 online resources, reflecting common concerns of lay users across four categories. RepQA incorporates a proxy multiple-choice QA task to directly assess the informativeness of generated responses, alongside two readability metrics. We present a comprehensive study of 25 LLMs on RepQA, evaluating both instruction following (answering at requested reading levels) and readability understanding (recognizing the reading level of questions and texts). We find that current models fail to hit target reading levels and struggle to identify appropriate ones, revealing a gap between raw reasoning ability and effective communication. We then compare four approaches for improving readability: standard prompting, chain-of-thought prompting, Group Relative Policy Optimization (GRPO), and a token-adapted GRPO. While readability-aware post-training substantially improves readability metrics, it often reduces QA accuracy due to over-simplification, exposing a significant readability–accuracy trade-off. These findings point toward methods for building more user-centered public health agents.
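The abstract does not specify which two readability metrics RepQA uses. As a concrete illustration only, the sketch below computes the standard Flesch-Kincaid Grade Level score, one widely used readability metric of the kind such an evaluation could employ; the function names and the syllable heuristic are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch: Flesch-Kincaid Grade Level, a standard readability metric.
# This is NOT RepQA's code; names and the syllable heuristic are assumptions.
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count vowel groups; every word has at least one syllable.
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1  # drop a typically silent trailing 'e'
    return max(n, 1)

def fk_grade(text: str) -> float:
    # Flesch-Kincaid Grade Level:
    #   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

# Example: score a model-generated patient education answer, so its estimated
# grade level can be compared against a requested reading level.
answer = "Take the medicine with food. Call your doctor if the pain stays."
print(round(fk_grade(answer), 1))
```

A score near a requested grade level (e.g., 6 for a sixth-grade target) would count as following the readability instruction under this illustrative metric.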
Primary Area: datasets and benchmarks
Submission Number: 23321