Abstract: According to the internationally recognized PIRLS (Progress in International Reading Literacy Study) assessment framework, reading comprehension questions should encompass all four comprehension processes: retrieval, inference, integration, and evaluation. This paper investigates whether Large Language Models can produce high-quality questions for each of these categories. Human assessment on a Chinese dataset shows that GPT-4o can generate usable, category-specific questions, with accuracy ranging from 74% to 90% depending on the category.
Paper Type: Short
Research Area: Generation
Research Area Keywords: human evaluation
Contribution Types: NLP engineering experiment
Languages Studied: Chinese
Submission Number: 803