Improving Question Generation Quality for Educational LLMs: Evaluation Framework Construction and Fine-Tuning Optimization

ACL ARR 2026 January Submission4712 Authors

05 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: Educational NLP, Large Language Model-based Question Generation, Benchmark Evaluation, Long Chain-of-Thought Reasoning
Abstract: Large language models (LLMs) are increasingly applied in education, and LLM-based question generation has attracted growing attention because it can save educators' time and enable personalized learning. However, existing studies mostly focus on the local rationality of model-generated content and lack a systematic comparison of the overall characteristics of generated questions against human-crafted ones. This work proposes an evaluation framework covering both the content and the form of questions, which comprehensively measures the gap between LLM-based and human question generation, and puts forward a series of improvement methods targeting this gap. Specifically, we measure the differences between human- and model-generated questions along seven dimensions. Based on these differences, we propose a zero-shot method, Chain-of-Thought Prompting for Question Generation (CPQG), that does not rely on external knowledge bases. By combining chain-of-thought reasoning with prompt engineering, CPQG significantly improves the model's own question generation quality. Extensive experiments show that CPQG substantially narrows the gap between model- and human-generated questions: it enables 7B-sized models to outperform the baseline by an average of 10% and even surpass GPT-4 in multiple dimensions.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: educational applications, mathematical NLP
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 4712