Improving Question Generation Quality for Educational LLMs: Evaluation Framework Construction and Fine-Tuning Optimization

ACL ARR 2026 January Submission4712 Authors

05 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: Educational NLP, Large Language Model-based Question Generation, Benchmark Evaluation, Long Chain-of-Thought Reasoning
Abstract: Large language models (LLMs) are increasingly applied in education, and LLM-based question generation has attracted growing attention because it can save educators' time and enable personalized learning. However, existing studies mostly focus on the local rationality of model-generated content and lack a systematic comparison of the overall characteristics of generated questions against human-crafted ones. This work proposes an evaluation framework covering both the content and the form of questions, which comprehensively measures the gap between LLM-based and human question generation, and puts forward a series of improvement methods targeting this gap. Specifically, we measure the differences between human- and model-generated questions along seven dimensions. Based on these differences, we propose a zero-shot method, Chain-of-Thought Prompting for Question Generation (CPQG), that does not rely on external knowledge bases. By combining chain-of-thought reasoning with prompt engineering, CPQG significantly improves the model's own question generation quality. Extensive experiments show that CPQG substantially narrows the gap between model- and human-generated questions: it enables 7B-sized models to outperform the baseline by an average of 10% and even surpass GPT-4 in multiple dimensions.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: educational applications, mathematical NLP
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 4712