Abstract: The scarcity of question-answering data is one of the main bottlenecks restricting the development of intelligent education systems. In this paper, we propose a new method called Book2QA, which integrates multiple medium-scale language models (e.g., 6B/13B) to cost-effectively generate high-quality question-answering data from textbook content. The Book2QA framework includes three main steps: book data preprocessing, question generation with subsequent filtering, and answer generation with subsequent filtering. Our experimental results demonstrate the fine-tuned model's performance in real scenarios, highlighting the effectiveness of the Book2QA method. Automatic evaluation and advanced LLM evaluation show that data generated by Book2QA can match or surpass data from models with hundreds of billions of parameters. We open-source our data and code at https://anonymous.4open.science/r/Book2QA-F795.
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: Question Answering, Efficient/Low-Resource Methods for NLP, Generation, NLP Applications
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data resources
Languages Studied: English, Chinese
Submission Number: 1136