An Exam for Anyone on Anything: LLM-Based Textbook Data Transformation

ACL ARR 2024 June Submission2660 Authors

15 Jun 2024 (modified: 19 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Supervised learning traditionally depends on labeled data collected and organized for specific tasks. Producing these datasets is generally time-consuming, costly, and error-prone. Large language models (LLMs) demonstrate a remarkable ability to produce well-formatted data, which could revolutionize the dataset construction process. In this paper, we propose an LLM-based data transformation pipeline that generates multiple-choice question-answer (MCQA) data from raw sources such as textbooks. We further extend this process with a pseudo-open-book reasoning approach, in which student LLMs are trained to first recreate the original textbook excerpts used to generate the questions before answering them. We evaluate our methods with the Llama2 13B model on domain-specific subsections of the MMLU test set and observe an improvement of up to 18.8 percentage points in test accuracy, from 45.8% to 64.6%, without accessing the corresponding MMLU training set.
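As a rough illustration of the transformation step described in the abstract (a minimal sketch, not the authors' actual pipeline; the prompt template and output format below are assumptions), one might prompt an LLM to turn a textbook excerpt into a structured MCQA item and then parse the model's formatted response:

```python
import re
from dataclasses import dataclass

@dataclass
class MCQA:
    question: str
    choices: list   # four option strings, A through D
    answer: str     # correct option letter

def build_generation_prompt(excerpt: str) -> str:
    # Hypothetical prompt template asking the LLM to emit a fixed format.
    return (
        "Read the textbook excerpt below and write one multiple-choice "
        "question with four options (A-D) and the correct letter.\n\n"
        f"Excerpt:\n{excerpt}\n\n"
        "Format:\nQuestion: ...\nA) ...\nB) ...\nC) ...\nD) ...\nAnswer: <letter>"
    )

def parse_mcqa(llm_output: str) -> MCQA:
    # Extract the question, the four lettered options, and the answer key
    # from an LLM response that follows the requested format.
    question = re.search(r"Question:\s*(.+)", llm_output).group(1).strip()
    choices = re.findall(r"^[A-D]\)\s*(.+)$", llm_output, flags=re.M)
    answer = re.search(r"Answer:\s*([A-D])", llm_output).group(1)
    return MCQA(question, choices, answer)
```

In the pseudo-open-book variant, the student model's training target would additionally prepend the original excerpt before the answer, so the model learns to recreate the source passage first.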
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: text-to-text generation, reasoning, question generation
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 2660