- TL;DR: We propose Answer-containing Sentence Generation (ASGen), a novel pre-training method for generating synthetic data for machine reading comprehension.
- Abstract: Numerous machine reading comprehension (MRC) datasets often involve manual annotation, requiring enormous human effort, and hence the size of the dataset remains significantly smaller than the size of the data available for unsupervised learning. Recently, researchers proposed a model for generating synthetic question-and-answer data from large corpora such as Wikipedia. This model is utilized to generate synthetic data for training an MRC model before fine-tuning it using the original MRC dataset. This technique shows better performance than other general pre-training techniques such as language modeling, because the characteristics of the generated data are similar to those of the downstream MRC data. However, it is difficult to have high-quality synthetic data comparable to human-annotated MRC datasets. To address this issue, we propose Answer-containing Sentence Generation (ASGen), a novel pre-training method for generating synthetic data involving two advanced techniques, (1) dynamically determining K answers and (2) pre-training the question generator on the answer-containing sentence generation task. We evaluate the question generation capability of our method by comparing the BLEU score with existing methods and test our method by fine-tuning the MRC model on the downstream MRC data after training on synthetic data. Experimental results show that our approach outperforms existing generation methods and increases the performance of the state-of-the-art MRC models across a range of MRC datasets such as SQuAD-v1.1, SQuAD-v2.0, KorQuAD and QUASAR-T without any architectural modifications to the original MRC model.
- Keywords: Question Answering, Machine Reading Comprehension, Data Augmentation, Question Generation, Answer Generation