Advantage Actor-Critic Training Framework Leveraging Lookahead Rewards for Automatic Question Generation

TMLR Paper 1074 Authors

19 Apr 2023 (modified: 17 Sept 2024) · Rejected by TMLR · CC BY 4.0
Abstract: Existing approaches in Automatic Question Generation (AQG) train sequence-to-sequence (seq2seq) models to generate questions from input passages and answers using the teacher-forcing algorithm, a supervised learning method, which results in exposure bias and a mismatch between training and testing evaluation measures. Several works have also attempted to train seq2seq models for AQG using reinforcement learning, leveraging Monte-Carlo return-based policy gradient (PG) methods such as REINFORCE with baseline. However, such Monte-Carlo return-based PG methods depend on sentence-level rewards, which restricts training to sparse, high-variance global reward signals. Temporal difference (TD) learning-based Actor-Critic methods can provide finer-grained training signals for text-generation tasks by leveraging subsequence-level information. However, only a few works have explored Actor-Critic methods for text generation, because training seq2seq models stably with such TD methods poses an additional challenge. Another severe issue is the intractable action space induced by the vocabulary size, a bottleneck inherent in all natural language generation (NLG) tasks. This work proposes an Advantage Actor-Critic training framework that trains seq2seq models for AQG efficiently and stably using subsequence-level information. The proposed training framework also addresses the problems of exposure bias, evaluation-measure mismatch, and global rewards through autoregressive token generation, BLEU-based task optimization, and question prefix-based Critic signals, and provides a workaround for the intractable action space bottleneck by leveraging relevant ideas from existing supervised learning and reinforcement learning literature. The training framework uses an off-policy approach for training the Critic, which prevents the Critic from overfitting to highly correlated on-policy training samples. The off-policy Critic training also uses an explicit division of high-reward and low-reward experiences, which further improves the training process. In this work, we conduct experiments on multiple datasets from QG-Bench to show how the different components of our proposed Advantage Actor-Critic training framework work together to improve the quality of the questions generated by seq2seq models, both by including the necessary contextual information and by ensuring that the generated questions have a high degree of surface-level similarity with the ground truth.
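To make the abstract's core idea concrete, the following is a minimal sketch (not the authors' code) of an advantage actor-critic update for autoregressive token generation: the Critic scores each question prefix, TD(0) targets yield per-token advantages, and the actor is updated with those advantages instead of a single sentence-level return. The tiny GRU actor and critic, and the `prefix_reward` helper (a hypothetical stand-in for a prefix-level reward such as an incremental BLEU score against the reference question), are illustrative assumptions only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, EMB, HID, MAX_LEN, GAMMA = 100, 32, 64, 12, 0.99

class Actor(nn.Module):
    """Toy autoregressive policy: one GRU cell producing next-token logits."""
    def __init__(self):
        super().__init__()
        self.emb, self.rnn, self.out = nn.Embedding(VOCAB, EMB), nn.GRUCell(EMB, HID), nn.Linear(HID, VOCAB)
    def step(self, tok, h):
        h = self.rnn(self.emb(tok), h)
        return self.out(h), h                      # logits over next token, new state

class Critic(nn.Module):
    """Toy value network: scores the question prefix generated so far."""
    def __init__(self):
        super().__init__()
        self.emb, self.rnn, self.val = nn.Embedding(VOCAB, EMB), nn.GRUCell(EMB, HID), nn.Linear(HID, 1)
    def step(self, tok, h):
        h = self.rnn(self.emb(tok), h)
        return self.val(h).squeeze(-1), h          # value of the current prefix

def prefix_reward(prefix, reference):
    """Hypothetical prefix-level reward, e.g. BLEU(prefix_t) - BLEU(prefix_{t-1})."""
    return torch.rand(())                          # placeholder signal for this sketch

actor, critic = Actor(), Critic()
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-3)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)

reference = [3, 7, 9, 2]                           # toy ground-truth question tokens
tok = torch.zeros(1, dtype=torch.long)             # BOS token id 0
h_a, h_c = torch.zeros(1, HID), torch.zeros(1, HID)

log_probs, values, rewards, prefix = [], [], [], []
for _ in range(MAX_LEN):                           # autoregressive rollout (no teacher forcing)
    logits, h_a = actor.step(tok, h_a)
    dist = torch.distributions.Categorical(logits=logits)
    tok = dist.sample()
    prefix.append(tok.item())
    v, h_c = critic.step(tok, h_c)
    log_probs.append(dist.log_prob(tok))
    values.append(v)
    rewards.append(prefix_reward(prefix, reference))

# TD(0) targets and per-token advantages from prefix-level Critic values.
values = torch.stack(values).squeeze(-1)
returns, advs, next_v = [], [], torch.zeros(())
for t in reversed(range(len(rewards))):
    target = rewards[t] + GAMMA * next_v
    advs.append(target - values[t])
    returns.append(target)
    next_v = values[t].detach()
advs, returns = torch.stack(advs[::-1]), torch.stack(returns[::-1])

actor_loss = -(torch.stack(log_probs).squeeze(-1) * advs.detach()).mean()
critic_loss = F.mse_loss(values, returns.detach())

opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
```

In the framework described by the abstract, the Critic would additionally be trained off-policy from stored rollouts, with high-reward and low-reward experiences kept apart; the sketch above only shows the on-policy advantage actor-critic step.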
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Matthew_Walter1
Submission Number: 1074