Abstract: Most state-of-the-art keyphrase generation methods are based on Seq2Seq models and rely on large-scale annotated data. In this paper, we design a data expansion technique, DEKG, that exploits large-scale unlabeled documents for resource-constrained domains where only a small amount of annotated data is available. DEKG consists of two main parts: Contextual Multi-dimensional Generative Data Augmentation (CMGDA) and High-quality Pseudo-label Acquisition with Dual Model Filtering (HPADF). CMGDA first trains two generative models to augment annotated samples along two dimensions, present keyphrases and sentences, without changing the distribution of the number of present and absent keyphrases. Next, HPADF combines the synthetic data generated by CMGDA with the original annotated samples to train two differently initialized keyphrase generation models. HPADF then assigns pseudo-labels to unlabeled samples, filtering those labels based on the predictions of the two keyphrase generation models, thereby creating more synthetic data for model training. Finally, we test DEKG comprehensively on five datasets and show that DEKG consistently improves state-of-the-art performance.
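The dual-model filtering step in HPADF can be illustrated with a minimal sketch. The abstract does not state the filtering criterion, so the agreement rule used here (keep only keyphrases that both models predict) is an assumption, and the names `dual_model_filter`, `pseudo_label`, and `min_keep` are hypothetical. The two models are stand-in callables that map a document to a list of predicted keyphrases.

```python
from typing import Callable, List, Tuple

Predictor = Callable[[str], List[str]]


def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace so surface variants compare equal."""
    return " ".join(phrase.lower().split())


def dual_model_filter(preds_a: List[str], preds_b: List[str]) -> List[str]:
    """Keep keyphrases predicted by both models (assumed agreement rule).

    The result follows the order of model A's predictions, without duplicates.
    """
    seen_b = {normalize(p) for p in preds_b}
    kept: List[str] = []
    emitted = set()
    for phrase in preds_a:
        key = normalize(phrase)
        if key in seen_b and key not in emitted:
            kept.append(key)
            emitted.add(key)
    return kept


def pseudo_label(
    docs: List[str],
    model_a: Predictor,
    model_b: Predictor,
    min_keep: int = 1,  # hypothetical threshold: drop documents with too few agreed labels
) -> List[Tuple[str, List[str]]]:
    """Attach filtered pseudo-labels to unlabeled documents."""
    labeled = []
    for doc in docs:
        kept = dual_model_filter(model_a(doc), model_b(doc))
        if len(kept) >= min_keep:
            labeled.append((doc, kept))
    return labeled
```

Under this assumed rule, a document on which the two models share no predictions is discarded rather than added to the synthetic training set; other criteria, such as score thresholds, would fit the same interface.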