Abstract: Data augmentation is commonly used for training in low-resource scenarios. However, there is sometimes a large discrepancy between the distributions of the augmented data and the target data. How can we bridge the gap between augmented and target data, especially when the target data is harder to learn? In this paper, we study improved data augmentation strategies for scientific slide text summarization, where we generate a textual summary from the text of presentation slides. Since slides are messy and difficult for current models to understand, we introduce an easier form of data, i.e., articles written in natural language. The basic idea is to generate transition data that lies between slides and articles, so that the three types of data together form a curriculum from which neural models learn the distribution transition from article data to slide data. We find that our approach achieves consistent improvements across different backbone summarization models. The curriculum-oriented data augmentation method can generate data that fills the gap between the easy-to-obtain data and the low-resource task data. We show that curriculum learning and data augmentation can be combined to help NLP models learn from otherwise hard-to-learn data.