Self-Trained and Self-Purified Data Augmentation for RST Discourse Parsing

Published: 13 Feb 2025, Last Modified: 28 Mar 2025 · TASLP · CC BY 4.0
Abstract: Rhetorical structure parsing has faced significant challenges over the past decade because of data scarcity. To alleviate this problem, previous research has explored human-engineered features, cross-lingual information, and pre-trained language models, among others, with some success. In this work, we aim to relieve the corpus-size limitation through data augmentation (DA). Specifically, we introduce a novel method that combines a top-down parser trained on the small-scale RST-DT corpus with the large-scale Reuters data in a self-training fashion. In particular, we harness an adversarially trained data filter to greedily purify the generated silver data and thus achieve high-quality data augmentation. Notably, the prior rhetorical-structure knowledge memorized by the teacher system and the data purifier is not fixed; it is continuously updated as self-training proceeds. The overall learning process is implemented end-to-end, does not depend on any external discourse parser, and needs minimal human intervention. Experimental results on RST-DT demonstrate that our RST parser, when enhanced with the proposed DA method, significantly outperforms the baseline systems on both gold-standard and automatic EDUs.
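To make the abstract's self-training-with-purification loop concrete, here is a minimal sketch, assuming a hypothetical API: the names TopDownParser, AdversarialFilter, train, parse, fit, and score, and the keep_ratio parameter are all illustrative stand-ins, not the authors' actual implementation. The sketch shows the three steps the abstract describes: a teacher parser labels unlabeled documents, an adversarially trained filter greedily keeps the most gold-like silver pairs, and a student retrained on gold plus purified silver becomes the next round's teacher.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Tree:
    """Placeholder for an RST constituency tree over EDUs."""
    spans: List[Tuple[int, int]] = field(default_factory=list)

class TopDownParser:
    """Stand-in for the top-down RST parser (teacher/student)."""
    def train(self, documents, trees): ...
    def parse(self, document) -> Tree:
        return Tree()

class AdversarialFilter:
    """Stand-in for the adversarially trained purifier: it scores
    how 'gold-like' a silver (document, tree) pair looks."""
    def fit(self, gold_pairs, silver_pairs): ...
    def score(self, document, tree) -> float:
        return 0.0

def self_train(gold_docs, gold_trees, unlabeled_docs,
               rounds=3, keep_ratio=0.5):
    # Teacher is first trained on the small gold corpus (RST-DT).
    parser = TopDownParser()
    parser.train(gold_docs, gold_trees)
    purifier = AdversarialFilter()

    for _ in range(rounds):
        # 1) Teacher labels the large unlabeled corpus (e.g. Reuters),
        #    producing silver (document, tree) pairs.
        silver = [(d, parser.parse(d)) for d in unlabeled_docs]

        # 2) The purifier is refreshed to discriminate gold from silver
        #    pairs, then the silver data is greedily filtered: only the
        #    top-scoring (most gold-like) fraction is kept.
        purifier.fit(list(zip(gold_docs, gold_trees)), silver)
        silver.sort(key=lambda p: purifier.score(*p), reverse=True)
        kept = silver[: int(keep_ratio * len(silver))]

        # 3) A student retrains on gold + purified silver and serves
        #    as the next round's teacher, so the knowledge held by the
        #    teacher and the purifier keeps being updated.
        docs = list(gold_docs) + [d for d, _ in kept]
        trees = list(gold_trees) + [t for _, t in kept]
        parser = TopDownParser()
        parser.train(docs, trees)
    return parser
```

The greedy top-k filtering step is one plausible reading of "purify the generated silver data greedily"; the paper itself should be consulted for the exact selection criterion and schedule.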