Abstract: Data augmentation (DA) plays a vital role in improving model performance in low-resource or limited-supervision scenarios within Natural Language Processing (NLP).
Existing DA methods, such as synonym replacement and back-translation, are effective at the word or sentence level; however, they frequently neglect discourse-level coherence and logical flow, which are essential for complex tasks that depend on inter-sentential relationships.
In this paper, we propose a structure-preserving document-level data augmentation framework based on large language models (LLMs) and fine-grained discourse structure parsing.
Our approach identifies rhetorical relations between sentence pairs and extracts key phrases, which are then replaced with topic-unrelated content while preserving the original discourse structure.
Experimental results on text summarization and question answering show that models trained on data augmented by our method consistently outperform the baseline, demonstrating the effectiveness of structure-preserving data augmentation for document-level NLP tasks.
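The idea summarized in the abstract (label the rhetorical relation of each sentence, swap key phrases for content from another topic, and keep the discourse structure intact) can be illustrated with a minimal sketch. It is not the paper's implementation: the connective table `CONNECTIVES`, the relation names, the helper functions, and the hand-written `phrase_map` are all hypothetical stand-ins for the discourse parser and the LLM described above.

```python
import re
from dataclasses import dataclass

# Hypothetical connective-to-relation table (an assumption for illustration);
# the paper uses a fine-grained discourse parser rather than keyword matching.
CONNECTIVES = {
    "however": "Contrast",
    "because": "Cause",
    "therefore": "Result",
    "for example": "Elaboration",
}


@dataclass
class Sentence:
    text: str
    relation: str  # relation signalled by a connective in this sentence, or "None"


def split_sentences(document):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p.strip() for p in parts if p.strip()]


def label_relation(sentence):
    """Return the relation of the first known connective found, else 'None'."""
    lowered = sentence.lower()
    for connective, relation in CONNECTIVES.items():
        if re.search(r"\b" + re.escape(connective) + r"\b", lowered):
            return relation
    return "None"


def relations(document):
    """Sequence of relation labels, one per sentence."""
    return [Sentence(s, label_relation(s)).relation for s in split_sentences(document)]


def augment(document, phrase_map):
    """Replace key phrases with new content, leaving connectives untouched.

    phrase_map is hand-written here; in the paper an LLM supplies the
    replacement content.
    """
    out = document
    for old, new in phrase_map.items():
        out = re.sub(r"\b" + re.escape(old) + r"\b", new, out)
    return out


doc = (
    "The battery drains quickly. However, the screen is excellent. "
    "Therefore, reviewers still recommend the phone."
)
phrase_map = {"battery": "engine", "screen": "interior", "phone": "car"}
augmented = augment(doc, phrase_map)

print(augmented)
print(relations(doc) == relations(augmented))
```

Because only the key phrases change and the connectives are left in place, the relation sequence of the augmented document matches that of the original, which is the property the framework aims to preserve.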
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: NLP Applications, Summarization, Question Answering, Discourse and Pragmatics
Languages Studied: English, Chinese
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 4.1
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: N/A
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 4.3
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 4.3
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 5.1
C4 Parameters For Packages: Yes
C4 Elaboration: Section 4
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: yes
Submission Number: 313