Refer to the Reference: Reference-focused Synthetic Automatic Post-Editing Data Generation

Published: 2025, Last Modified: 06 Feb 2025, COLING 2025, CC BY-SA 4.0
Abstract: A prevalent approach to synthetic automatic post-editing (APE) data generation uses source (src) sentences in a parallel corpus to obtain translations (mt) through an MT system and treats the corresponding reference (ref) sentences as post-edits (pe). While this approach is effective, ‘mt’ and ‘pe’ are produced independently of each other, so these translations do not adequately reflect the errors that a human post-editor would correct. Thus, we introduce a novel, simple yet effective reference-focused synthetic APE data generation technique that uses ‘ref’ instead of ‘src’ sentences to obtain corrupted translations (mt_new). Experimental results across the English-German, English-Russian, English-Marathi, English-Hindi, and English-Tamil language pairs demonstrate that APE systems trained on the newly generated synthetic data outperform those trained on existing synthetic data. Further, APE models trained on a balanced mix of existing and newly generated synthetic data achieve improvements of 0.37, 0.19, 1.01, 2.42, and 2.60 TER points on these language pairs, respectively. We will release the generated synthetic APE data.
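The contrast between the two data generation schemes can be illustrated with a minimal sketch. The abstract does not specify how ‘ref’ is turned into ‘mt_new’, so the corruption step is passed in as a callable; `drop_last_word` is a toy placeholder and not the paper's method. The names `Triplet`, `existing_triplets`, `reference_focused_triplets`, and `balanced_mix` are hypothetical, and reading "balanced mix" as equal counts from each set is an assumption.

```python
import random
from typing import Callable, List, NamedTuple, Tuple


class Triplet(NamedTuple):
    """One synthetic APE training example."""
    src: str
    mt: str
    pe: str


def existing_triplets(pairs: List[Tuple[str, str]],
                      translate: Callable[[str], str]) -> List[Triplet]:
    # Existing scheme: mt comes from translating src, pe is the reference.
    return [Triplet(src, translate(src), ref) for src, ref in pairs]


def reference_focused_triplets(pairs: List[Tuple[str, str]],
                               corrupt: Callable[[str], str]) -> List[Triplet]:
    # Reference-focused scheme: mt_new is derived from ref itself,
    # so mt_new and pe are no longer independent.
    return [Triplet(src, corrupt(ref), ref) for src, ref in pairs]


def balanced_mix(existing: List[Triplet], new: List[Triplet],
                 seed: int = 0) -> List[Triplet]:
    # Assumption: "balanced" means the same number of examples from each set.
    n = min(len(existing), len(new))
    rng = random.Random(seed)
    mixed = rng.sample(existing, n) + rng.sample(new, n)
    rng.shuffle(mixed)
    return mixed


def drop_last_word(sentence: str) -> str:
    # Toy corruption for illustration only (NOT the paper's method).
    words = sentence.split()
    return " ".join(words[:-1]) if len(words) > 1 else sentence
```

In practice, `translate` would wrap an MT system and `corrupt` would wrap whatever process the authors use to obtain ‘mt_new’ from ‘ref’; both are left abstract here because the abstract gives no further detail.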