From Babel to Brilliance: De-Noising Techniques for Cross-Lingual Sentence-Difficulty Classifiers

ACL ARR 2025 February Submission 3291 Authors

15 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Noisy training data can significantly degrade the performance of classifiers based on language models, particularly in applications such as readability assessment, content moderation, and language learning tools. This study investigates the use of several de-noising techniques for sentence-level difficulty detection, using a training set derived from document-level difficulty annotations. In addition to monolingual de-noising, we address the cross-lingual transfer gap that arises when a multilingual language model is trained on one language and tested on another. We examine the influence of segment length and study a wide range of noise reduction techniques, including Gaussian Mixture Models, Co-Teaching, Noise Transition Matrices, and Label Smoothing. Results reveal that, while BERT-like models are robust to noise, incorporating noise detection can further enhance performance. For a smaller dataset, Gaussian Mixture Models are particularly helpful for reducing noise and improving prediction quality, especially in cross-lingual transfer. For a larger dataset, however, the inherent regularisation of the PLMs provides a strong baseline that the (fairly expensive) de-noising methods cannot improve upon.
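
To make the GMM-based de-noising concrete: a common instantiation of this idea (popularised by Arazo et al., 2019, and DivideMix) fits a two-component Gaussian Mixture Model to per-example training losses and treats the low-loss component as the clean subset. The sketch below is a hypothetical illustration under that assumption, not the authors' implementation; the function name filter_noisy, the 0.5 threshold, and the use of warm-up-epoch cross-entropy losses as input are all assumptions for the example.

import numpy as np
from sklearn.mixture import GaussianMixture

def filter_noisy(losses: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Return a boolean mask marking examples judged to be clean.

    losses: one cross-entropy loss per training sentence, e.g. recorded
    after a warm-up epoch (an assumption of this sketch).
    """
    x = losses.reshape(-1, 1)
    # Fit a 2-component GMM to the 1-D loss distribution.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
    # The component with the lower mean loss is taken to be "clean".
    clean_component = int(np.argmin(gmm.means_.ravel()))
    p_clean = gmm.predict_proba(x)[:, clean_component]
    return p_clean >= threshold

# Example: keep only the presumed-clean sentences for the next epoch.
losses = np.array([0.10, 0.20, 2.50, 0.15, 3.10])
mask = filter_noisy(losses)
print(mask)  # e.g. [ True  True False  True False]

In a cross-lingual setting one would then retrain the multilingual model on the filtered subset; whether that pays off depends, as the abstract notes, on dataset size.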
Paper Type: Long
Research Area: Sentiment Analysis, Stylistic Analysis, and Argument Mining
Research Area Keywords: Readability, De-noising, Cross-lingual Transfer, Regularization
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency, Data resources, Data analysis
Languages Studied: English, French, Catalan, Spanish, Italian, Russian
Submission Number: 3291