The Devil Is in the Word Alignment Details: On Translation-Based Cross-Lingual Transfer for Token Classification Tasks

ACL ARR 2024 December Submission 521 Authors

14 Dec 2024 (modified: 05 Feb 2025) · CC BY 4.0
Abstract: Translation-based strategies for cross-lingual transfer (XLT), such as translate-train (training on noisy target-language data translated from the source language) and translate-test (evaluating on noisy source-language data translated from the target language), are competitive XLT baselines. In XLT for token classification tasks, however, these strategies require label projection: the challenging step of mapping the labels of each token in the original sentence to its counterpart(s) in the translation. Although word aligners (WAs) are commonly used for label projection, their low-level design decisions have not been systematically investigated in translation-based XLT. Moreover, recent marker-based methods, which project labels by inserting tags around labeled spans before translation and recovering the spans from the translated text, claim to outperform WAs in label projection for XLT. In this work, we revisit WAs for label projection, systematically investigating the effects that low-level design decisions have on token-level XLT, namely: (i) the algorithm for projecting labels between (multi-)token spans, (ii) the filtering strategy for reducing the proportion of noisy data, and (iii) the pre-tokenization of the translated sentence. We find that all of these have a substantial impact on downstream XLT performance and show that, with optimal choices, WAs offer XLT performance comparable to that of marker-based methods. We then introduce a new projection strategy that ensembles translate-train and translate-test predictions and show that it substantially outperforms marker-based projection. Crucially, this ensembling also reduces sensitivity to low-level WA design choices, resulting in more robust XLT for token classification tasks.
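
To make the two label-projection families the abstract contrasts concrete, here is a minimal, hypothetical Python sketch (not the authors' implementation): marker-based projection wraps each labeled span in numbered tags before translation and reads the spans back off afterward, while WA-based projection maps a source span onto the target tokens its words align to. The `translate` stub and the span/alignment formats are illustrative assumptions.

```python
import re

def translate(text: str) -> str:
    """Hypothetical stand-in for a sentence-level MT system."""
    raise NotImplementedError("plug in an MT system here")

# --- Marker-based projection -------------------------------------------
def insert_markers(tokens, spans):
    """Wrap each (start, end, label) span in numbered tags, e.g.
    ['Barack','Obama','visited','Berlin'] with span (0, 2, 'PER')
    becomes '[0] Barack Obama [/0] visited Berlin'."""
    opens = {s: i for i, (s, _, _) in enumerate(spans)}
    closes = {e: i for i, (_, e, _) in enumerate(spans)}
    out = []
    for pos, tok in enumerate(tokens):
        if pos in opens:
            out.append(f"[{opens[pos]}]")
        out.append(tok)
        if pos + 1 in closes:
            out.append(f"[/{closes[pos + 1]}]")
    return " ".join(out)

def extract_spans(marked_translation, spans):
    """Read the projected spans back off the translated, marker-bearing
    text; markers the MT system dropped yield no projection."""
    projected = []
    for i, (_, _, label) in enumerate(spans):
        m = re.search(rf"\[{i}\]\s*(.*?)\s*\[/{i}\]", marked_translation)
        if m:
            projected.append((m.group(1), label))
    return projected

# --- Word-alignment-based projection -----------------------------------
def project_span_via_alignment(span, alignment):
    """Project a source (start, end, label) span onto target tokens via
    word-alignment pairs {(src_idx, tgt_idx), ...}: one common heuristic
    takes the min/max target index aligned to any token in the span."""
    start, end, label = span
    tgt = sorted(t for s, t in alignment if start <= s < end)
    if not tgt:
        return None  # unaligned span: a natural candidate for filtering
    return (tgt[0], tgt[-1] + 1, label)
```

Real pipelines must additionally handle markers that the MT system reorders or corrupts, non-contiguous alignments, and re-tokenization of the translated sentence; these are exactly the low-level design choices the paper investigates.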
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: cross-lingual transfer, less-resourced languages
Contribution Types: NLP engineering experiment
Languages Studied: Bambara, Ewé, Fon, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Mossi, Chichewa, chiShona, Kiswahili, Setswana, Akan/Twi, Wolof, isiXhosa, Yorùbá, isiZulu, Arabic, Danish, German, South-Tyrolean, Indonesian, Italian, Kazakh, Dutch, Serbian, Turkish, Chinese, Bengali, Finnish, Korean, Russian, Swahili, Telugu
Submission Number: 521