Abstract: Although proper handling of discourse phenomena contributes significantly to the quality of machine translation (MT), improvements on these phenomena are not adequately measured by common translation quality metrics. Recent work in context-aware MT attempts to target a small set of these phenomena during evaluation. In this paper, we propose a methodology to systematically identify translations that require context, and we use it both to confirm the difficulty of previously studied phenomena and to uncover new ones that have not been addressed in prior work. We then develop the \textbf{Mu}ltilingual \textbf{D}iscourse-\textbf{A}ware (MuDA) benchmark, a series of taggers for these phenomena in 14 different language pairs, which we use to evaluate context-aware MT. We find that commonly studied context-aware MT models make only marginal improvements over context-agnostic models, suggesting that these models do not handle these ambiguities effectively. We will release code and data to encourage the MT research community to increase efforts on discourse phenomena and on languages that are currently overlooked.
Paper Type: long