Keywords: Machine Translation, Long Context, Multi-Paradigm Translation Dataset Curation, Instruction Tuning
TL;DR: We show that fine-tuning LLMs with multi-paradigm instructions from our curated DocBlocks dataset significantly improves document-level translation, outperforming prompting and agent-based methods while preserving sentence-level performance.
Abstract: Large language models (LLMs) have demonstrated strong performance in sentence-level machine translation, but scaling to document-level translation remains challenging, particularly in modeling long-range dependencies and discourse phenomena across sentences and paragraphs.
In this work, we propose a method to improve LLM-based long-document translation through targeted fine-tuning on high-quality document-level data, which we curate and introduce as DocBlocks.
Our approach supports multiple translation paradigms, including direct document-to-document and chunk-level translation, by combining instructions that include surrounding document context with instructions that do not. This enables models to better capture cross-sentence dependencies while preserving strong sentence-level translation performance.
Experimental results show that incorporating multiple translation paradigms improves document-level translation quality and inference speed compared to prompting and agent-based methods.
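To make the multi-paradigm setup concrete, the sketch below shows one way such instruction examples could be assembled from a sentence-aligned source/target document pair. It is a minimal illustration, not the released DocBlocks pipeline: the function name, prompt wording, field names, and chunk size are all assumptions for exposition.

```python
# Minimal sketch (not the authors' released code): building multi-paradigm
# instruction-tuning examples from one sentence-aligned document pair.
# Assumes src_sents and tgt_sents are equal-length lists of aligned sentences.

def build_examples(src_sents, tgt_sents, src_lang="German", tgt_lang="English",
                   chunk_size=4):
    """Yield examples in three paradigms:
    (1) whole document -> whole document,
    (2) chunk -> chunk conditioned on preceding source context,
    (3) chunk -> chunk with no surrounding context."""
    examples = []

    # Paradigm 1: direct document-to-document translation.
    examples.append({
        "instruction": f"Translate the following document from {src_lang} to {tgt_lang}.",
        "input": "\n".join(src_sents),
        "output": "\n".join(tgt_sents),
    })

    for start in range(0, len(src_sents), chunk_size):
        src_chunk = "\n".join(src_sents[start:start + chunk_size])
        tgt_chunk = "\n".join(tgt_sents[start:start + chunk_size])

        # Paradigm 2: chunk translation with the preceding source context.
        context = "\n".join(src_sents[max(0, start - chunk_size):start])
        examples.append({
            "instruction": f"Given the preceding context, translate the passage from {src_lang} to {tgt_lang}.",
            "input": f"Context:\n{context}\n\nPassage:\n{src_chunk}",
            "output": tgt_chunk,
        })

        # Paradigm 3: context-free chunk translation, which helps preserve
        # sentence-level translation ability.
        examples.append({
            "instruction": f"Translate the following text from {src_lang} to {tgt_lang}.",
            "input": src_chunk,
            "output": tgt_chunk,
        })

    return examples
```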
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 1159