Tower: An Open Multilingual Large Language Model for Translation-Related Tasks

Duarte Miguel Alves; José Pombal; Nuno M Guerreiro; Pedro Henrique Martins; João Alves; Amin Farajian; Ben Peters; Ricardo Rei; Patrick Fernandes; Sweta Agrawal; Pierre Colombo; José G. C. de Souza; Andre Martins

Published: 10 Jul 2024, Last Modified: 26 Aug 2024, Venue: COLM, License: CC BY 4.0

Research Area: Alignment, Data, LMs for everyone

Keywords: Machine Translation, Multilinguality, Adaptation, Continual Pretraining

TL;DR: We propose a recipe to tailor LLMs for translation workflows.

Abstract: While general-purpose large language models (LLMs) demonstrate proficiency on multiple tasks within the domain of translation, approaches based on open LLMs are competitive only when specializing on a single task. In this paper, we propose a recipe for tailoring LLMs to multiple tasks present in translation workflows. We perform continued pretraining on a multilingual mixture of monolingual and parallel data, creating TowerBase, followed by finetuning on instructions relevant for translation processes, creating TowerInstruct. Our model surpasses open alternatives on several relevant tasks and is competitive with general-purpose closed LLMs. We will release the Tower models, our specialization dataset, an evaluation framework for LLMs focusing on the translation ecosystem, and a collection of model generations on our benchmark.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html

Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html

Submission Number: 372
