Abstract: Minimum Bayes Risk (MBR) decoding can significantly improve the translation performance of Multilingual Large Language Models (MLLMs). However, MBR decoding is computationally expensive. We show how the recently developed Reinforcement Learning technique, Direct Preference Optimization (DPO), can fine-tune MLLMs to obtain the gains of MBR without any additional computation at inference time. Our method uses only a small monolingual fine-tuning set and yields significantly improved performance on multiple NMT test sets compared to MLLMs without DPO.
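As background, MBR decoding selects, from a set of sampled hypotheses, the candidate with the highest expected utility against the other hypotheses, and DPO fine-tunes a model on preference pairs without an explicit reward model. A minimal sketch of the standard formulations follows; the notation ($\mathcal{H}$, $u$, $\beta$, $\pi_{\mathrm{ref}}$, $y_w$, $y_l$) is the usual one from the MBR and DPO literature, not necessarily this paper's.

% Standard MBR decision rule over a sampled hypothesis set H,
% with utility u (e.g., BLEU or a neural metric):
$$\hat{y}_{\mathrm{MBR}} \;=\; \operatorname*{arg\,max}_{y \in \mathcal{H}} \; \frac{1}{|\mathcal{H}|} \sum_{y' \in \mathcal{H}} u(y, y')$$

% Standard DPO objective (Rafailov et al., 2023), where y_w is the
% preferred and y_l the dispreferred response for source x:
$$\mathcal{L}_{\mathrm{DPO}} \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l)} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]$$

One natural construction consistent with the abstract is to rank sampled translations by MBR utility and take high- and low-ranked candidates as $y_w$ and $y_l$; the paper's exact pair-construction recipe may differ.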