Objective Soups: Multilingual Multi-Task Modeling for Speech Processing

A F M Saif; Lisha Chen; Xiaodong Cui; Songtao Lu; Brian Kingsbury; Tianyi Chen

Published: 18 Sept 2025, Last Modified: 29 Oct 2025, Venue: NeurIPS 2025 poster, License: CC BY 4.0

Keywords: Multilingual speech recognition, automatic speech translation, multi-objective optimization, conflict-aware training

TL;DR: This paper proposes conflict-aware multi-objective training strategies for multilingual automatic speech recognition and speech translation by selectively aligning gradients from conflicting layers to enhance efficiency and performance.

Abstract: The need for training multilingual multi-task speech processing (MSP) models that perform both automatic speech recognition and speech-to-text translation is increasingly evident. However, a significant challenge arises from the conflicts among multiple objectives when using a single model. Multi-objective optimization can address this challenge by facilitating the optimization of multiple conflicting objectives and aligning the gradient updates in a common descent direction. While multi-objective optimization helps avoid conflicting gradient updates, a critical issue is that when there are many objectives, such as in MSP, it is often difficult to find a common descent direction. This leads to an important question: Is it more effective to separate highly conflicting objectives into different optimization levels or to keep them in a single level? To address this question, this paper investigates three multi-objective MSP formulations, which we refer to as objective soup recipes. These formulations apply multi-objective optimization at different optimization levels to mitigate potential conflicts among all objectives. To keep computation and memory overhead low, we incorporate a lightweight layer-selection strategy that detects the most conflicting layers and uses only their gradients when computing the conflict-avoidance direction. We conduct an extensive investigation using the CoVoST v2 dataset for combined multilingual ASR and ST tasks, along with the LibriSpeech and AISHELL-1 datasets for multilingual ASR, to identify highly conflicting objectives and determine the most effective training recipe among the three proposed multi-objective optimization algorithms.
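The layer-selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not say how conflict is measured or how the direction is computed, so this sketch assumes two objectives, uses per-layer cosine similarity as the conflict score, and uses the standard two-objective min-norm (MGDA-style) weighting as the conflict-avoidance step. All function names and the dictionary layout of the gradients are hypothetical.

```python
import numpy as np

def layer_cosines(g1, g2):
    """Per-layer cosine similarity between two objectives' gradients.

    g1, g2: dicts mapping a (hypothetical) layer name to a gradient array.
    """
    cos = {}
    for name in g1:
        a, b = g1[name].ravel(), g2[name].ravel()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        cos[name] = float(a @ b / denom) if denom > 0 else 0.0
    return cos

def select_conflicting_layers(g1, g2, k):
    """Return the k layers with the lowest cosine (assumed conflict score)."""
    cos = layer_cosines(g1, g2)
    return sorted(cos, key=cos.get)[:k]

def min_norm_weight(a, b):
    """gamma in [0, 1] minimizing ||gamma * a + (1 - gamma) * b||."""
    diff = a - b
    denom = float(diff @ diff)
    if denom == 0.0:
        return 0.5  # identical gradients: any weight gives the same vector
    return float(np.clip((b - a) @ b / denom, 0.0, 1.0))

def common_direction(g1, g2, k):
    """Compute the weight from the selected layers only, then apply it
    to every layer (an assumption about how the selection is used)."""
    layers = select_conflicting_layers(g1, g2, k)
    a = np.concatenate([g1[n].ravel() for n in layers])
    b = np.concatenate([g2[n].ravel() for n in layers])
    gamma = min_norm_weight(a, b)
    direction = {n: gamma * g1[n] + (1.0 - gamma) * g2[n] for n in g1}
    return direction, layers, gamma
```

Because the weight is solved on the selected layers alone, the cost of the conflict-avoidance step scales with the size of those layers rather than with the full model, which is the kind of saving the abstract describes. With more than two objectives, the closed-form weight would have to be replaced by a small quadratic program over the simplex.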

Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)

Submission Number: 19145
