A Study of Optimization Imbalance Across Languages in Multilingual ASR Models

Published: 26 Aug 2025, Last Modified: 26 Aug 2025 · SpeechAI TTIC 2025 Oral or Poster · CC BY 4.0
Keywords: multilingual ASR
TL;DR: We investigate optimization imbalance across languages in massive multilingual ASR models, and try to alleviate the issue with multi-task learning algorithms.
Presentation Preference: No
Abstract: Recent advances in multilingual automatic speech recognition (ASR), with models such as Whisper, MMS, SeamlessM4T, OWSM, and OWLS, have extended the number of supported languages to 100+ and the number of model parameters to 1B+. While scaling up model and data sizes helps alleviate the curse of multilinguality and improves overall performance, several issues linger. In particular, we explore whether massive multilingual ASR models suffer from optimization imbalance across languages, and whether model performance can be further boosted by resolving such imbalance. Importantly, any solution must be efficient and scalable with respect to both the number of model parameters and the number of languages.

The curse of multilinguality has been a long-standing challenge in multilingual natural language and speech processing. Put simply, it states that when model capacity is fixed, supporting more languages can hurt performance in some of them, because languages compete for model capacity. Apart from increasing model capacity, another possible direction is to proactively resolve optimization imbalance between languages. Existing approaches include adding language-specific modules (adapters, MoE) to reduce negative transfer between languages, and applying multi-task learning algorithms that better balance training dynamics across languages. However, training language-specific modules becomes troublesome as the number of languages grows. Grouping "similar" languages can partly resolve the issue, but it raises a new question of how to find optimal groups. This work therefore focuses on the multi-task learning aspect.

From a multi-task learning perspective, multilingual ASR can be framed as learning multiple tasks simultaneously, with each language treated as a separate task. In this view, task imbalance arises when some tasks/languages are severely under-optimized. To address this, a series of optimization algorithms have been proposed: some modify the gradients to resolve gradient conflicts (PCGrad, Gradient Vaccine), while others reweight task losses to balance optimization (FAMO, GEO4Align); illustrative sketches of both families follow the abstract. To the best of our knowledge, few works have tried to apply these methods in the context of multilingual ASR. Moreover, most of them do not scale well, as they introduce significant memory and/or computation overhead as the number of tasks and/or the model size grows. This makes them difficult to apply to multilingual ASR, where 100+ tasks are present and models have 1B+ parameters.

In this project, we:
- Investigate whether language/task imbalance exists in massive multilingual ASR models, by analyzing gradient conflicts and training dynamics (see the measurement sketch below).
- Incorporate multi-task optimization methods into multilingual ASR training, and evaluate their impact on both overall and per-language performance.
- Explore more efficient alternatives if the existing methods turn out to be unsuitable for multilingual ASR.
- Evaluate the scalability of these methods with respect to the number of languages, data size, and model size.
- Explore broader applications: if this direction proves effective, we will consider extending it to multilingual self-supervised training, as well as multilingual multi-task models with ASR and speech translation support.

We hope to present preliminary results for the first two aspects at the time of the workshop, and to seek feedback from the community on the rest of the project.
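The first project item proposes diagnosing imbalance by looking at gradient conflicts. Below is a minimal measurement sketch, assuming a PyTorch model and a per-language loss dictionary (both hypothetical names): it flattens each language's gradient and reports pairwise cosine similarities, where negative values signal conflicting languages.

```python
import torch
import torch.nn.functional as F

def flat_grad(loss, params):
    """Flatten the gradient of `loss` w.r.t. `params` into a single vector."""
    grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
    return torch.cat([
        (g if g is not None else torch.zeros_like(p)).reshape(-1)
        for g, p in zip(grads, params)
    ])

def pairwise_gradient_cosine(model, per_language_losses):
    """per_language_losses: dict mapping language id -> scalar loss tensor."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = {lang: flat_grad(loss, params) for lang, loss in per_language_losses.items()}
    langs = list(grads)
    sims = {}
    for i, a in enumerate(langs):
        for b in langs[i + 1:]:
            # Negative cosine similarity indicates conflicting language gradients.
            sims[(a, b)] = F.cosine_similarity(grads[a], grads[b], dim=0).item()
    return sims
```

In practice one would likely restrict this to a subset of parameters (e.g., a shared encoder block) or to sampled steps, since materializing full per-language gradients for a 1B+-parameter model is memory-heavy, which is exactly the scalability concern raised above.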
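As context for the gradient-surgery family cited above (PCGrad, Gradient Vaccine), here is a simplified, illustrative sketch of the PCGrad projection idea, written under our own simplifications rather than taken from the original implementation: whenever two per-language gradients have a negative inner product, the conflicting component is projected away before the gradients are summed.

```python
import random
import torch

def pcgrad_combine(task_grads):
    """task_grads: list of flattened per-language gradient vectors (1-D tensors)."""
    projected = [g.clone() for g in task_grads]
    for i, g_i in enumerate(projected):
        others = [j for j in range(len(task_grads)) if j != i]
        random.shuffle(others)  # PCGrad projects against the other tasks in random order
        for j in others:
            g_j = task_grads[j]
            dot = torch.dot(g_i, g_j)
            if dot < 0:
                # Conflict: remove the component of g_i along g_j.
                g_i -= dot / (g_j.norm() ** 2 + 1e-12) * g_j
    return torch.stack(projected).sum(dim=0)
```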
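The loss-reweighting family (FAMO, GEO4Align) instead adjusts per-task loss weights during training. The snippet below is not either of those algorithms; it is a deliberately simple, hypothetical normalization that divides each language's loss by a running estimate of its scale, only to illustrate the interface such reweighting methods plug into.

```python
import torch

class EqualizedLossWeighter:
    """Hypothetical per-language loss normalizer (not FAMO or GEO4Align)."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.running = {}  # language id -> running average of that language's loss

    def combine(self, per_language_losses):
        total = 0.0
        for lang, loss in per_language_losses.items():
            value = loss.detach().item()
            prev = self.running.get(lang, value)
            self.running[lang] = self.momentum * prev + (1 - self.momentum) * value
            # Dividing by the running scale keeps any one language from dominating.
            total = total + loss / (self.running[lang] + 1e-8)
        return total / max(len(per_language_losses), 1)
```

The running-average normalization only marks where a method like FAMO would hook into training; the actual algorithms use more principled, adaptive update rules for the weights.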
Submission Number: 33