TL;DR: Scaling laws for multilingual ASR and ST models. Largest model is 18B parameters trained on 360K hours of ASR/ST data.
Abstract: Neural scaling laws offer valuable insights for designing robust sequence processing architectures. While these laws have been extensively characterized in other modalities, their behavior in speech remains comparatively underexplored. In this work, we introduce OWLS, an open-access, reproducible suite of multilingual speech recognition and translation models spanning 0.25B to 18B parameters, with the 18B version being, to the best of our knowledge, the largest speech model to date. OWLS leverages up to 360K hours of public speech data across 150 languages, enabling a systematic investigation into how data, model, and compute scaling each influence performance in multilingual speech tasks. We use OWLS to derive neural scaling laws, showing how final performance can be reliably predicted when scaling. Scaling to larger models improves ASR performance across the board, in both low- and high-resource languages, improving the accessibility of speech technologies. Finally, we show how OWLS can power new research directions by uncovering emergent abilities in large-scale speech models. Model checkpoints will be released at https://huggingface.co/collections/espnet/owls-scaling-laws-for-speech-recognition-and-translation-67ab7f991c194065f057ce8d for future studies.
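Scaling laws of this kind are typically power laws, so a larger model's error can be extrapolated by fitting a line in log-log space to the errors of smaller models. The sketch below illustrates the idea; the model sizes, error rates, and the function name `predict_wer` are illustrative assumptions, not OWLS results.

```python
# Hypothetical sketch of power-law extrapolation, assuming err(N) ≈ a * N**(-b)
# where N is parameter count. All numbers below are made up for illustration.
import numpy as np

# Illustrative data: (model size in billions of parameters, word error rate)
sizes = np.array([0.25, 0.5, 1.0, 2.0, 4.0])
wers = np.array([20.0, 17.0, 14.5, 12.3, 10.5])

# Fit log(wer) = log(a) - b * log(size): linear regression in log-log space
slope, log_a = np.polyfit(np.log(sizes), np.log(wers), 1)
a, b = np.exp(log_a), -slope

def predict_wer(n_params_billions):
    """Extrapolate WER for a larger model under the fitted power law."""
    return a * n_params_billions ** (-b)

# Predict the error of a model larger than any in the fitted set
print(predict_wer(18.0))
```

Fitting in log space turns the power law into ordinary least squares, which is why performance of a large model can be estimated from a handful of cheap small-model runs before committing the full training budget.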
Lay Summary: Speech-to-text is one of the core technologies that allow computers to perceive and understand our physical world, transforming human speech encoded by sound waves into readable text. Development of AI-powered speech-to-text has traditionally focused on efficient models that can fit on small devices, such as smartphones, in contrast to power-hungry LLMs like ChatGPT that require dedicated data centers.
In this paper, we study the effects of scaling speech-to-text to LLM-level compute. We find that as speech-to-text models get larger, both their ability to transcribe and their ability to translate speech improve in a predictable manner. In other words, we can estimate the performance of a larger model from the performance of smaller ones. Furthermore, we find that larger models have "emergent abilities" absent from smaller ones, such as implicitly recognizing different dialects and learning to transcribe new languages on the fly.
We release all of our models for free, to allow researchers to better understand the capabilities of large AI models.
Primary Area: Applications->Language, Speech and Dialog
Keywords: Automatic Speech Recognition, Scaling Laws, Speech Translation, Multilingual
Submission Number: 4108