TL;DR: Introducing Sortformer, a novel model that resolves speaker permutations in multi-speaker ASR using Sort Loss and arrival-time sorting, enabling seamless integration of speaker recognition into speech-to-text systems.
Abstract: Sortformer is an encoder-based speaker diarization model designed for supervising speaker tagging in speech-to-text models. Instead of relying solely on permutation invariant loss (PIL), Sortformer introduces Sort Loss to resolve the permutation problem, either independently or in tandem with PIL. In addition, we propose a streamlined multi-speaker speech-to-text architecture that leverages Sortformer for speaker supervision, embedding speaker labels into the encoder using sinusoidal kernel functions. This design addresses the speaker permutation problem through sorted objectives, effectively bridging timestamps and tokens to supervise speaker labels in the output transcriptions. Experiments demonstrate that Sort Loss can boost speaker diarization performance, and incorporating the speaker supervision from Sortformer improves multi-speaker transcription accuracy. We anticipate that the proposed Sortformer and multi-speaker architecture will enable the seamless integration of speaker tagging capabilities into foundational speech-to-text systems and multimodal large language models (LLMs), offering an easily adoptable and user-friendly mechanism to enhance their versatility and performance in speaker-aware tasks. The code and trained models are made publicly available through the NVIDIA NeMo Framework.
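To make the core idea concrete, here is a minimal PyTorch sketch of the arrival-time sorting that Sort Loss is built on: target speaker-activity rows are reordered by each speaker's first active frame, after which a plain binary cross-entropy suffices and no permutation search is needed. Function names, shapes, and the loss wiring below are illustrative assumptions, not the NeMo implementation.

```python
import torch
import torch.nn.functional as F

def sort_by_arrival_time(targets: torch.Tensor) -> torch.Tensor:
    """Reorder the rows of a (num_speakers, num_frames) 0/1 activity matrix
    so speakers appear in order of their first active frame; silent speakers
    are pushed to the end. Illustrative only."""
    num_speakers, num_frames = targets.shape
    first_active = torch.full((num_speakers,), num_frames, dtype=torch.long)
    for s in range(num_speakers):
        active = torch.nonzero(targets[s], as_tuple=False)
        if active.numel() > 0:
            first_active[s] = active[0, 0]
    order = torch.argsort(first_active)
    return targets[order]

def sort_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy against arrival-time-sorted targets: both sides
    share the same canonical (first-arrival) speaker order, so no
    permutation-invariant matching is required."""
    sorted_targets = sort_by_arrival_time(targets)
    return F.binary_cross_entropy_with_logits(logits, sorted_targets.float())

# Toy usage: speaker 1 (row index 1) speaks first, so its row is sorted
# to position 0 before the loss is computed.
targets = torch.tensor([[0, 0, 1, 1],
                        [1, 1, 0, 0]])
logits = torch.zeros(2, 4)
loss = sort_loss(logits, targets)
```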
Lay Summary: Transcribing multiple speakers with speaker tags requires deciding which predicted speaker corresponds to which real speaker. This issue, known as the "permutation problem," typically demands a complicated system to match the model's predicted speakers to the actual ones. To address this, we developed Sortformer, a Transformer-encoder-based model that sorts speech segments by the order in which speakers first appear, effectively resolving the permutation problem. Sortformer introduces "Sort Loss," which trains the model to order speech segments by arrival time.
Our approach integrates seamlessly into existing speech recognition systems, requiring minimal adjustments while significantly improving their accuracy. Sortformer and arrival-time-ordered multi-speaker transcription make multi-speaker ASR models much easier to train, since they use the same training framework as monaural ASR models, without specialized permutation-oriented loss calculations. By reducing complexity, Sortformer helps speech-to-text technologies become more robust and user-friendly. Ultimately, this work allows everyday applications, such as virtual meeting transcriptions or smart assistants, to better understand group conversations, paving the way for clearer communication and richer interaction experiences.
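A hedged sketch of the "same training framework as monaural ASR" point: if speaker tokens are serialized into the transcript in arrival order, training reduces to the ordinary token-level cross-entropy used for single-speaker ASR. The `<spk0>`/`<spk1>` tokens and the toy vocabulary below are illustrative assumptions, not the paper's exact tokenization.

```python
import torch
import torch.nn.functional as F

# Hypothetical serialized target: speaker tokens appear in arrival order,
# so speaker identity is supervised by the same next-token objective as
# the words themselves; no permutation search is needed.
transcript = "<spk0> hello there <spk1> hi <spk0> how are you"
vocab = {tok: i for i, tok in enumerate(sorted(set(transcript.split())))}
target_ids = torch.tensor([vocab[tok] for tok in transcript.split()])

logits = torch.randn(len(target_ids), len(vocab))  # stand-in decoder output
loss = F.cross_entropy(logits, target_ids)         # standard ASR objective
```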
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://huggingface.co/nvidia/diar_sortformer_4spk-v1
Primary Area: Applications->Language, Speech and Dialog
Keywords: Automatic Speech Recognition, Speech to Text, Speaker Diarization, Multi-speaker ASR, Multi-talker ASR
Submission Number: 12615