Exploration of Serialization Orders in Multi-Talker ASR

Published: 28 Apr 2026, Last Modified: 28 Apr 2026, MSLD 2026 Poster, CC BY 4.0
Keywords: speech recognition, multi-talker, serialized output training
Abstract: Serialized output training provides a foundational framework for multi-talker automatic speech recognition and can be extended to speaker diarization using timestamp tokens. While the serialization policy is usually a chronological first-in-first-out order, we explore alternative sorting rules, such as ordering speakers by their total speaking duration or average energy. We systematically study the learnability of these policies and their impact on recognition and diarization quality. Experiments show that energy-based policies yield strong error rates but are harder for the model to follow, whereas the duration policy is easier to learn. We also show that these policies effectively identify a policy-defined primary speaker, maintaining a word error rate for this target speaker that is competitive with the baseline. Finally, we demonstrate that multi-task training enables a single model to learn these diverse serialization policies together.
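The three orderings named in the abstract can be illustrated with a small sketch. This is not the authors' implementation. The `Segment` fields, the function names, the descending sort direction for duration and energy, the duration-weighted energy average, the tie-break on first start time, and the `<sc>` speaker-change separator are all assumptions made for illustration. Timestamp tokens are left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    start: float   # seconds
    end: float     # seconds, assumed > start
    energy: float  # hypothetical mean energy of this segment
    text: str


def speaker_order(segments, policy):
    """Return speaker labels sorted under one serialization policy (illustrative)."""
    first_start = {}
    duration = defaultdict(float)
    weighted_energy = defaultdict(float)
    for seg in segments:
        length = seg.end - seg.start
        first_start[seg.speaker] = min(first_start.get(seg.speaker, seg.start), seg.start)
        duration[seg.speaker] += length
        weighted_energy[seg.speaker] += seg.energy * length

    if policy == "fifo":
        # Chronological: the speaker who starts first comes first.
        key = lambda s: first_start[s]
    elif policy == "duration":
        # Assumption: longest total speaking time first.
        key = lambda s: (-duration[s], first_start[s])
    elif policy == "energy":
        # Assumption: highest duration-weighted average energy first.
        key = lambda s: (-weighted_energy[s] / duration[s], first_start[s])
    else:
        raise ValueError(f"unknown policy: {policy}")
    return sorted(first_start, key=key)


def serialize(segments, policy, sep=" <sc> "):
    """Concatenate each speaker's words in policy order, joined by a separator."""
    by_speaker = defaultdict(list)
    for seg in sorted(segments, key=lambda s: s.start):
        by_speaker[seg.speaker].append(seg.text)
    return sep.join(" ".join(by_speaker[s]) for s in speaker_order(segments, policy))


# Toy mixture with three speakers (made-up values).
segments = [
    Segment("A", 0.0, 1.0, 0.2, "hello there"),
    Segment("B", 0.5, 3.5, 0.5, "good morning everyone"),
    Segment("C", 2.0, 2.5, 0.9, "hi"),
    Segment("A", 4.0, 5.0, 0.4, "how are you"),
]

for policy in ("fifo", "duration", "energy"):
    print(policy, "->", serialize(segments, policy))
```

In this toy mixture each policy puts a different speaker first, which is the sense in which a policy defines its own primary speaker.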
Submission Number: 90