Overcoming Non-monotonicity in Transducer-based Streaming Generation

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Streaming generation models are used across many fields, with the Transducer architecture being particularly popular in industrial applications. However, its input-synchronous decoding mechanism presents challenges in tasks requiring non-monotonic alignments, such as simultaneous translation. In this work, we address this issue by integrating Transducer decoding with the history of the input stream via learnable monotonic attention. Our approach leverages the forward-backward algorithm to infer the posterior probability of alignments between the predictor states and input timestamps, which is then used to estimate monotonic context representations, thereby avoiding the need to enumerate the exponentially large alignment space during training. Extensive experiments show that our MonoAttn-Transducer effectively handles non-monotonic alignments in streaming scenarios, offering a robust solution for complex generation tasks. Code is available at https://github.com/ictnlp/MonoAttn-Transducer.
Lay Summary: Real-time speech recognition and translation systems face a critical challenge: they must start generating translations or transcriptions before the speaker finishes talking. Existing methods typically either use attention-based models, which require external policies to balance quality and latency, or Transducer models, which efficiently synchronize input and output but struggle when translations involve reordering words. Our research addresses the limitation of Transducer models by introducing a new method that combines them with a dynamic attention mechanism, allowing the model to better handle sentences where words don't align in the same order across languages. Specifically, we developed an efficient training algorithm that enables the model to attend to previously spoken words without needing to consider an overwhelming number of alignment possibilities. By applying our approach to simultaneous translation tasks, we demonstrated that it significantly improves the quality of real-time translations without sacrificing speed. This advancement is particularly beneficial when dealing with languages or speech scenarios that require more flexible word ordering, making real-time communication smoother and more accurate.
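The key idea above, inferring alignment posteriors with the forward-backward algorithm and using them as attention weights over past input states, can be sketched in a few lines. This is an illustrative toy, not the paper's exact lattice or loss: it assumes a simple monotonic model where the alignment position at each output step either stays put or advances by one, and the function names (`alignment_posteriors`, `expected_context`) are made up for this sketch.

```python
import numpy as np

def alignment_posteriors(log_e):
    """Forward-backward posteriors over a toy monotonic alignment lattice.

    log_e[u, t]: log-score that output step u aligns to input timestep t.
    Transitions: from position t at step u-1 the alignment may stay at t
    or advance to t+1 at step u (an assumption for this sketch, not the
    paper's exact formulation). Returns P[u, t] = p(alignment = t | step u).
    """
    U, T = log_e.shape
    NEG = -1e30  # stand-in for log(0)

    # Forward pass: alpha[u, t] sums over all monotone prefixes ending at t.
    alpha = np.full((U, T), NEG)
    alpha[0, 0] = log_e[0, 0]  # assume the alignment starts at t = 0
    for u in range(1, U):
        for t in range(T):
            stay = alpha[u - 1, t]
            move = alpha[u - 1, t - 1] if t > 0 else NEG
            alpha[u, t] = np.logaddexp(stay, move) + log_e[u, t]

    # Backward pass: beta[u, t] sums over all monotone suffixes from t.
    beta = np.full((U, T), NEG)
    beta[U - 1, :] = 0.0
    for u in range(U - 2, -1, -1):
        for t in range(T):
            stay = beta[u + 1, t] + log_e[u + 1, t]
            move = beta[u + 1, t + 1] + log_e[u + 1, t + 1] if t + 1 < T else NEG
            beta[u, t] = np.logaddexp(stay, move)

    # Posterior over positions at each step, normalized per row in log space.
    log_post = alpha + beta
    m = log_post.max(axis=1, keepdims=True)
    log_post -= m + np.log(np.exp(log_post - m).sum(axis=1, keepdims=True))
    return np.exp(log_post)

def expected_context(log_e, h):
    """Monotonic context c_u = sum_t p(t | u) h_t, with no path enumeration."""
    return alignment_posteriors(log_e) @ h
```

Because every alignment path in this lattice is monotone, the expected alignment position is non-decreasing across output steps, which is what lets the posterior-weighted context stand in for hard monotonic attention during training.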
Link To Code: https://github.com/ictnlp/MonoAttn-Transducer
Primary Area: Applications->Language, Speech and Dialog
Keywords: streaming generation, simultaneous translation, Transducer
Submission Number: 9361