Abstract: Recently, many Transformer-based models have been applied to end-to-end speech translation because of their capability to model global dependencies. Position embedding is crucial in Transformer models, as it enables the modeling of dependencies between elements at different positions in the input sequence. However, the position embedding methods commonly employed in speech translation, such as absolute and relative position embedding, either struggle to exploit relative positional information or add computational burden to the model. In this letter, we introduce a novel approach that incorporates rotary position embedding into Transformer-based speech translation (RoPE-ST). RoPE-ST first injects absolute position information by multiplying the input vectors with rotation matrices, and then captures relative positions through the dot product in the self-attention mechanism. The main advantage of the proposed method over the original one is that rotary position embedding combines the benefits of absolute and relative position embedding, making it well suited to speech translation tasks. We conduct experiments on the multilingual speech translation corpus MuST-C. Results show that RoPE-ST achieves an average improvement of 2.91 BLEU over the baseline without rotary position embedding across eight translation directions.
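To make the mechanism described in the abstract concrete, the following is a minimal NumPy sketch of rotary position embedding in its general form (Su et al.'s RoPE), not the authors' exact RoPE-ST implementation: each pair of features is rotated by a position-dependent angle, and the subsequent query-key dot product then depends only on the relative offset between positions. Function names and dimensions here are illustrative assumptions.

```
import numpy as np

def rotary_embed(x, base=10000.0):
    """Apply rotary position embedding to a sequence of vectors.

    x: array of shape (seq_len, dim), with dim even.
    Feature pair (2i, 2i+1) at position m is rotated by the angle
    m * theta_i, where theta_i = base ** (-2i / dim).
    """
    seq_len, dim = x.shape
    assert dim % 2 == 0, "feature dimension must be even"
    # Per-pair rotation frequencies theta_i
    theta = base ** (-np.arange(0, dim, 2) / dim)         # (dim/2,)
    # Rotation angle for every (position, pair) combination
    angles = np.outer(np.arange(seq_len), theta)          # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    # Standard 2-D rotation applied to each feature pair
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# The attention score between a rotated query at position m and a
# rotated key at position n depends only on the offset n - m, which
# is how absolute rotations yield relative position information.
rng = np.random.default_rng(0)
q_vec = rng.normal(size=64)
k_vec = rng.normal(size=64)

def score(m, n):
    # Place the same query/key content at the requested positions
    Q = rotary_embed(np.tile(q_vec, (max(m, n) + 1, 1)))
    K = rotary_embed(np.tile(k_vec, (max(m, n) + 1, 1)))
    return Q[m] @ K[n]

print(np.isclose(score(2, 5), score(10, 13)))  # True: offset is 3 in both cases
```

In a Transformer encoder this rotation would be applied to the query and key projections inside each self-attention layer, so the softmax attention weights inherit the relative-position property without any extra learned parameters.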