Abstract: State-of-the-art approaches to Query-by-Example speech search are usually based on acoustic word embeddings (AWEs), which represent variable-length speech segments as fixed-dimensional vectors learned via metric learning. In this paper, we propose a novel AWE network based on Mamba, a recent state space model architecture designed for information-dense data. To further mitigate the mismatch between the training and testing phases of AWE-based models, we also propose a novel audio segment padding method for training, Random Offset Mixed Padding. Comparisons with state-of-the-art methods show that our approach, the Mamba-based network combined with Random Offset Mixed Padding, achieves the best performance with significantly fewer parameters. We also demonstrate that the proposed padding method enhances the performance of AWE-based baseline models.