Improved AED with multi-stage feature extraction and fusion based on RFAConv and PSA

Bingbing Wang, Yangjie Wei, Zhuangzhuang Wang, Zekang Qi

Published: 2025, Last Modified: 14 Jun 2025, Speech Commun. 2025, CC BY-SA 4.0

Abstract: Highlights
• We introduce a receptive-field attention convolution (RFAConv) module into the front-end feature extractor of the AED model to capture the location and spatial information of the speech signal, which strengthens the model's feature extraction capability.
• We propose a block integration approach at the encoder stage of the AED model that merges shallow and deep features through pyramid squeeze attention (PSA), improving the accuracy of the encoder output features.
• In experiments on Aishell-1, we verify the effectiveness of the multi-stage approach with various end-to-end speech recognition models; the results show that it achieves lower word error rates without a language model.
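The word error rate cited as the evaluation metric can be illustrated with a minimal sketch. It assumes the standard definition (token-level edit distance divided by reference length); the function names `edit_distance` and `error_rate` are hypothetical, and this is not the authors' scoring script. For Mandarin corpora such as Aishell-1 the tokens are often individual characters rather than space-separated words, so the sketch takes pre-tokenized sequences and leaves that choice to the caller.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: ref token missing from hyp
                curr[j - 1] + 1,     # insertion: extra token in hyp
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """(substitutions + deletions + insertions) / number of reference tokens."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


if __name__ == "__main__":
    # Word-level scoring of a made-up English example.
    print(error_rate("the cat sat".split(), "the hat sat".split()))
    # Character-level scoring: pass a list of characters instead.
    print(error_rate(list("abcd"), list("abd")))
```

Because insertions are counted, the rate can exceed 1.0 when the hypothesis is much longer than the reference.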