Abstract: Highlights
• We introduce a receptive-field attention convolutional module into the front-end feature extractor of the AED model to capture the location and spatial information of the speech signal, which enhances the model's feature extraction capability.
• We propose a block integration approach at the encoder stage of the AED model that improves the accuracy of the encoder output features by merging shallow and deep features through pyramid squeeze attention.
• Experiments on Aishell-1 verify the effectiveness of our multi-stage approach across various end-to-end speech recognition models and show that it achieves lower word error rates without a language model.