Abstract: The Whisper model, although advanced in speech recognition, suffers from slow inference and high memory usage. This study addresses these issues and improves the efficiency of speech recognition without compromising accuracy. To achieve this, we combine a Low-Rank adaptation and Tucker decomposition strategy. By specializing the adaptation to the characteristics of different languages, the model adapts better to multilingual environments, and we significantly improve its trade-off between accuracy and model size. Experimental results demonstrate that our approach achieves fast and accurate speech transcription, making significant progress in multilingual settings and substantially reducing the Word Error Rate, thereby providing an innovative solution for practical applications in multilingual speech processing.
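To make the combined strategy concrete, the sketch below shows one way a Low-Rank (LoRA-style) adapter and a Tucker-style factorization could be joined on a single linear projection of a Whisper-sized layer: per-language adapter updates share two low-rank factors, with only a small per-language core. This is a minimal illustration under our own assumptions; the class name `TuckerLoRALinear`, the ranks `r1`/`r2`, and the `lang_id` argument are hypothetical and do not reproduce the paper's exact factorization.

```python
# Minimal sketch (not the authors' exact method): a LoRA-style adapter whose
# per-language updates share Tucker-style factors. All names are hypothetical.
import torch
import torch.nn as nn


class TuckerLoRALinear(nn.Module):
    """Wraps a frozen Linear layer with low-rank, language-conditioned updates.

    The update for language l is  delta_W_l = U @ G_l @ V^T, where
    U (d_out x r1) and V (d_in x r2) are shared across languages and
    G_l (r1 x r2) is a small per-language core -- i.e. a Tucker-2-style
    factorization of the stacked adapter tensor (num_langs, d_out, d_in).
    """

    def __init__(self, base: nn.Linear, num_langs: int, r1: int = 8, r2: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # keep the pretrained weights frozen
            p.requires_grad = False
        d_out, d_in = base.out_features, base.in_features
        self.U = nn.Parameter(torch.randn(d_out, r1) * 0.01)      # shared output factor
        self.V = nn.Parameter(torch.randn(d_in, r2) * 0.01)       # shared input factor
        self.core = nn.Parameter(torch.zeros(num_langs, r1, r2))  # per-language cores

    def forward(self, x: torch.Tensor, lang_id: int) -> torch.Tensor:
        # Low-rank path: (.. x d_in) @ V -> (.. x r2) @ G_l^T -> (.. x r1) @ U^T -> (.. x d_out)
        delta = (x @ self.V) @ self.core[lang_id].T @ self.U.T
        return self.base(x) + delta


# Usage: adapt one 768-dim projection for, say, 10 languages.
proj = nn.Linear(768, 768)
adapted = TuckerLoRALinear(proj, num_langs=10, r1=8, r2=8)
out = adapted(torch.randn(2, 100, 768), lang_id=3)
print(out.shape)  # torch.Size([2, 100, 768])
```

With shared factors of rank 8, each additional language costs only an 8x8 core (64 parameters per adapted layer) rather than a full low-rank pair, which is the kind of accuracy-versus-size trade-off the abstract refers to.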