Abstract: Silent speech interfaces (SSIs) enable the generation of audible speech or readable text without vocalization. Electromyography (EMG), one of the possible source signals for SSIs, is particularly advantageous for individuals with vocal-organ injuries. In this work, we propose a self-pretraining framework, emg2vec, for EMG-based SSI, covering both EMG-to-speech and EMG-to-text conversion. Our experiments show that self-pretraining yields improvements over plain supervised learning: compared to training the models from scratch, it reduces the downstream speech recognition word error rate (WER) by a relative 7.32% when utilizing the entire labeled dataset and by 5.18% when using only a 20% fraction of the labeled data for supervised training. A similar improvement is observed for speech synthesis, though only by 2.91% when using 20% of the training data.