Abstract: Although recent neural speech codecs have achieved high-fidelity speech reconstruction, limitations remain: (1) the symmetric encoder-decoder architecture results in low computational efficiency; (2) downsampling layers in encoders usually cause sampling loss. In this paper, we propose a dual-scale spectra-fusion-based asymmetric neural speech codec named DSF-ACodec. It employs a one-branch encoder to encode high-resolution amplitude and phase spectra, while a powerful two-branch decoder reconstructs the high-resolution spectra in parallel. The decoded speech is then generated through the inverse short-time Fourier transform (ISTFT). Such an asymmetric architecture reduces the encoder parameters, effectively improving computational efficiency. Furthermore, we introduce a Spectra-based Skip Connection Module (SSCM), which fuses low-resolution amplitude and phase spectra with encoded high-resolution spectral features, successfully mitigating sampling loss. Experimental results demonstrate that DSF-ACodec achieves higher speech reconstruction quality than the baseline model, APCodec, while reducing the encoder parameters by approximately 28.6%.
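The abstract's final synthesis step, generating the waveform from predicted amplitude and phase spectra via the ISTFT, can be illustrated with a minimal sketch. This is a generic amplitude/phase round trip using SciPy, not the paper's model or its STFT configuration; the sample rate, window length, and test signal are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

# Toy signal: 1 s sine at 440 Hz, 16 kHz sample rate (illustrative values,
# not the paper's configuration).
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)

# Analysis: split the complex STFT into amplitude and phase spectra --
# the two streams a two-branch decoder would predict in parallel.
_, _, Z = stft(x, fs=fs, nperseg=512)
amplitude = np.abs(Z)
phase = np.angle(Z)

# Synthesis: recombine the two spectra into a complex spectrogram and
# invert with the ISTFT to recover the time-domain waveform.
Z_rec = amplitude * np.exp(1j * phase)
_, x_rec = istft(Z_rec, fs=fs, nperseg=512)

# With exact spectra the round trip is numerically lossless; in a codec,
# reconstruction quality depends on how well the decoder predicts both spectra.
err = np.max(np.abs(x - x_rec[: len(x)]))
```

In a neural codec the amplitude and phase would come from the decoder rather than from analyzing the input, but the inversion step is the same, which is what makes an ISTFT head cheaper than upsampling a waveform sample by sample.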
External IDs: dblp:conf/icdsp/LiBJ25