Abstract: End-to-end neural speech coding achieves state-of-the-art performance by using residual vector quantization. However, quantizing the latent variables with as few bits as possible remains a challenge. In this paper, we propose SRCodec, a neural speech codec that combines a fully convolutional encoder/decoder network with a newly proposed split-residual vector quantization. Specifically, it divides the latent representation into two parts with the same dimensions. We use two different quantizers to quantize the low-dimensional features and the residual between the low- and high-dimensional features. In addition, we propose a dual attention module in split-residual vector quantization to improve information sharing along both dimensions. Both subjective and objective evaluations demonstrate that our proposed method achieves a higher quality of reconstructed speech at 0.95 kbps than Lyra-v1 at 3 kbps and Encodec at 3 kbps.
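The split-residual idea can be illustrated with a minimal sketch for a single latent frame. It is not the SRCodec implementation, and it rests on several assumptions: the residual is taken as the high half minus the quantized low half (the abstract does not state the exact form), the codebooks are random placeholders rather than learned ones, and the dual attention module is omitted. All function names and sizes are hypothetical.

```python
import random

def nearest(codebook, vec):
    """Return (index, codeword) of the entry closest to vec in squared L2 distance."""
    best = min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )
    return best, codebook[best]

def split_residual_quantize(latent, cb_low, cb_res):
    """Split the latent into two equal halves and quantize them with two codebooks."""
    half = len(latent) // 2
    z_low, z_high = latent[:half], latent[half:]
    # First quantizer: the low half of the latent.
    i_low, q_low = nearest(cb_low, z_low)
    # Assumption: the residual is the high half minus the quantized low half.
    residual = [h - q for h, q in zip(z_high, q_low)]
    # Second quantizer: the residual between the two halves.
    i_res, _ = nearest(cb_res, residual)
    return i_low, i_res

def split_residual_dequantize(i_low, i_res, cb_low, cb_res):
    """Rebuild the latent: low half from cb_low, high half as low + residual."""
    q_low = cb_low[i_low]
    q_high = [a + b for a, b in zip(q_low, cb_res[i_res])]
    return list(q_low) + q_high

# Hypothetical sizes: 8-dimensional latent, 16 codewords per codebook.
rng = random.Random(0)
dim, n_codes = 8, 16
cb_low = [[rng.gauss(0, 1) for _ in range(dim // 2)] for _ in range(n_codes)]
cb_res = [[rng.gauss(0, 1) for _ in range(dim // 2)] for _ in range(n_codes)]

latent = [rng.gauss(0, 1) for _ in range(dim)]
codes = split_residual_quantize(latent, cb_low, cb_res)
recon = split_residual_dequantize(*codes, cb_low, cb_res)
```

In this sketch each frame is transmitted as two indices, one per codebook, so the bit cost per frame is the sum of the two codebooks' index widths.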