HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes

Published: 26 Mar 2024, Last Modified: 26 Mar 2024, Accepted by TMLR
Abstract: Vector quantization (VQ) is a technique to deterministically learn features with discrete codebook representations. It is commonly performed with a variational autoencoding model, VQ-VAE, which can be further extended to hierarchical structures for making high-fidelity reconstructions. However, such hierarchical extensions of VQ-VAE often suffer from the codebook/layer collapse issue, where the codebook is not efficiently used to express the data, which in turn degrades reconstruction accuracy. To mitigate this problem, we propose a novel unified framework to stochastically learn hierarchical discrete representations on the basis of the variational Bayes framework, called hierarchically quantized variational autoencoder (HQ-VAE). HQ-VAE naturally generalizes the hierarchical variants of VQ-VAE, such as VQ-VAE-2 and residual-quantized VAE (RQ-VAE), and provides them with a Bayesian training scheme. Our comprehensive experiments on image datasets show that HQ-VAE enhances codebook usage and improves reconstruction performance. We also validated HQ-VAE in terms of its applicability to a different modality with an audio dataset.
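Illustrative sketch (not taken from the paper): the abstract's notions of deterministic quantization and codebook usage can be made concrete with a toy example. The code below assigns each feature vector to its nearest codeword and reports the perplexity of the resulting code assignments, one common way to measure how much of a codebook is in use. The function names `quantize` and `codebook_perplexity`, the toy codebook, and the choice of perplexity as the usage measure are assumptions for illustration only; this is plain nearest-neighbor VQ, not the stochastic HQ-VAE training scheme.

```python
import math
from collections import Counter


def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codeword.

    Uses squared Euclidean distance; ties go to the lowest index.
    """
    indices = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def codebook_perplexity(indices):
    """Return exp(entropy) of the empirical distribution over used codes.

    A value close to the codebook size means the codes are used evenly;
    a value far below it is a symptom of codebook collapse.
    """
    counts = Counter(indices)
    n = len(indices)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return math.exp(entropy)


# Hypothetical toy data: four codewords, but the features only
# ever land near the first two of them.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (9.0, 9.0)]
features = [(0.1, -0.2), (0.9, 1.2), (0.2, 0.1), (1.1, 0.8)]

indices = quantize(features, codebook)
perplexity = codebook_perplexity(indices)
print(indices, perplexity)
```

In this toy setting only two of the four codewords are ever selected, so the perplexity sits well below the codebook size. That gap is the kind of underuse the abstract refers to as codebook collapse.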
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=1rowoeUM5E&referrer=%5BTMLR%5D(%2Fgroup%3Fid%3DTMLR)
Changes Since Last Submission: In response to the comments from the action editors on the last submission, we have made the following modifications to the current manuscript.
- We have conducted additional experiments on image generation tasks and added subsections describing them to Section 5 and Appendix D. In these subsections, our RSQ-VAE and SQ-VAE-2 are compared with current VQ-based generative models on FFHQ and ImageNet. The experimental results are shown in Tables 3, 4, 6, and 7 and Figures 15 and 16.
- We have enhanced the visibility of the error plots in Figures 2 and 3 and explained how we compute the errors in Section 5.
- We have added a section describing the similarities and differences between SQ-VAE and HQ-VAE to Appendix A.
- We have proofread the entire manuscript, including both the main text and the appendices.
Supplementary Material: zip
Assigned Action Editor: ~Ole_Winther1
Submission Number: 2006