Title: RESTRUCTURING VECTOR QUANTIZATION WITH THE ROTATION TRICK

Abstract: Vector Quantized Variational AutoEncoders (VQ-VAEs) are powerful generative models that compress continuous inputs into discrete latent representations using a codebook. A critical challenge in VQ-VAE training is the non-differentiable nature of the vector quantization step, typically addressed by a Straight-Through Estimator (STE). However, the STE completely bypasses the quantization operation, leading to a loss of crucial geometric information and often resulting in sub-optimal performance, training instabilities, and under-utilized codebooks. In this work, we introduce the **Rotation Trick**, a novel and geometrically-informed method for propagating gradients through the vector quantization layer. Our approach smoothly transforms each encoder output into its corresponding codebook vector via a rotation and rescaling linear transformation. Crucially, this transformation is treated as a constant during backpropagation, allowing the relative magnitude and angle between the encoder output and its chosen codebook vector to be explicitly encoded into the gradient. This rich geometric information then propagates back to the encoder, enabling more effective and stable updates. We rigorously evaluate the Rotation Trick across 11 diverse VQ-VAE training paradigms and consistently demonstrate significant improvements in reconstruction metrics, codebook utilization, and quantization error. Our code is available at https://github.com/cfifty/rotation_trick.

Section: INTRODUCTION
Vector quantization (VQ) (Gray, 1984) is a fundamental technique for discretizing continuous vector spaces, mapping vectors to the nearest element within a finite set known as a codebook. While historically crucial in information theory (Cover, 1999), its non-differentiable nature poses a significant obstacle for modern deep learning architectures, preventing direct gradient flow during backpropagation (Huh et al., 2023). This challenge is particularly acute in Vector Quantized-Variational AutoEncoders (VQ-VAEs) (Van Den Oord et al., 2017), where a VQ layer quantizes the learned representation at the bottleneck. Despite their ubiquity in state-of-the-art generative modeling (Rombach et al., 2022; Dhariwal et al., 2020; Brooks et al., 2024), VQ-VAEs cannot pass gradients directly from the decoder to the encoder because of this non-differentiable layer.

The prevailing solution is the "straight-through estimator" (STE) (Bengio et al., 2013), which approximates gradients by copying them directly from the decoder's input to the encoder's output, completely bypassing the quantization operation. This approximation, however, comes with substantial drawbacks: it frequently leads to sub-optimal model performance, significant training instabilities, and the undesirable phenomenon of "codebook collapse," where a large portion of the codebook vectors become under-utilized or entirely unused (Mentzer et al., 2023; Dhariwal et al., 2020). These issues severely limit the information capacity of the VQ-VAE's bottleneck. Current research efforts to mitigate these problems can be broadly categorized into two main areas: (1) methods that circumvent the STE and (2) methods that enhance codebook-model interactions.
Sidestepping the STE. Prior works have explored alternatives to deterministic vector quantization to address STE-related issues. Baevski et al. (2019) utilize the Gumbel-Softmax trick (Jang et al., 2016) to learn a categorical distribution over codebook vectors, which eventually hardens to a one-hot distribution. Gautam et al. (2023) propose quantization via a convex combination of codebook vectors, and Takida et al. (2022) introduce stochastic quantization. Diverging from probabilistic approaches, Huh et al. (2023) employ an alternating optimization scheme where the encoder learns representations close to codebook vectors, while the decoder reconstructs from fixed codebook inputs. Although these methods circumvent STE's training instabilities, they often introduce their own complexities, such as reduced codebook utilization during inference, the need for careful temperature schedule tuning (Zhang et al., 2023), or intricate multi-stage optimization. Consequently, a vast number of state-of-the-art generative modeling applications and research continue to rely on VQ-VAEs trained with the STE (Rombach et al., 2022; Chang et al., 2022; Huang et al., 2023; Zhu et al., 2023; Dong et al., 2023).

Codebook-Model Improvements. Another avenue for addressing codebook collapse or under-utilization involves modifying the codebook lookup mechanism or the codebook learning process itself. Examples include using cosine similarity (Yu et al., 2021) or hyperbolic metrics (Goswami et al., 2024) instead of Euclidean distance for codebook lookups, or stochastically sampling codes based on the distance between encoder output and codebook vectors (Lee et al., 2022). Other works focus on dynamic codebook management: Kolesnikov et al. (2022) split highly-used codebook vectors, while Dhariwal et al. (2020); Łańcucki et al. (2020); Zheng & Vedaldi (2023) re-initialize or "resurrect" under-utilized vectors. Chen et al. (2024) dynamically select from multiple codebooks, and Mentzer et al. (2023); Zhao et al. (2024); Yu et al. (2023); Chiu et al. (2022) fix codebook vectors to a predefined geometry, eliminating codebook learning altogether. Additionally, various loss penalties have been proposed to encourage codebook utilization, such as KL-divergence penalties (Zhang et al., 2023) or entropy loss terms (Yu et al., 2023). While these methods effectively target specific training difficulties, they fundamentally operate within the STE framework, meaning the inherent training instabilities and information loss caused by this estimator persist. The **Rotation Trick**, introduced in this paper, is orthogonal to these approaches and can significantly enhance their performance by providing a more informed gradient signal, as demonstrated across a wide range of VQ-VAE training paradigms in Section 5.

Section: STRAIGHT THROUGH ESTIMATOR (STE)
In this section, we review the Straight-Through Estimator (STE) and visualize its effect on the gradients. We then explore two STE alternatives that, at first glance, appear to correct the approximation made by the STE.
For notation, we define a sample space $\mathcal{X}$ over the input data with probability distribution $p$. For input $x \in \mathcal{X}$, we define the encoder as a deterministic mapping that parameterizes a posterior distribution $p_E(e \mid x)$. The vector quantization layer, $Q(\cdot)$, is a function that selects the codebook vector $q \in \mathcal{C}$ nearest to the encoder output $e$. Under Euclidean distance, it has the form:
$$Q(q = i \mid e) = \begin{cases} 1 & \text{if } i = \arg\min_{1 \le j \le |\mathcal{C}|} \|e - q_j\|_2 \\ 0 & \text{otherwise} \end{cases}$$
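The lookup above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the function name `quantize` and the toy codebook are assumptions made for the example.

```python
import numpy as np

def quantize(e, codebook):
    """Return (indices, q), where q[n] is the codebook vector nearest to e[n]."""
    # Pairwise squared Euclidean distances between N outputs and K codes: (N, K).
    d2 = ((e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)          # arg min over the codebook for each output
    return idx, codebook[idx]

# Hypothetical toy codebook with three 2-D vectors.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
e = np.array([[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]])
idx, q = quantize(e, codebook)
print(idx)   # index of the selected codebook vector for each encoder output
```

The squared distance is used for the arg min because it selects the same vector as the unsquared norm.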
The decoder is similarly defined as a deterministic mapping that parameterizes the conditional distribution over reconstructions $p_D(\hat{x} \mid q)$. As in the VAE (Kingma & Welling, 2013), the loss function follows from the ELBO with the KL-divergence term zeroing out as $p_E(e \mid x)$ is deterministic and the utilization over codebook vectors is assumed to be uniform. Van Den Oord et al. (2017) additionally add a "codebook loss" term $\|\mathrm{sg}(e) - q\|_2^2$ to learn the codebook vectors and a "commitment loss" term $\beta\|e - \mathrm{sg}(q)\|_2^2$ to pull the encoder's output towards the codebook vectors. Here $\mathrm{sg}$ stands for stop-gradient and $\beta$ is a hyperparameter, typically set to a value in $[0.25, 2]$. For predicted reconstruction $\hat{x}$, the optimization objective becomes:
$$\mathcal{L}(\hat{x}) = \|x - \hat{x}\|_2^2 + \|\mathrm{sg}(e) - q\|_2^2 + \beta\|e - \mathrm{sg}(q)\|_2^2$$
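As a short worked step (our own expansion, not taken from the original text), the stop-gradient makes each auxiliary term update only one of its two arguments:

$$\nabla_q \|\mathrm{sg}(e) - q\|_2^2 = 2(q - e), \qquad \nabla_e \|\mathrm{sg}(e) - q\|_2^2 = 0,$$
$$\nabla_e\, \beta\|e - \mathrm{sg}(q)\|_2^2 = 2\beta(e - q), \qquad \nabla_q\, \beta\|e - \mathrm{sg}(q)\|_2^2 = 0.$$

The codebook loss therefore moves $q$ towards $e$, while the commitment loss moves $e$ towards $q$ with strength controlled by $\beta$.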
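For the subsequent analysis, we primarily focus on the $\|x - \hat{x}\|_2^2$ term, as the other two do not directly depend on the decoder's output. During backpropagation, the model must differentiate through the vector quantization layer. The total gradient $\frac{\partial \mathcal{L}}{\partial x}$ is composed as $\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial q}\frac{\partial q}{\partial e}\frac{\partial e}{\partial x}$, where $\frac{\partial \mathcal{L}}{\partial q}$ represents backpropagation through the decoder, $\frac{\partial q}{\partial e}$ through the vector quantization layer, and $\frac{\partial e}{\partial x}$ through the encoder. Since vector quantization is a non-smooth, non-differentiable operation, $\frac{\partial q}{\partial e}$ cannot be directly computed, thereby blocking gradient flow to the encoder.
To solve the issue of non-differentiability, the STE copies the gradients from $q$ to $e$, bypassing vector quantization entirely. Simply, the STE sets $\frac{\partial q}{\partial e}$ to the identity matrix $I$ in the backward pass:
$$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial q}\, I\, \frac{\partial e}{\partial x}$$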
The first two terms $\frac{\partial \mathcal{L}}{\partial q}\frac{\partial q}{\partial e}$ combine to $\frac{\partial \mathcal{L}}{\partial e}$ which, somewhat misleadingly, does not actually depend on $e$. As a consequence, the location of $e$ within the Voronoi partition generated by codebook vector $q$, be it close to $q$ or at the boundary of the region, has no impact on the gradient update to the encoder.
An example of this effect is visualized in Figure 2 for two example functions. In the STE approximation, the "exact" gradient at the encoder output is replaced by the gradient at the corresponding codebook vector for each Voronoi partition, irrespective of where in that region the encoder output $e$ lies. As a result, the exact gradient field becomes "partitioned" into 16 different regions, one for each of the 16 vectors in the codebook, with every point in a region receiving the same gradient update to the encoder.
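The partitioning effect can be reproduced on a toy problem. The sketch below assumes a hypothetical quadratic loss $L(z) = \|z - t\|^2$ standing in for the decoder; the vectors are arbitrary illustrative values.

```python
import numpy as np

def loss_grad(z, target):
    """Gradient of the toy loss L(z) = ||z - target||^2 with respect to z."""
    return 2.0 * (z - target)

q = np.array([1.0, 1.0])                 # codebook vector shared by all points below
target = np.array([3.0, 0.0])            # hypothetical minimizer of the toy loss
encoder_outputs = np.array([[1.0, 1.1],  # close to q
                            [0.6, 1.4],  # farther from q
                            [1.4, 0.7]])

grad_q = loss_grad(q, target)                             # dL/dq at the codebook vector
ste_grads = np.tile(grad_q, (len(encoder_outputs), 1))    # STE: dq/de = I for every e
exact_grads = loss_grad(encoder_outputs, target)          # gradient evaluated at each e

print(ste_grads)     # identical rows: location of e inside the region is ignored
print(exact_grads)   # rows differ with the location of e
```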
Returning to our question, is there a better way to propagate gradients through the vector quantization layer? At first glance, one may be tempted to estimate the curvature at $q$ and use this information to transform $\frac{\partial q}{\partial e}$ as $q$ moves to $e$. This is accomplished by taking a second-order expansion around $q$ to approximate the value of the loss at $e$:
$$\mathcal{L}_e \approx \mathcal{L}_q + (\nabla_q \mathcal{L})^T (e - q) + \frac{1}{2}(e - q)^T (\nabla_q^2 \mathcal{L})(e - q)$$
Then we can compute the gradient at the point $e$ instead of $q$, up to a second-order approximation, with:
$$\frac{\partial \mathcal{L}}{\partial e} \approx \frac{\partial}{\partial e}\left[\mathcal{L}_q + (\nabla_q \mathcal{L})^T (e - q) + \frac{1}{2}(e - q)^T (\nabla_q^2 \mathcal{L})(e - q)\right] = \nabla_q \mathcal{L} + (\nabla_q^2 \mathcal{L})(e - q)$$
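To make the estimate concrete, the following sketch evaluates it on a hypothetical quadratic loss, for which the second-order expansion is exact. The matrix `A` and vector `b` are illustrative assumptions; in a real model the product $(\nabla_q^2 \mathcal{L})(e - q)$ would come from an automatic-differentiation Hessian-vector product rather than an explicit matrix.

```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])        # symmetric Hessian of the toy loss
b = np.array([1.0, -1.0])

def grad(z):
    """Gradient of the toy loss L(z) = 0.5 z^T A z - b^T z."""
    return A @ z - b

q = np.array([1.0, 0.0])          # codebook vector
e = np.array([0.8, 0.3])          # encoder output mapped to q

hvp = A @ (e - q)                 # Hessian-vector product (grad^2_q L)(e - q)
grad_e_approx = grad(q) + hvp     # second-order estimate of the gradient at e

print(grad_e_approx, grad(e))     # equal here because the toy loss is quadratic
```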
While computing Hessians with respect to model parameters is typically prohibitive in modern deep learning architectures, computing them with respect to only the codebook is feasible. Moreover, as we must only compute $(\nabla_q^2 \mathcal{L})(e - q)$, one may take advantage of efficient Hessian-vector product implementations in deep learning frameworks (Dagréou et al., 2024) and avoid computing the full Hessian matrix.
Extending this idea a step further, we can compute the exact gradient $\frac{\partial \mathcal{L}}{\partial e}$ at $e$ by making two passes through the network. Let $\mathcal{L}_q$ be the loss with the vector quantization layer and $\mathcal{L}_e$ be the loss without vector quantization, i.e. $q = e$ rather than $q = Q(e)$. Then one may form the total loss $\mathcal{L} = \mathcal{L}_q + \lambda \mathcal{L}_e$, where $\lambda$ is a small constant like $10^{-6}$, to scale down the effect of $\mathcal{L}_e$ on the decoder's parameters, and use a gradient scaling multiplier of $\lambda^{-1}$ to reweigh the effect of $\mathcal{L}_e$ on the encoder's parameters to 1. As $\frac{\partial q}{\partial e}$ is non-differentiable, gradients from $\mathcal{L}_q$ will not flow to the encoder.
While seeming to correct the encoder's gradients, replacing the STE with either approach will likely result in worse performance. This is because computing the exact gradient with respect to $e$ is actually the AutoEncoder (Hinton & Zemel, 1993) gradient, the model that VAEs (Kingma & Welling, 2013) and VQ-VAEs (Van Den Oord et al., 2017) were designed to replace given the AutoEncoder's propensity to overfit and difficulty generalizing. Accordingly, using either the Hessian approximation or exact gradients via a double forward pass will cause the encoder to be trained like an AutoEncoder and the decoder to be trained like a VQ-VAE. This mismatch in optimization objectives is likely another contributing factor to the poor performance we observe for both methods in Table 1, and a deeper analysis into these characteristics is presented in Appendix A.3.
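The bookkeeping of the two-pass construction can be written out explicitly. The parameter names $\theta_D$ (decoder) and $\theta_E$ (encoder) are introduced here for illustration, and the sketch assumes the $\lambda^{-1}$ multiplier is applied to the gradient arriving at $e$ and that only the reconstruction term is considered:

$$\frac{\partial \mathcal{L}}{\partial \theta_D} = \frac{\partial \mathcal{L}_q}{\partial \theta_D} + \lambda \frac{\partial \mathcal{L}_e}{\partial \theta_D}, \qquad \frac{\partial \mathcal{L}}{\partial \theta_E} = \lambda^{-1} \cdot \lambda\, \frac{\partial \mathcal{L}_e}{\partial e}\frac{\partial e}{\partial \theta_E} = \frac{\partial \mathcal{L}_e}{\partial e}\frac{\partial e}{\partial \theta_E}.$$

With $\lambda$ small, the decoder is trained almost entirely on the quantized loss, while the encoder receives the unquantized (AutoEncoder) gradient at full scale.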

Section: THE ROTATION TRICK
As established in Section 3, updating the encoder's parameters by either approximating or exactly computing the gradient at the encoder's output is often suboptimal for VQ-VAEs. Furthermore, the STE inherently discards critical information: the precise location of `e` within its quantized region, whether close to `q` or at its boundary, has no influence on the gradient update to the encoder. We posit that explicitly leveraging this geometric relationship between `e` and `q` to transform gradients through $\frac{\partial q}{\partial e}$ could significantly benefit the encoder's gradient updates, offering a substantial improvement over the STE.
Geometrically, the core problem is how to effectively transport the gradient $\nabla_q \mathcal{L}$ from `q` to `e`, and which characteristics of $\nabla_q \mathcal{L}$ and `q` should be preserved during this process. The STE provides one answer: transport the gradient from `q` to `e` by preserving its direction and magnitude. This paper proposes a different solution: transport the gradient such that the angle between $\nabla_q \mathcal{L}$ and `q` is preserved as $\nabla_q \mathcal{L}$ moves to `e`. We term this approach "the Rotation Trick." In Section 4.3, we show that preserving this angle imparts desirable properties to how points evolve within the same quantized region, leading to enhanced codebook utilization and reduced quantization error.

Section: FORMAL DEFINITION OF THE ROTATION TRICK AND ANGLE PRESERVATION
This section formally defines the Rotation Trick and elucidates its unique property of angle preservation. For an encoder output `e`, let `q = Q(e)` denote its corresponding codebook vector. As `Q(•)` is a non-differentiable operation, direct gradient flow through this layer is impossible during the backward pass. The STE addresses this by maintaining the direction and magnitude of the gradient $\nabla_q \mathcal{L}$ as it moves from `q` to `e`, achieved through a "straight-through" approximation that effectively sets the gradient at the encoder output to the gradient at the decoder's input.
The Rotation Trick, however, introduces a fundamentally different parameterization. It casts the forward pass as a precise rotation and rescaling operation that aligns `e` with `q`. This is achieved as follows:
[Figure: how the gradient at `q` moves to `e` via the STE (middle) and the Rotation Trick (right). The STE "copies-and-pastes" the gradient to preserve its direction, while the Rotation Trick moves the gradient so that the angle between `q` and $\nabla_q \mathcal{L}$ is preserved (proved in Appendix A.6).]

The forward pass is defined as:
$$q = \underbrace{\left[\frac{\|q\|}{\|e\|} R\right]}_{\text{constant}} e$$
Here, `R` denotes the rotation transformation that aligns `e` with `q`, and `∥q∥/∥e∥` is a rescaling factor that matches `e`'s magnitude to `q`'s. Both `R` and `∥q∥/∥e∥` are functions of `e`. To prevent differentiating through this dependency, they are treated as fixed constants (or detached from the computational graph in deep learning frameworks) during backpropagation. This crucial choice is further elaborated in Appendix A.7.
Although the Rotation Trick does not alter the output of the forward pass, it fundamentally redefines the backward pass. Instead of setting $\frac{\partial q}{\partial e} = I$, as in the STE, the Rotation Trick sets $\frac{\partial q}{\partial e}$ to be a dynamic rotation and rescaling transformation:
$$\frac{\partial q}{\partial e} = \frac{\|q\|}{\|e\|} R$$
Consequently, $\frac{\partial q}{\partial e}$ is no longer a static identity matrix but adapts to the position of `e` within the Voronoi partition of `q`. A key outcome of this formulation is that the angle between $\nabla_q \mathcal{L}$ and `q` is preserved as $\nabla_q \mathcal{L}$ propagates to `e`. This effect is visualized in Figure 3. While the STE merely translates the gradient from `q` to `e`, the Rotation Trick rotates it to maintain this angular relationship. In essence, the Rotation Trick and STE are both gradient estimators, but they diverge in their choice of desiderata: the STE preserves direction and magnitude, while the Rotation Trick preserves the angle between the gradient and the codebook vector as it flows around the non-differentiable vector quantization operation to the encoder.
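A brief sketch of why the angle is preserved (our own short argument under the definitions above; the paper's full proof is in Appendix A.6). Write $\lambda = \|q\| / \|e\|$ and $\nabla_e \mathcal{L}$ for the gradient that reaches `e`. Since $R$ is orthogonal and $q = \lambda R e$, we have $\nabla_e \mathcal{L} = \lambda R^T \nabla_q \mathcal{L}$ and $e = \lambda^{-1} R^T q$, so

$$\langle \nabla_e \mathcal{L},\, e \rangle = \langle \lambda R^T \nabla_q \mathcal{L},\, \lambda^{-1} R^T q \rangle = \langle \nabla_q \mathcal{L},\, q \rangle, \qquad \|\nabla_e \mathcal{L}\|\,\|e\| = \lambda \|\nabla_q \mathcal{L}\| \cdot \lambda^{-1} \|q\| = \|\nabla_q \mathcal{L}\|\,\|q\|.$$

Dividing the first identity by the second shows that the cosine of the angle between $\nabla_e \mathcal{L}$ and $e$ equals the cosine of the angle between $\nabla_q \mathcal{L}$ and $q$.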

Section: EFFICIENT ROTATION COMPUTATION
The rotation transformation `R` that aligns `e` with `q` within the plane spanned by both vectors can be computed efficiently using Householder matrix reflections. We define the normalized vectors `ê = e/∥e∥` and `q̂ = q/∥q∥`, the scaling factor `λ = ∥q∥/∥e∥`, and the bisector vector `r = (ê + q̂)/∥ê + q̂∥`. The combined rotation and rescaling operation that aligns `e` to `q` is then simply:
$$q = \lambda R e = \lambda\left(I - 2rr^T + 2\hat{q}\hat{e}^T\right)e = \lambda\left[e - 2rr^Te + 2\hat{q}\hat{e}^Te\right]$$
The detailed derivation of this formula is provided in Appendix A.5. This parameterization of the rotation is highly advantageous as it avoids the computationally expensive calculation of outer products, thereby consuming minimal GPU VRAM. Furthermore, our empirical evaluations in Section 5 demonstrated no discernible difference in wall-clock training time between VQ-VAEs trained with the STE and those trained with the Rotation Trick, underscoring its practical efficiency.
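The formula above can be sketched in NumPy for a single vector pair. This is an illustrative sketch rather than the released implementation: the function names are our own, and the backward pass is written out by hand as multiplication by $(\lambda R)^T$, which is what treating $\lambda$ and $R$ as constants implies.

```python
import numpy as np

def _unit_frame(e, q):
    """Return (e_hat, q_hat, lam, r) as defined in the text."""
    e_norm, q_norm = np.linalg.norm(e), np.linalg.norm(q)
    e_hat, q_hat = e / e_norm, q / q_norm
    r = e_hat + q_hat                      # undefined when e_hat == -q_hat
    return e_hat, q_hat, q_norm / e_norm, r / np.linalg.norm(r)

def rotate_forward(e, q, v):
    """Compute lam * R v using inner products only (R is never materialized)."""
    e_hat, q_hat, lam, r = _unit_frame(e, q)
    return lam * (v - 2.0 * r * (r @ v) + 2.0 * q_hat * (e_hat @ v))

def rotate_backward(e, q, g):
    """Compute (lam * R)^T g: the gradient passed to e when lam, R are constants."""
    e_hat, q_hat, lam, r = _unit_frame(e, q)
    return lam * (g - 2.0 * r * (r @ g) + 2.0 * e_hat * (q_hat @ g))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

e = np.array([1.0, 2.0, 0.5, -1.0])        # encoder output (arbitrary values)
q = np.array([0.5, 1.5, 1.0, -0.5])        # its codebook vector
grad_q = np.array([0.3, -0.2, 0.7, 0.1])   # gradient arriving from the decoder

q_out = rotate_forward(e, q, e)            # forward pass reproduces q
grad_e = rotate_backward(e, q, grad_q)     # gradient handed to the encoder
print(np.allclose(q_out, q), cosine(grad_q, q), cosine(grad_e, e))
```

Evaluating `r * (r @ v)` right to left is what keeps the cost linear in the dimension: the outer product $rr^T$ is never formed.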

Section: VORONOI PARTITION ANALYSIS
In the context of lossy compression, optimal vector quantization is characterized by both low distortion (quantization error $\|e - q\|_2^2$) and high information capacity (codebook utilization) (Cover, 1999). As demonstrated in Section 5, VQ-VAEs trained with the Rotation Trick consistently achieve these desiderata compared to STE-trained VQ-VAEs, often reducing quantization error by an order of magnitude and substantially increasing codebook usage. This section examines the mechanisms driving these improvements. We analyze how encoder outputs mapped to the same Voronoi region are updated. While the STE applies a uniform update to all points within a given partition, the Rotation Trick adaptively modifies the update based on each point's specific location within the Voronoi region. This allows it to either push points within the same region farther apart or pull them closer together, depending on the direction of the gradient vector. The former capability directly contributes to increased codebook usage, while the latter leads to lower quantization error.
Let θ be the angle between `e` and `q`, and ϕ be the angle between `q` and $\nabla_q \mathcal{L}$.
When $\nabla_q \mathcal{L}$ and `q` point in the same general direction (i.e., -π/2 < ϕ < π/2), encoder outputs with a large angular distance to `q` are pushed farther away than they would be by a standard STE update. Figure 5 illustrates this effect: points with large angular distance (blue regions) move more significantly away from `q` compared to points with low angular distance (ivory regions). The top-right partitions of Figure 4 provide a visual example of this behavior. The two clusters of points at the boundary, which have a relatively large angle to the codebook vector, are pushed away, while the cluster of points with a small angle to the codebook vector moves cohesively with it. This ability to push points at the boundary out of a quantized region and into another is desirable for increasing codebook utilization, particularly when points are directed towards previously unused codebook vectors. This capability is absent in the STE, which applies a uniform displacement to all points within a region.
Conversely, when $\nabla_q \mathcal{L}$ and `q` point in opposite directions (i.e., π/2 < ϕ < 3π/2), the distance among points within the same Voronoi region decreases as they are pulled towards the location of the updated codebook vector. This "pulling" effect is visualized in Figure 5 (green regions), with the bottom partitions of Figure 4 providing a concrete example. Unlike the STE update, which preserves inter-point distances, the Rotation Trick pulls points with high angular distances closer towards the post-update codebook vector. This capability is desirable for reducing quantization error and enabling the encoder to "lock on" (Van Den Oord et al., 2017) to a target codebook vector more effectively.
Collectively, these two capabilities create a powerful "push-pull" effect that simultaneously achieves both key desiderata of vector quantization: increasing information capacity and reducing distortion. Encoder outputs exhibiting a large angular distance to their chosen codebook vector are "pushed" by outwards-pointing gradients into other, potentially under-utilized, codebook regions, thereby significantly boosting codebook utilization. Concurrent with this, center-pointing gradients "pull" points loosely clustered around the codebook vector closer together, effectively "locking on" to the chosen codebook vector and substantially reducing quantization error.
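The difference between a uniform and a location-dependent update can be checked numerically. The sketch below uses two hypothetical points assigned to the same codebook vector and compares their separation before and after one update step. It only demonstrates that the STE preserves inter-point distance while the Rotation Trick changes it; whether the distance grows or shrinks depends on the gradient direction and on the sign convention of the update, which this toy example does not attempt to characterize.

```python
import numpy as np

def rotation_trick_grad(e, q, g):
    """Gradient passed to e: (lam * R)^T g, with lam and R treated as constants."""
    e_norm, q_norm = np.linalg.norm(e), np.linalg.norm(q)
    e_hat, q_hat = e / e_norm, q / q_norm
    r = (e_hat + q_hat) / np.linalg.norm(e_hat + q_hat)
    return (q_norm / e_norm) * (g - 2.0 * r * (r @ g) + 2.0 * e_hat * (q_hat @ g))

q = np.array([2.0, 0.0])                  # shared codebook vector
g = np.array([1.0, 0.5])                  # gradient at q (arbitrary values)
lr = 0.1
# Two points at angles +0.5 and -0.4 radians from q, both with norm 1.5.
points = 1.5 * np.array([[np.cos(0.5), np.sin(0.5)],
                         [np.cos(-0.4), np.sin(-0.4)]])

ste_new = points - lr * g                 # same displacement for every point
rot_new = np.array([p - lr * rotation_trick_grad(p, q, g) for p in points])

def separation(pts):
    return float(np.linalg.norm(pts[0] - pts[1]))

print(separation(points), separation(ste_new), separation(rot_new))
```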

Section: FURTHER ANALYSIS
The Appendix contains several supplementary analyses that further explore the properties and implications of the Rotation Trick. Specifically, Appendix A.2 presents a comparison between the Rotation Trick and the STE in a non-convex synthetic optimization example, providing additional insights into their distinct behaviors. Appendix A.4 investigates the Rotation Trick's behavior when encoder outputs and codebook vectors are far from the origin. Appendix A.8 offers an analysis of using a reflection-based transformation instead of a rotation, highlighting its potential drawbacks. Finally, Appendix A.9 examines the effect of scaling the gradient's norm by $\|q\| / \|e\|$ and explores alternative scaling factors.

Section: EXPERIMENTS
In Section 4.3, our Voronoi partition analysis theoretically demonstrated how the Rotation Trick's unique gradient propagation mechanism could increase codebook utilization and reduce quantization error by adaptively updating points within the same Voronoi region. To empirically validate these theoretical benefits and assess their practical impact, this section presents a comprehensive experimental evaluation of the Rotation Trick across a wide array of VQ-VAE paradigms.
Our evaluation begins with fundamental image reconstruction tasks, training a VQ-VAE with the objective function proposed by Van Den Oord et al. (2017). We then progressively extend our analysis to more complex and state-of-the-art architectures, including VQGANs (Esser et al., 2021), the VQGANs specifically designed for latent diffusion models (Rombach et al., 2022), and the ViT-VQGAN (Yu et al., 2021). Finally, we assess VQ-VAE reconstructions in the video domain, employing a TimeSformer (Bertasius et al., 2021) as both the encoder and decoder. Due to space constraints, the detailed video results are presented in Appendix A.1. In total, our empirical analysis encompasses 11 distinct VQ-VAE configurations. For all experiments, the models, hyperparameters, and training settings are kept identical to ensure a fair comparison, with the sole difference being the method of handling $\frac{\partial q}{\partial e}$ during backpropagation. A complete description of these settings is provided in Appendix A.10.

Section: VQ-VAE EVALUATION
We initiate our experimental analysis with a fundamental evaluation: training a VQ-VAE to reconstruct images from the challenging ImageNet dataset (Deng et al., 2009). Following the methodology of Van Den Oord et al. (2017), our training objective is a linear combination of the reconstruction loss, codebook loss, and commitment loss:
$$\mathcal{L} = \|x - \hat{x}\|_2^2 + \|\mathrm{sg}(e) - q\|_2^2 + \beta\|e - \mathrm{sg}(q)\|_2^2$$
where β is a hyperparameter scaling constant. Consistent with common practice, we omit the explicit codebook loss term from the objective and instead update the codebook vectors using an exponential moving average (EMA) with a decay rate of 0.8.
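An EMA codebook update can be sketched as follows. This follows the commonly used count-based formulation associated with Van Den Oord et al. (2017) as we understand it; the function name, the running statistics `cluster_size` and `embed_sum`, and the `eps` guard against division by zero are assumptions of this sketch, not details confirmed by the text.

```python
import numpy as np

def ema_codebook_update(cluster_size, embed_sum, e, idx, decay=0.8, eps=1e-5):
    """One EMA step. Returns (codebook, cluster_size, embed_sum).

    cluster_size: (K,) running count of assignments per code.
    embed_sum:    (K, D) running sum of encoder outputs assigned to each code.
    e:            (N, D) encoder outputs; idx: (N,) selected code indices.
    """
    num_codes = cluster_size.shape[0]
    one_hot = np.eye(num_codes)[idx]                          # (N, K) assignments
    cluster_size = decay * cluster_size + (1.0 - decay) * one_hot.sum(axis=0)
    embed_sum = decay * embed_sum + (1.0 - decay) * (one_hot.T @ e)
    codebook = embed_sum / np.maximum(cluster_size, eps)[:, None]
    return codebook, cluster_size, embed_sum

# Hypothetical state: two codes, both statistics initialized from the codebook.
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
cluster_size = np.ones(2)
embed_sum = codebook.copy()

e = np.array([[0.2, 0.0], [0.0, 0.2]])    # both outputs are assigned to code 0
idx = np.array([0, 0])
new_codebook, cluster_size, embed_sum = ema_codebook_update(cluster_size, embed_sum, e, idx)
print(new_codebook)   # code 0 moves towards its assigned points; code 1 is unchanged
```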
Evaluation Settings. For 256 × 256 × 3 input images, we conduct evaluations under two distinct settings: (1) compression to a latent space of dimension 32 × 32 × 32 with a codebook size of 1024, following Yu et al. (2021), and (2) compression to 64 × 64 × 3 with a codebook size of 8192, following Rombach et al. (2022). In both settings, we compare performance using both Euclidean and cosine similarity for codebook lookup.
Evaluation Metrics. We log both training and validation set reconstruction metrics. Notably, we compute reconstruction FID (r-FID) (Heusel et al., 2017) and reconstruction IS (r-IS) (Salimans et al., 2016) on reconstructions from the full ImageNet validation set as measures of reconstruction quality. Additionally, we quantify codebook usage (the percentage of codebook vectors activated per batch) as an indicator of the vector quantization layer's information capacity, and quantization error $\|e - q\|_2^2$ as a direct measure of distortion.
Baselines. Our comparative analysis includes the standard STE estimator (VQ-VAE), stochastic quantization with Gumbel-Softmax (Baevski et al., 2019) (Gumbel VQ-VAE), the Hessian approximation method described in Section 3 (VQ-VAE w/ Hessian Approx), the exact gradient backward pass from Section 3 (VQ-VAE w/ Exact Gradients), and our proposed Rotation Trick (VQ-VAE w/ Rotation Trick). All methods share identical architectures, hyperparameters, and training settings, which are summarized in Table 8 of the Appendix. There is no functional difference in the forward pass among these methods; the distinctions lie solely in how gradients are propagated through $\frac{\partial q}{\partial e}$ during backpropagation.
Results. Table 1 presents our findings. We observe that employing the Rotation Trick consistently reduces the quantization error, often by an order of magnitude, and significantly improves codebook utilization, particularly in scenarios where it was initially low. Both results agree with our Voronoi partition analysis in Section 4.3: the Rotation Trick pushes points at the boundary of quantized regions towards under-utilized codebook vectors, while simultaneously condensing points loosely grouped around a codebook vector more tightly towards it. These two features have a strong positive effect on reconstruction metrics, as training a VQ-VAE with the Rotation Trick substantially improves both r-FID and r-IS.
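The two quantization metrics described above can be computed per batch along these lines. This is a minimal sketch: the function names are hypothetical, and averaging the squared error over the batch is our assumption about how the per-vector quantity $\|e - q\|_2^2$ is aggregated.

```python
import numpy as np

def codebook_usage(idx, codebook_size):
    """Percentage of codebook vectors selected at least once in the batch."""
    return 100.0 * np.unique(idx).size / codebook_size

def quantization_error(e, q):
    """Squared Euclidean distance ||e - q||_2^2, averaged over the batch."""
    return float(((e - q) ** 2).sum(axis=-1).mean())

idx = np.array([0, 2, 2, 5])                 # selected codes for a batch of four
usage = codebook_usage(idx, codebook_size=8)

e = np.array([[1.0, 0.0], [0.0, 1.0]])
q = np.array([[1.0, 1.0], [0.0, 0.0]])
err = quantization_error(e, q)
print(usage, err)
```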
Furthermore, our experiments reveal that the Hessian Approximation and Exact Gradients approaches consistently yield poor reconstruction performance. While these methods purport to provide "more accurate" gradients to the encoder, training the encoder in a manner akin to a traditional AutoEncoder (Hinton & Zemel, 1993) likely leads to overfitting and impaired generalization capabilities. Moreover, the inherent mismatch in training objectives between the encoder and decoder under these methods is a significant aggravating factor, contributing to their observed poor performance.

Section: VQGAN EVALUATION
Advancing to the next level of architectural complexity, we rigorously evaluate the impact of the Rotation Trick on VQGANs (Esser et al., 2021). The VQGAN training objective is defined as:
$$\mathcal{L}_{\text{VQGAN}} = \mathcal{L}_{\text{Per}} + \|\mathrm{sg}(e) - q\|_2^2 + \beta\|e - \mathrm{sg}(q)\|_2^2 + \lambda \mathcal{L}_{\text{Adv}}$$
Here, $\mathcal{L}_{\text{Per}}$ represents the perceptual loss, as introduced by Johnson et al. (2016), which replaces the $L_2$ loss typically used in VQ-VAE training. $\mathcal{L}_{\text{Adv}}$ is a patch-based adversarial loss, conceptually similar to the adversarial loss employed in Conditional GANs (Isola et al., 2017). $\beta$ is a constant weighting the commitment loss, while $\lambda$ is an adaptively determined weight based on the ratio of $\nabla \mathcal{L}_{\text{Per}}$ to $\nabla \mathcal{L}_{\text{Adv}}$ with respect to the last layer of the decoder.
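For reference, the adaptive weight is commonly written in the following form, following Esser et al. (2021) as we understand it. The subscript $G_L$ (the last decoder layer), the use of gradient norms, and the stabilizing constant $\delta$ are notational assumptions of this sketch:

$$\lambda = \frac{\left\|\nabla_{G_L} \mathcal{L}_{\text{Per}}\right\|}{\left\|\nabla_{G_L} \mathcal{L}_{\text{Adv}}\right\| + \delta}$$

When the adversarial gradient is small relative to the perceptual gradient, $\lambda$ grows so that both terms contribute at a comparable scale.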
Experimental Settings. We evaluate VQGANs across two distinct and widely-used paradigms: (1) the configuration optimized for autoregressive modeling with Transformers, as detailed in Esser et al. (2021), and (2) the configuration tailored for latent diffusion models, as described in Rombach et al. (2022). The first setting utilizes the convolutional neural network architecture and default hyperparameters specified by Esser et al. (2021), while the second adheres to the specifications from Rombach et al. (2022). A comprehensive description of both training settings, including all hyperparameters, is provided in Table 9 of the Appendix.
Results. Our empirical results are meticulously presented in Table 2 for the autoregressive setting and Table 3 for the latent diffusion setting. Consistent with our findings in Section 5.1 for basic VQ-VAEs, we observe that training VQGANs with the Rotation Trick leads to a substantial decrease in quantization error and a marked improvement in codebook usage. Furthermore, reconstruction performance, as quantified on the validation set by the total loss, r-FID, and r-IS, is consistently improved across both complex modeling paradigms. These results underscore the Rotation Trick's robustness and efficacy in enhancing VQGAN training.

Section: VIT-VQGAN EVALUATION
Building upon the success of VQGANs, Yu et al. (2021) proposed the ViT-VQGAN, which integrates a Vision Transformer (ViT) (Dosovitskiy, 2020) as the backbone for both the encoder and decoder. This architecture further enhances performance by leveraging the powerful representational capabilities of Transformers. The ViT-VQGAN also incorporates factorized codes and L2 normalization on the input and output of the vector quantization layer to improve training stability and overall performance. Additionally, the authors modified the training objective by introducing a logit-Laplace loss and reinstating the L2 reconstruction error alongside the adversarial and commitment losses.
Experimental Settings. We closely follow the open-source implementation provided by https://github.com/thuanz123/enhancing-transformers and utilize the default model and hyperparameter settings for the small ViT-VQGAN. A complete and detailed description of the training settings can be found in Table 10 of the Appendix.
Results. Table 4 comprehensively summarizes our findings for the ViT-VQGAN. Consistent with our previous observations for VQ-VAEs in Section 5.1 and VQGANs in Section 5.2, we demonstrate that codebook utilization and key reconstruction metrics are significantly improved when training with the Rotation Trick. Notably, in this specific configuration, the quantization error remains roughly comparable to the baseline STE, suggesting that the primary benefits manifest through enhanced codebook dynamics and overall reconstruction quality.

Section: LIMITATIONS
While the Rotation Trick offers substantial improvements, it is important to acknowledge its limitations. A potential issue can arise when encoder outputs (`e`) or codebook vectors (`q`) are constrained to have near-zero norms (i.e., `∥e∥ ≈ 0` or `∥q∥ ≈ 0`). In such scenarios, the angle between `e` and `q` may become obtuse. When this occurs, the Rotation Trick can "over-rotate" the gradient $\nabla_q \mathcal{L}$ as it is transported from `q` to `e`, leading to $\nabla_q \mathcal{L}$ and $\nabla_e \mathcal{L}$ pointing in divergent directions (i.e., the cosine of the angle between $\nabla_e \mathcal{L}$ and $\nabla_q \mathcal{L}$ becomes negative). This undesirable effect is visualized in Figure 6. This is problematic because, when the angle between `e` and `q` is obtuse, the Rotation Trick violates the crucial assumption that $\nabla_q \mathcal{L} \approx \nabla_e \mathcal{L}$ when `e ≈ q`. This can lead to degraded performance compared to VQ-VAEs trained with the STE. Although obtuse angles between `e` and `q` are generally rare, since codebook vectors are "angularly close" to the inputs mapped to them, any architectural or training constraint that forces codewords to have near-zero norms could exacerbate this limitation. Future work could explore adaptive mechanisms or regularization strategies to mitigate this specific scenario, such as dynamically adjusting the scaling factor or introducing angular constraints.
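The over-rotation effect is easy to reproduce in two dimensions, where the transported gradient is simply the original gradient rotated by the angle between `e` and `q`. The vectors below are hypothetical, and the helper re-implements the backward pass by hand under the assumption that $\lambda$ and $R$ are constants.

```python
import numpy as np

def rotation_trick_grad(e, q, g):
    """Gradient passed to e: (lam * R)^T g, with lam and R treated as constants."""
    e_norm, q_norm = np.linalg.norm(e), np.linalg.norm(q)
    e_hat, q_hat = e / e_norm, q / q_norm
    r = (e_hat + q_hat) / np.linalg.norm(e_hat + q_hat)
    return (q_norm / e_norm) * (g - 2.0 * r * (r @ g) + 2.0 * e_hat * (q_hat @ g))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q = np.array([1.0, 0.0])             # codebook vector
g = np.array([0.0, 1.0])             # gradient at q (arbitrary direction)

e_acute = np.array([0.8, 0.6])       # acute angle to q
e_obtuse = np.array([-0.1, 0.05])    # near-zero norm and obtuse angle to q

cos_acute = cosine(rotation_trick_grad(e_acute, q, g), g)
cos_obtuse = cosine(rotation_trick_grad(e_obtuse, q, g), g)
print(cos_acute, cos_obtuse)         # the second value is negative: over-rotation
```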

Section: CONCLUSION
In this work, we embarked on a comprehensive exploration of gradient propagation mechanisms through the non-differentiable vector quantization layer of VQ-VAEs. Our central finding is that preserving the angle—rather than solely the direction—between the codebook vector and the gradient induces profoundly desirable effects on how points within the same codebook region are updated. This geometrically-informed approach, which we term the Rotation Trick, leads to a synergistic "push-pull" effect that simultaneously enhances codebook utilization and reduces quantization error. These fundamental improvements translate directly into substantial gains in overall model performance. Across 11 diverse experimental settings, ranging from basic VQ-VAEs to advanced VQGANs and ViT-VQGANs, we consistently demonstrate that training VQ-VAEs with the Rotation Trick significantly improves their reconstruction quality. For instance, when applied to a VQGAN configuration used in latent diffusion, the Rotation Trick dramatically improved r-FID from 5.0 to 1.1 and r-IS from 141.5 to 200.2, while simultaneously reducing quantization error by two orders of magnitude and boosting codebook usage by an impressive 13.5x. These results underscore the Rotation Trick's efficacy in addressing core limitations of discrete representation learning and offer a robust pathway to more stable and performant VQ-VAE models.

Section: A.1 VIDEO EVALUATION
Expanding our analysis beyond the image modality, we evaluate the effect of the Rotation Trick on video reconstructions from the BAIR Robot dataset (Ebert et al., 2017) and the UCF101 action recognition dataset (Soomro, 2012). We adopt the quantization paradigm used by ViT-VQGAN, but replace the ViT with a TimeSformer (Bertasius et al., 2021) for both the encoder and decoder, as detailed in Appendix A.10.4.

Section: A.2 NUMERICAL SIMULATION
To supplement our analysis in Section 4.3, we include a numerical simulation of vector quantization for minimizing Himmelblau's function (Figure 7). This simulation tracks 100 gradient updates for both the STE and Rotation Trick gradient estimators, highlighting the distinct behaviors of each. Our simulation utilizes an Exponential Moving Average (EMA) with a decay rate of 0.8, as described in Van Den Oord et al. (2017), to update the codebook vectors, and a learning rate of 1e-3 for updating the pre-quantized points.

Section: A APPENDIX
Points for both the STE and the Rotation Trick simulation use the same random initialization for both codewords and pre-quantized vectors. The only difference is whether the STE or the Rotation Trick is used as the gradient estimator through the vector quantization operation.
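The pieces of this simulation can be sketched in a few lines of Python. This is a minimal illustration, not the released simulation code: the helper names (`himmelblau_grad`, `nearest`, `ema_update`, `ste_step`) are hypothetical, the codebook update is a simplified EMA toward the mean of the assigned points, and only the STE variant is shown.

```python
def himmelblau(x, y):
    """Himmelblau's function; one of its global minima is at (3, 2) with value 0."""
    return (x * x + y - 11.0) ** 2 + (x + y * y - 7.0) ** 2


def himmelblau_grad(x, y):
    """Analytic gradient of Himmelblau's function."""
    a = x * x + y - 11.0
    b = x + y * y - 7.0
    return (4.0 * x * a + 2.0 * b, 2.0 * a + 4.0 * y * b)


def nearest(codebook, p):
    """Index of the codeword closest to p in Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: (codebook[k][0] - p[0]) ** 2 + (codebook[k][1] - p[1]) ** 2,
    )


def ema_update(codeword, assigned, decay=0.8):
    """Simplified EMA (assumption): move the codeword toward the mean of its points."""
    if not assigned:
        return codeword
    n = len(assigned)
    mean = (sum(p[0] for p in assigned) / n, sum(p[1] for p in assigned) / n)
    return (
        decay * codeword[0] + (1.0 - decay) * mean[0],
        decay * codeword[1] + (1.0 - decay) * mean[1],
    )


def ste_step(points, codebook, lr=1e-3, decay=0.8):
    """One update: the gradient at each codeword is copied unchanged to its points (STE)."""
    assignments = [nearest(codebook, p) for p in points]
    new_points = []
    for p, k in zip(points, assignments):
        gx, gy = himmelblau_grad(*codebook[k])  # gradient evaluated at the codeword
        new_points.append((p[0] - lr * gx, p[1] - lr * gy))
    new_codebook = [
        ema_update(c, [p for p, a in zip(new_points, assignments) if a == k], decay)
        for k, c in enumerate(codebook)
    ]
    return new_points, new_codebook
```

Calling `ste_step` 100 times from a shared random initialization would mirror the STE half of the comparison; a Rotation Trick variant would differ only in how the codeword gradient is transformed before it is applied to each point.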

Section: A.3 HESSIAN APPROXIMATION AND EXACT GRADIENT ANALYSIS
In this section, we expand upon our analysis in Section 3, providing further intuition as to why employing exact gradients or Hessian approximations of these gradients can lead to undesirable characteristics in VQ-VAE training. We begin by showing, via a Taylor series expansion, that the Hessian approximation matches the exact gradient up to second-order terms. The loss `L_e` can be written exactly as an infinite series around `q`:
$$L_e = L_q + (\nabla_q L)^T (e - q) + \frac{1}{2}(e - q)^T (\nabla_q^2 L)(e - q) + \frac{1}{6}(e - q)^T \nabla_q^3 L(e - q, e - q) + \dots$$
Thus, the loss computed by the Hessian approximation differs from the loss computed with the exact gradients method by the remainder term resulting from truncating the Taylor series after the second-order term:
$$\{L_e\}_{\text{Hessian}} = L_q + (\nabla_q L)^T (e - q) + \frac{1}{2}(e - q)^T (\nabla_q^2 L)(e - q)$$
When differentiating both of these losses to compute the gradients, the difference between the exact gradient update and the Hessian update is:
$$\frac{\partial L_e}{\partial e} - \left\{\frac{\partial L_e}{\partial e}\right\}_{\text{Hessian}} = \frac{\partial}{\partial e} O(\|e - q\|^3)$$
where `O(∥e − q∥^3) = (1/6)(e − q)^T ∇_q^3 L(e − q, e − q) + ...`. Notably, if the loss in each partition is perfectly quadratic, the exact gradient will precisely equal the Hessian approximation.
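As a small worked illustration of this remainder term, consider a one-dimensional cubic loss `L(x) = x^3`. This loss is an assumption chosen purely for illustration and does not come from the experiments. Its exact gradient at `e` is `3e^2`, while the Hessian-based gradient is `L'(q) + L''(q)(e − q)`, so the gap between the two is exactly `3(e − q)^2`.

```python
def exact_grad(e):
    """Exact gradient of the illustrative loss L(x) = x**3 evaluated at e."""
    return 3.0 * e ** 2


def hessian_grad(e, q):
    """Gradient at e predicted from the gradient and curvature at q."""
    grad_q = 3.0 * q ** 2  # L'(q)
    hess_q = 6.0 * q       # L''(q)
    return grad_q + hess_q * (e - q)


def gap(e, q):
    """Difference between the exact and Hessian-based gradients."""
    return exact_grad(e) - hessian_grad(e, q)
```

For a quadratic loss the third derivative vanishes and the same gap would be identically zero, matching the remark above.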
A critical observation is that when `q ≈ e`, `∇qL ≈ ∇eL` for both the STE and the Rotation Trick. However, because the Hessian approximation and exact gradients explicitly utilize the curvature of the loss surface to transport `∇qL` from `q` to `e`, the direction of the gradient can change substantially, even when `q` is very close to `e`. This is a significant departure from the intuitive behavior desired in vector quantization.
The Hessian-based approach described in Section 3 approximates the gradients to the encoder as if quantization did not occur, effectively mimicking the gradient used to update the encoder in the original AutoEncoder (Hinton & Zemel, 1993) model.
We now delve into specific instances where exact gradients or their Hessian approximations may produce undesirable behavior in vector quantization. A core inductive bias (Baxter, 2000) for effective vector quantization is that when `e` is "close" to `q`, their gradients should also be "close"—i.e., if `e ≈ q`, then `∇ e L ≈ ∇ q L`. Intuitively, if the distortion between `e` and `q` is minimal (meaning `q` is an excellent codeword for `e`), these points should ideally move together during a gradient update to maintain or further reduce distortion.
This fundamental assumption holds true for both the STE and Rotation Trick gradients. However, it can be severely violated by the Hessian approximation or exact gradient approaches, particularly when the local curvature around `q` is negative or when the Hessian is indefinite, forming a saddle point.
Figure 9 vividly illustrates three such problematic cases. Since neither the STE nor the Rotation Trick explicitly uses the loss surface to transport `∇ q L` from `q` to `e`, when `q ≈ e`, `∇ q L ≈ ∇ e L` is maintained. In stark contrast, approaches that leverage the curvature around `q`—such as the Hessian approximation or exact gradients—to find or approximate the loss at `e` can cause `∇ e L` to point in a drastically different direction from `∇ q L`, even when `q` is in close proximity to `e`. For instance, the top-left and bottom partitions of Figure 9 show gradients scattering as they move from `q` to points within these partitions due to negative curvature. A similar detrimental effect is observed in the top-right partition of Figure 9 due to the presence of a saddle point, highlighting the instability introduced by curvature-aware gradient transport in these scenarios.

Section: A.4 BEHAVIOR AWAY FROM THE ORIGIN
Unlike the STE, the Rotation Trick is not inherently invariant to the absolute location of the origin. In this section, we rigorously explore this characteristic and its implications for how points within the same Voronoi region are updated. For instance, consider a scenario where each codebook vector and encoder output in Figure 4 were shifted by a constant vector `d` such that all components become positive. The STE, by its nature, is invariant to such global translations. However, as these vectors are translated further away from the origin, the angle between `e` and `q` tends to decrease, consequently diminishing the distinct effect of the Rotation Trick. In the limit, as the displacement approaches infinity, the Rotation Trick smoothly converges to behave identically to the STE.
Figure 11: Illustration of codebook and encoder output shifted away from the origin by a constant vector d. The angle after the shift is smaller than the angle before the shift: θ̂ < θ.
Consider a codebook vector `q` and an encoder output `e` separated by an angle `θ`. We define the translated vectors as `q̂ = q + d` and `ê = e + d`, where `d` is a large displacement vector. Let `θ̂` be the angle between `q̂` and `ê`. This example is visualized in Figure 11. From the law of cosines, we have:
$$\|q - e\|^2 = \|q\|^2 + \|e\|^2 - 2\|q\|\|e\|\cos\theta$$
and
$$\|\hat{q} - \hat{e}\|^2 = \|q - e\|^2 = \|\hat{q}\|^2 + \|\hat{e}\|^2 - 2\|\hat{q}\|\|\hat{e}\|\cos\hat{\theta}$$
Substituting and simplifying, we find that:
$$\cos\hat{\theta} = \frac{\|q\|^2 + \|e\|^2 - 2\|q\|\|e\|\cos\theta - \|q + d\|^2 - \|e + d\|^2}{-2\|q + d\|\|e + d\|}$$
Now, consider the case where `q̂` and `ê` are far from the origin, i.e., `∥d∥ ≫ ∥q∥, ∥e∥`. In this limit, we approximate:
$$\cos\hat{\theta} \approx \frac{-2\|d\|^2}{-2\|d\|^2} = 1$$
This implies that as `∥d∥ → ∞`, `θ̂ → 0`. Consequently, `∥q̂∥/∥ê∥ → 1` and `R → I`, which precisely corresponds to the STE update. Thus, as points move sufficiently far from the origin, the Rotation Trick gracefully and smoothly transforms into the STE.
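The limit can be checked numerically with a short sketch. The vectors and shift sizes below are arbitrary illustrative choices, and `shifted_angle` is a hypothetical helper name.

```python
import math


def angle_between(u, v):
    """Angle in radians between two vectors, with the cosine clamped for safety."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / (nu * nv))))


def shifted_angle(e, q, shift):
    """Angle between e + d and q + d, where d adds `shift` to every coordinate."""
    e_shifted = [a + shift for a in e]
    q_shifted = [b + shift for b in q]
    return angle_between(e_shifted, q_shifted)


e = [1.0, 0.0]
q = [0.0, 1.0]
# The angle starts at pi/2 and shrinks toward zero as the shift grows.
angles = [shifted_angle(e, q, s) for s in (0.0, 10.0, 1000.0)]
```

As the shift grows, the angle shrinks monotonically toward zero, which is the regime in which the rotation approaches the identity.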
We visualize an example of this effect in Figure 10, where each point from Figure 4 is translated by a positive value of ten along each dimension. As predicted by our analysis, the "push" effect of the gradient in the top-right quadrant persists, but its magnitude is reduced, making it more similar to the STE update. Interestingly, the top-left partition now exhibits a "pull" behavior because the gradient's relative direction points towards the origin, causing points within this region to converge. Finally, the gradient in the bottom region is no longer directed towards the origin but becomes more orthogonal to the codebook vector. As a result, we observe more of a rotational effect applied to these points compared to the contraction depicted in Figure 4.

Section: A.5 HOUSEHOLDER REFLECTION TRANSFORMATION
For any given encoder output `e` and codebook vector `q`, the rotation `R` that precisely aligns `e` with `q` within the plane spanned by these two vectors can be computed with high efficiency using Householder matrix reflections.
Definition 1 (Householder Reflection Matrix). For a unit norm vector `a ∈ R^d`, the matrix `I − 2aa^T ∈ R^{d×d}` represents a reflection across the subspace (hyperplane) orthogonal to `a`.
Returning to the context of vector quantization with `q̃ = [(∥q∥/∥e∥) R] e`, we can express `R` as the product of two Householder reflection matrices. These reflections collaboratively rotate `e` to `q` within their shared plane. Without loss of generality, assume `e` and `q` are unit norm vectors, and let `θ` be the angle between them. By setting `r = (e + q)/∥e + q∥` and performing algebraic simplification, we arrive at the following efficient form for `R`:
$$\begin{aligned}
R &= (I - 2qq^T)(I - 2rr^T) \\
&= I - 2qq^T - 2rr^T + 4qq^T rr^T \\
&= I - 2qq^T - 2rr^T + 4q\,[q^T r]\,r^T \\
&= I - 2qq^T - 2rr^T + 4q\left[q^T \frac{e + q}{\|e + q\|}\right] r^T \\
&= I - 2qq^T - 2rr^T + 4q\left[\frac{q^T e + q^T q}{\|e + q\|}\right] r^T \\
&= I - 2qq^T - 2rr^T + 4q\left[\frac{\|q\|\|e\|\cos\theta + \|q\|\|q\|}{\|e + q\|}\right] r^T \\
&= I - 2qq^T - 2rr^T + 4q\left[\frac{\cos\theta + 1}{\|e + q\|}\right] r^T \\
&= I - 2qq^T - 2rr^T + 4q\left[\frac{\|e + q\|^2}{2\|e + q\|}\right] r^T \\
&= I - 2qq^T - 2rr^T + \frac{4\|e + q\|^2}{2\|e + q\|}\, q r^T \\
&= I - 2qq^T - 2rr^T + \frac{4\|e + q\|^2}{2\|e + q\|^2}\, q (e + q)^T \\
&= I - 2qq^T - 2rr^T + 2qe^T + 2qq^T \\
&= I - 2rr^T + 2qe^T
\end{aligned}$$
This derivation demonstrates that the rotation matrix can be constructed without explicitly computing trigonometric functions, relying instead on vector operations and Householder reflections, which are numerically stable and computationally efficient.
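The closed form `R = I − 2rr^T + 2qe^T` can be sketched directly in plain Python. This is an illustrative sketch rather than the released implementation: the helper names are hypothetical, `e` and `q` are normalized inside the function to match the unit-norm assumption above, and the degenerate case `e = −q` (where `e + q` vanishes) is not handled.

```python
import math


def unit(u):
    """Return u scaled to unit Euclidean norm."""
    n = math.sqrt(sum(a * a for a in u))
    return [a / n for a in u]


def matvec(M, v):
    """Matrix-vector product for nested-list matrices."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]


def rotation_matrix(e, q):
    """R = I - 2 r r^T + 2 q e^T for unit-normalized e and q, with r = (e + q)/||e + q||."""
    e_hat, q_hat = unit(e), unit(q)
    r = unit([a + b for a, b in zip(e_hat, q_hat)])  # undefined when e_hat == -q_hat
    d = len(e_hat)
    return [
        [(1.0 if i == j else 0.0) - 2.0 * r[i] * r[j] + 2.0 * q_hat[i] * e_hat[j]
         for j in range(d)]
        for i in range(d)
    ]


e = [1.0, 2.0, 2.0]
q = [0.0, 3.0, 4.0]
R = rotation_matrix(e, q)
rotated = matvec(R, unit(e))  # should coincide with unit(q)
```

Only inner and outer products appear, so no trigonometric function is evaluated when building `R`.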

Section: A.6 PROOF OF ANGLE PRESERVATION IN THE ROTATION TRICK
For a given encoder output `e` and its corresponding codebook vector `q`, we provide a formal proof demonstrating that the Rotation Trick rigorously preserves the angle between `∇_q L` and `q` as `∇_q L` is transported to `e`. Unlike the notation in the main text, which assumes `q ∈ R^{d×1}`, this proof utilizes batch notation to accurately reflect its application in neural network training. Specifically, `q ∈ R^{b×d}` and `R ∈ R^{b×d×d}`, where `b` is the batch size and `d` is the dimension of the codebook vector.
Remark 3. The angle between `q` and `∇ q L` is preserved as `∇ q L` moves to `e`.
Proof. Without loss of generality, let us assume `∥e∥ = ∥q∥ = 1` for simplicity. Under the Rotation Trick's forward pass, we have:
$$\tilde{q} = eR^T, \qquad \frac{\partial \tilde{q}}{\partial e} = R$$
The gradient at `e` in the backward pass will then be:
$$\nabla_e L = \nabla_q L \frac{\partial \tilde{q}}{\partial e} = \nabla_q L\,[R]$$
Let `θ` be the angle between `q` and `∇_q L`, and `ϕ` be the angle between `e` and `∇_e L`. Using the definition of the Euclidean inner product, we can write:
$$\|\nabla_q L\|\cos\theta = q[\nabla_q L]^T = eR^T[\nabla_q L]^T = e[\nabla_q L\,R]^T = e[\nabla_e L]^T = \|\nabla_q L\|\cos\phi$$
where the final equality uses `∥∇_e L∥ = ∥∇_q L∥`, which holds because `R` is orthogonal. Since `∥∇_q L∥` is a non-zero scalar, it follows that `cos θ = cos ϕ`, and because both angles lie in `[0, π]`, `θ = ϕ`. Therefore, the angle between `q` and `∇_q L` is indeed preserved as `∇_q L` moves to `e` under the Rotation Trick. This property is fundamental to the improved gradient dynamics observed.
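The remark can be checked numerically in two dimensions, where the rotation taking `e` to `q` is an ordinary planar rotation. The angles and the gradient below are arbitrary illustrative values, and the row-vector convention `∇_e L = ∇_q L R` from the proof is used.

```python
import math


def rot2d(phi):
    """Counter-clockwise planar rotation by phi radians."""
    c, s = math.cos(phi), math.sin(phi)
    return [[c, -s], [s, c]]


def vecmat(g, M):
    """Row vector times matrix: (g M)_j = sum_i g_i M_ij."""
    return [sum(g[i] * M[i][j] for i in range(2)) for j in range(2)]


def angle(u, v):
    """Angle in [0, pi] between two 2-D vectors."""
    dot = u[0] * v[0] + u[1] * v[1]
    c = dot / (math.hypot(*u) * math.hypot(*v))
    return math.acos(max(-1.0, min(1.0, c)))


a, b = 0.3, 1.4                      # polar angles of e and q (unit norm)
e = [math.cos(a), math.sin(a)]
q = [math.cos(b), math.sin(b)]
R = rot2d(b - a)                     # rotates e onto q

grad_q = [0.3, -1.2]                 # arbitrary gradient at q
grad_e = vecmat(grad_q, R)           # gradient transported to e

theta = angle(q, grad_q)
phi = angle(e, grad_e)
```

Here `theta` and `phi` agree up to floating-point error, and the gradient norm is unchanged because `R` is orthogonal.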

Section: A.7 RATIONALE FOR TREATING R AND ∥q∥/∥e∥ AS CONSTANTS
In the Rotation Trick, the rotation matrix `R` and the scaling factor `∥q∥/∥e∥` are treated as constants and explicitly detached from the computational graph during the forward pass. This section provides a detailed explanation for this design choice.
The Rotation Trick computes the input to the decoder, `q̃`, subsequent to performing a non-differentiable codebook lookup on `e` to determine `q`. The operation is defined as:
$$\tilde{q} = \frac{\|q\|}{\|e\|} R e$$
As previously established in Section 4, both `R` and `∥q∥/∥e∥` are functions of `e` (and implicitly `q` via `Q(e)`). However, by utilizing the quantization function `Q(e) = q`, we can express both `∥q∥/∥e∥` and `R` as a single, composite function of `e`:
$$f(e) = \frac{\|Q(e)\|}{\|e\|}\left[I - 2\left[\frac{e + Q(e)}{\|e + Q(e)\|}\right]\left[\frac{e + Q(e)}{\|e + Q(e)\|}\right]^T + 2Q(e)e^T\right] = \frac{\|q\|}{\|e\|} R$$
The Rotation Trick can then be written as:
$$\tilde{q} = f(e)\,e$$
Differentiating `q̃` with respect to `e` using the product rule would yield:
$$\frac{\partial \tilde{q}}{\partial e} = f'(e)\,e + f(e)$$
However, the term `f′(e)` cannot be computed because its derivation would necessitate differentiating through `Q(e)`, which is inherently a non-differentiable codebook lookup operation. Consequently, we deliberately drop the `f′(e)e` term and utilize only `f(e)` as our approximation of the gradient through the vector quantization layer: `∂q̃/∂e = f(e)`. This approximation is a crucial design decision. It is important to note that this approximation, while simplifying the gradient computation, still conveys significantly more geometric information about the vector quantization operation than the standard STE, which merely sets `∂q̃/∂e = I`. The preserved angular and magnitude relationships, as discussed in Section 4, are maintained through `f(e)`, leading to superior training dynamics.
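A minimal two-dimensional sketch of this design choice follows. It is not the released implementation: the function names are hypothetical, the planar rotation is built from polar angles for brevity, and gradients are written as column vectors so the backward pass is `f(e)^T` applied to the incoming gradient. In an autograd framework the same effect would typically be obtained by wrapping `f(e)` in a stop-gradient (detach) operation.

```python
import math


def scaled_rotation(e, q):
    """f(e) = (||q|| / ||e||) R for 2-D vectors, where R rotates e onto the direction of q."""
    phi = math.atan2(q[1], q[0]) - math.atan2(e[1], e[0])
    scale = math.hypot(*q) / math.hypot(*e)
    c, s = math.cos(phi), math.sin(phi)
    return [[scale * c, -scale * s], [scale * s, scale * c]]


def forward(e, q):
    """Forward pass: q_tilde = f(e) e, which reproduces the codebook vector q."""
    f = scaled_rotation(e, q)
    return [f[0][0] * e[0] + f[0][1] * e[1], f[1][0] * e[0] + f[1][1] * e[1]]


def backward(grad_q, e, q):
    """Backward pass with f(e) held constant: grad_e = f(e)^T grad_q."""
    f = scaled_rotation(e, q)
    return [f[0][0] * grad_q[0] + f[1][0] * grad_q[1],
            f[0][1] * grad_q[0] + f[1][1] * grad_q[1]]


def backward_ste(grad_q):
    """STE backward pass for comparison: the Jacobian is the identity."""
    return list(grad_q)
```

With `e = [2, 0]` and `q = [0, 1]`, the forward pass returns `q`, while the backward pass both rotates the incoming gradient and rescales its norm by `∥q∥/∥e∥ = 1/2`; the STE returns the gradient unchanged.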

Section: A.8 THE REFLECTION TRICK
Figure 12: Illustration of how the gradient at q moves to e via the STE, the Rotation Trick, and the Reflection Trick. The Reflection Trick matches the behavior of the Rotation Trick when the gradient ∇qL is parallel to q. However, it will reverse the components of the gradients orthogonal to q for points in q's partition. This effect is illustrated in the bottom two rows of the rightmost column.

One might consider using a single reflection to align `e` to `q`, rather than a rotation. For instance, using the notation from Appendix A.5, by setting `r = (e − q)/∥e − q∥` and reflecting across the hyperplane orthogonal to this vector via the Householder reflection `(I − 2rr^T)`, `e` will be reflected to `q`. We denote this transformation as `R'` and define the forward pass as `q̃ = (∥q∥/∥e∥) R'e`. We term this alternative approach "the Reflection Trick."
The Reflection Trick, however, can lead to undesirable and unstable behavior during the backward pass. While it mimics the Rotation Trick's behavior when `∇ q L` is perfectly parallel to `q` (as illustrated in the top two rows of Figure 12 and the top-right and bottom regions of Figure 13), it fundamentally differs by reflecting the orthogonal components of the gradient across the hyperplane orthogonal to `e -q`. This means that if the quantized gradient points "left," the reflected gradient will point "right," and vice-versa. This reversal of orthogonal components is highly undesirable, especially for points with low distortion (i.e., `e ≈ q`). In such cases, the Reflection Trick can cause `e` to move *away* from `q` along the gradient components orthogonal to `q`, thereby increasing distortion for points that are otherwise a "good match." The top-left partition of Figure 13 vividly illustrates this problematic effect: the gradient pushes the codebook vector "left," while the points in that region are paradoxically pushed in the opposite direction of the gradient.
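The sign reversal can be reproduced in a small two-dimensional sketch with unit-norm `e` and `q`. The values below are illustrative assumptions, and `transport` is a hypothetical helper that applies the transpose of the forward Jacobian to a column-vector gradient.

```python
import math


def householder(r):
    """Householder reflection I - 2 r r^T for a unit vector r in 2-D."""
    return [[(1.0 if i == j else 0.0) - 2.0 * r[i] * r[j] for j in range(2)]
            for i in range(2)]


def rot2d(phi):
    """Counter-clockwise planar rotation by phi radians."""
    c, s = math.cos(phi), math.sin(phi)
    return [[c, -s], [s, c]]


def transport(grad_q, M):
    """Move a gradient from q to e: grad_e = M^T grad_q."""
    return [M[0][0] * grad_q[0] + M[1][0] * grad_q[1],
            M[0][1] * grad_q[0] + M[1][1] * grad_q[1]]


phi = 0.2                                   # small angle, so e is close to q
q = [1.0, 0.0]
e = [math.cos(phi), math.sin(phi)]

diff = [e[0] - q[0], e[1] - q[1]]
n = math.hypot(*diff)
R_reflect = householder([diff[0] / n, diff[1] / n])  # reflects e onto q
R_rotate = rot2d(-phi)                               # rotates e onto q

parallel = [1.0, 0.0]     # gradient parallel to q
orthogonal = [0.0, 1.0]   # gradient orthogonal to q
```

Transporting `parallel` gives the same vector under both maps, whereas transporting `orthogonal` gives exactly opposite vectors: the reflection flips the component orthogonal to `q` even though `e ≈ q`.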
We experimentally evaluated this effect following the VQ-VAE evaluation paradigm from Table 1 and the VQGAN evaluation paradigm from Table 3. Even without training these models to full convergence due to GPU resource limitations, both paradigms consistently exhibited poor and unstable convergence when trained with the Reflection Trick. Specifically, after just one epoch, the validation loss was approximately 3x higher than that achieved by the Rotation Trick for both 8192 and 16384 codebook VQGANs (as in Table 3). For the Euclidean codebook model with latent shape 64 × 64 × 3 (as in Table 1), the validation loss was approximately 2x higher than the Rotation Trick after only 15 epochs. These results strongly suggest that the Reflection Trick introduces severe training instabilities and is not a viable alternative to the Rotation Trick.

Section: A.9 GRADIENT NORM SCALING IN THE ROTATION TRICK
In this section, we delve into a detailed analysis of the `∥q∥/∥e∥` term within the Rotation Trick. While this norm rescaling is inherently necessary to transform `e` into `q` during the forward pass, one could potentially bypass this multiplicative factor by formulating the Rotation Trick additively as:
$$\tilde{q} = \underbrace{R}_{\text{constant}}\, e + \underbrace{(q - Re)}_{\text{constant}}$$
A plausible benefit of this alternative formulation is that `∂q̃/∂e = R`, which is an orthogonal transformation with a determinant of one. Such a transformation would neither shrink nor expand the vector space by a factor of `∥q∥/∥e∥`. In this section, we meticulously analyze the differences between these two approaches and present both as specific instantiations within a more general family of rotation-based gradient approximations.
A.9.1 COMPARISON BETWEEN ∥q∥/∥e∥ AND (q − Re)
A fundamental inductive bias in vector quantization is that when `e ≈ q`, then `∇ e L ≈ ∇ q L`. In simpler terms, when the distortion between `e` and `q` is minimal, the gradients for both `e` and `q` should be approximately equivalent. However, a problematic scenario arises when `∥e∥ ≈ 0` and a Euclidean metric is used for codebook lookup: the angle between `e` and `q` can become obtuse, as illustrated in Figure 6. In such an instance, the Rotation Trick, without proper scaling, could cause the gradient `∇ e L` to "over-rotate" and point away from `∇ q L`.
The inclusion of the `∥q∥/∥e∥` gradient scaling factor effectively mitigates this issue. When `∥e∥ ≈ 0` and `∥e∥ < ∥q∥`, the norm of the gradient `∇_e L` is scaled up, actively pushing `e` away from the origin. This outward push increases the influence of the angle between `e` and `q` when computing the Euclidean distance:
$$\|e - q\| = \sqrt{\|e\|^2 + \|q\|^2 - 2\|e\|\|q\|\cos\theta}$$
Consequently, as `∥e∥` increases, `e` becomes more likely to map to a different `q` that forms an acute angle with it, promoting more stable and desirable gradient behavior.
Conversely, consider the case where `∥q∥ ≈ 0` and `∥e∥ > ∥q∥`. In this scenario, the update to `e` would effectively vanish because `∥q∥/∥e∥ ≈ 0`. This behavior can also be beneficial, as a `q` close to the origin has a higher likelihood of forming an obtuse angle with `e`, which, as discussed, is an undesirable configuration.
We further investigate this factor through ablation experiments on VQ-VAEs and VQGANs. Table 6, which mirrors Table 1, summarizes our findings for VQ-VAEs, while Table 7, mirroring Table 3, presents results for VQGANs used in latent diffusion. In Table 6, we observe no significant difference in performance between using `q̃ = (∥q∥/∥e∥) Re` and `q̃ = Re + (q − Re)` for VQ-VAE models. However, for the VQGAN results in Table 7, we find that employing the `∥q∥/∥e∥` gradient scaling factor modestly but consistently improves performance, suggesting its importance in more complex generative models.
Table 6: Comparison of the Rotation Trick using `q̃ = (∥q∥/∥e∥) Re` with using `q̃ = Re + (q − Re)` for VQ-VAE models. The experimental setting follows Table 1.
Table 7: Comparison of the Rotation Trick using `q̃ = (∥q∥/∥e∥) Re` with using `q̃ = Re + (q − Re)` for VQGAN models. The models with codebook size 8192 were stopped after 2 epochs while the models with codebook size 16384 were stopped after 3 epochs.
A.9.2 GENERAL FAMILY OF ROTATION-BASED GRADIENT ESTIMATORS
Generalizing the additive and multiplicative formulations of the Rotation Trick, we propose a more comprehensive family of rotation-based gradient estimators:
$$\tilde{q} = \gamma(e) R e + (q - \gamma(e) R e)$$
where `γ(e)` is a function that determines the multiplicative scaling factor. In this framework, `q̃ = (∥q∥/∥e∥) Re` corresponds to `γ(e) = ∥q∥/∥e∥`, and `q̃ = Re + (q − Re)` corresponds to `γ(e) = 1`. This generalized formulation opens avenues for exploring other scaling factors, such as:
γ(e) = 1 8∥q -e∥ 2
We visualize the gradient fields for different formulations of `γ(e)` in Figure 14, showcasing the diversity of gradient behaviors achievable within this family.
It is highly probable that other, as yet unexplored, formulations of `γ(e)` could further enhance the training dynamics or performance of VQ-VAEs. Particularly exciting directions for future work include a priori fixing `γ(e)` to satisfy specific inductive biases or developing adaptive scaling factors that dynamically adjust `γ(e)` throughout training, akin to adaptive task weighting functions in multi-task learning (Kendall et al., 2018; Chen et al., 2018).
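The family can be sketched as a single function parameterized by the value of `γ(e)`. This is a two-dimensional illustration under stated assumptions: the names `estimator` and `rotation` are hypothetical, `R` is built from polar angles, and `γ(e)`, `R`, and the additive correction are all treated as constants in the backward pass, so the Jacobian is `γ(e) R`. Only the two scaling factors discussed above are exercised.

```python
import math


def rotation(e, q):
    """2-D rotation matrix that takes the direction of e to the direction of q."""
    phi = math.atan2(q[1], q[0]) - math.atan2(e[1], e[0])
    c, s = math.cos(phi), math.sin(phi)
    return [[c, -s], [s, c]]


def estimator(e, q, gamma):
    """Forward value and backward Jacobian of q_tilde = gamma R e + (q - gamma R e)."""
    R = rotation(e, q)
    Re = [R[0][0] * e[0] + R[0][1] * e[1], R[1][0] * e[0] + R[1][1] * e[1]]
    value = [gamma * Re[i] + (q[i] - gamma * Re[i]) for i in range(2)]
    jacobian = [[gamma * R[i][j] for j in range(2)] for i in range(2)]
    return value, jacobian


e = [0.5, 0.0]
q = [0.0, 2.0]
norm_ratio = math.hypot(*q) / math.hypot(*e)     # gamma(e) = ||q|| / ||e||

value_mult, jac_mult = estimator(e, q, norm_ratio)
value_add, jac_add = estimator(e, q, 1.0)        # gamma(e) = 1
```

Both choices return `q` in the forward pass; they differ only in the factor by which the backward Jacobian stretches or shrinks the gradient.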

Section: A.10 TRAINING SETTINGS
We detail the training settings used in our experimental analysis in Section 5. While a text description can be helpful for understanding the experimental settings, our released code should be referenced to fully reproduce the results presented in this work.
A.10.1 VQ-VAE EVALUATION.
Table 8 summarizes the hyperparameters used for the experiments in Section 5.1. For the encoder and decoder architectures, we use the Convolutional Neural Network described by Esser et al. (2021).
The hyperparameters for the cosine similarity codebook lookup follow from Yu et al. (2021) and the hyperparameters for the Euclidean distance codebook lookup follow from the default values set in the Vector Quantization library from https://github.com/lucidrains/vector-quantize-pytorch. All models replace the codebook loss with the exponential moving average described in Van Den Oord et al. (2017) with decay = 0.8. The notation for both encoder and decoder architectures is adapted from Esser et al. (2021).
For the Gumbel VQ-VAE baseline, we follow the implementation of https://github.com/karpathy/deep-vector-quantization and use the suggested schedule to attenuate the softmax temperature from 1.0 to 1/16 over the course of training. Aside from the difference in quantization, i.e. deterministic [...]
[...] work and do not claim that improving reconstruction, codebook usage, or quantization error in "Stage 1" VQ-VAE training will lead to improvements in "Stage 2" generative modeling applications.
While poor reconstruction performance will clearly lead to poor generative modeling, recent work (Yu et al., 2023) suggests that, at least for autoregressive modeling of codebook sequences with MaskGit (Chang et al., 2022), the connection between VQ-VAE reconstruction performance and downstream generative modeling performance is non-linear. Specifically, increasing the size of the codebook past a certain amount will improve VQ-VAE reconstruction performance but make downstream likelihood-based generative modeling of codebook vectors more difficult.
We believe this nuance may extend beyond MaskGit, and that the desiderata for likelihood-based generative models will likely differ from those for score-based generative models like diffusion.
It is even possible that different preferences appear within the same class. For example, left-to-right autoregressive modeling of codebook elements with Transformers (Vaswani, 2017) may exhibit different preferences for Stage 1 VQ-VAE models than those of MaskGit.
These topics deserve a deep, and rich, analysis that we would find difficult to include within this work as our focus is on propagating gradients through vector quantization layers. As a result, we entrust the exploration of these questions to future work.

Section: A.12 GRADIENT ESTIMATORS AS PARALLEL TRANSPORT
In this section, we analyze the STE and the rotation trick through the lens of differential geometry, specifically as the parallel transport of the gradient vector ∇_q L from the codeword q to the encoder output e. Throughout this section, we only consider the rotational component R_θ of the rotation trick, not the rescaling by ∥q∥/∥e∥.
A.12.1 BACKGROUND ON HYPERSPHERICAL COORDINATES
Hyperspherical coordinate systems are ubiquitous in applications of math and physics, where certain formulas become greatly simplified by parameterizing the location of points by the radius and angles to coordinate axes. A familiar instantiation of the hyperspherical coordinate system is polar coordinates with radial component r and polar angle θ:
$$x = r\cos\theta, \qquad y = r\sin\theta$$
or, in three dimensions, spherical coordinates with radial component r, polar angle θ, and azimuthal angle ϕ.
We outline one common conversion from hyperspherical coordinates to Cartesian coordinates below; other conversions are equivalent up to permutation of the coordinate axes:
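$$\begin{aligned}
x_1 &= r\cos(\theta_1) \\
x_2 &= r\sin(\theta_1)\cos(\theta_2) \\
x_3 &= r\sin(\theta_1)\sin(\theta_2)\cos(\theta_3) \\
&\;\;\vdots \\
x_{d-1} &= r\sin(\theta_1)\cdots\sin(\theta_{d-2})\cos(\theta_{d-1}) \\
x_d &= r\sin(\theta_1)\cdots\sin(\theta_{d-2})\sin(\theta_{d-1})
\end{aligned}$$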
and the reverse transform from Cartesian coordinates to hyperspherical coordinates:
$$\begin{aligned}
r &= \sqrt{(x_1)^2 + (x_2)^2 + \dots + (x_d)^2} \\
\theta_1 &= \operatorname{arctan2}\left(\sqrt{(x_d)^2 + \dots + (x_2)^2},\; x_1\right) \\
\theta_2 &= \operatorname{arctan2}\left(\sqrt{(x_d)^2 + \dots + (x_3)^2},\; x_2\right) \\
&\;\;\vdots \\
\theta_{d-2} &= \operatorname{arctan2}\left(\sqrt{(x_d)^2 + (x_{d-1})^2},\; x_{d-2}\right) \\
\theta_{d-1} &= \operatorname{arctan2}\left(\sqrt{(x_d)^2},\; x_{d-1}\right)
\end{aligned}$$
where arctan2(a, b) returns, in radians over the range (-π, π], the angle between the positive horizontal axis and the point with horizontal coordinate b and vertical coordinate a.
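The two conversions can be sketched as a round trip in plain Python. The function names are hypothetical. One deliberate deviation is an assumption of this sketch: the last angle is computed with the signed form `atan2(x_d, x_{d-1})`, a common convention, so that the sign of `x_d` survives the round trip.

```python
import math


def to_cartesian(r, thetas):
    """Map (r, theta_1, ..., theta_{d-1}) to Cartesian coordinates (x_1, ..., x_d)."""
    xs = []
    prefix = r  # running product r * sin(theta_1) * ... * sin(theta_{k-1})
    for t in thetas:
        xs.append(prefix * math.cos(t))
        prefix *= math.sin(t)
    xs.append(prefix)  # x_d = r * sin(theta_1) * ... * sin(theta_{d-1})
    return xs


def to_hyperspherical(xs):
    """Map Cartesian coordinates (x_1, ..., x_d), d >= 2, to (r, [theta_1, ..., theta_{d-1}])."""
    r = math.sqrt(sum(x * x for x in xs))
    thetas = []
    for i in range(len(xs) - 2):
        tail = math.sqrt(sum(x * x for x in xs[i + 1:]))
        thetas.append(math.atan2(tail, xs[i]))    # lies in [0, pi]
    thetas.append(math.atan2(xs[-1], xs[-2]))     # signed last angle (assumption)
    return r, thetas
```

For `d = 2` the loop is skipped and the functions reduce to the usual polar coordinates.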
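Unlike the Cartesian coordinate system, the hyperspherical basis vectors are not identical over the entire space; they change with position. For instance, moving outwards along r will increase the length of ∂/∂θ_i, as an infinitesimal change in θ_i will now cover a larger arclength distance (i.e. the line segment traveled by changing the angle θ_i) than that same infinitesimal change with a smaller r. This effect is visualized for three dimensions in Figure 19.
At any given point in hyperspherical coordinates p, the transformation from Cartesian basis vectors to hyperspherical basis vectors is:
$$\frac{\partial}{\partial\theta_i} = \sum_{k=1}^{d} \frac{\partial x_k}{\partial\theta_i}\frac{\partial}{\partial x_k}$$
where ∂x_k/∂θ_i can be computed from the coordinate transform functions, e.g. x_1 = r cos(θ_1). It is typical to express these relationships in a matrix that transforms an arbitrary vector v in Cartesian coordinates at point p to its counterpart in hyperspherical coordinates ṽ at p:
$$\begin{bmatrix} \frac{\partial}{\partial x_1} & \frac{\partial}{\partial x_2} & \cdots & \frac{\partial}{\partial x_d} \end{bmatrix}
\underbrace{\begin{bmatrix}
\frac{\partial x_1}{\partial r} & \frac{\partial x_1}{\partial\theta_1} & \cdots & \frac{\partial x_1}{\partial\theta_{d-1}} \\
\frac{\partial x_2}{\partial r} & \frac{\partial x_2}{\partial\theta_1} & \cdots & \frac{\partial x_2}{\partial\theta_{d-1}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial x_d}{\partial r} & \frac{\partial x_d}{\partial\theta_1} & \cdots & \frac{\partial x_d}{\partial\theta_{d-1}}
\end{bmatrix}}_{\text{The Jacobian } J}
= \begin{bmatrix} \frac{\partial}{\partial r} & \frac{\partial}{\partial\theta_1} & \cdots & \frac{\partial}{\partial\theta_{d-1}} \end{bmatrix}$$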
As illustrated in Figure 19, J does not necessarily have determinant equal to one and changes as a function of position, so the norms of the basis vectors spanning the hyperspherical tangent space change based on position. More generally, this notion of distance is captured by the line element: the length of a line segment resulting from an infinitesimal change along the coordinate axes. The Cartesian line element is given by:
$$ds^2 = (dx_1)^2 + (dx_2)^2 + \dots + (dx_d)^2$$
while the hyperspherical line element is:
$$ds^2 = dr^2 + r^2 (d\theta_1)^2 + r^2\sin^2\theta_1\,(d\theta_2)^2 + \dots + r^2\left(\prod_{i=1}^{d-2}\sin^2\theta_i\right)(d\theta_{d-1})^2$$
which reflects that distance traveled by small changes in the hyperspherical coordinates "increases" with increasing radius and "decreases" with distance from the equator. To ensure that the norm of the basis vectors does not change during conversion, it is common to renormalize hyperspherical basis vectors to have unit norm for all points. However, a notion of norm is not defined a priori for hyperspherical vectors; the metric tensor imposed on this space defines the inner product which in turn defines a sense of arclength.
Using the induced metric from Cartesian coordinates, we can inherit the inner product from Cartesian coordinates on the hyperspherical coordinate system by expressing hyperspherical basis vectors as a linear combination of Cartesian basis vectors and then computing the norm of this resulting vector in the Cartesian tangent space:
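$$\left\|\frac{\partial}{\partial\theta_i}\right\| = \sqrt{\left\langle \frac{\partial}{\partial\theta_i}, \frac{\partial}{\partial\theta_i} \right\rangle} = \sqrt{\left(\sum_{k=1}^{d}\frac{\partial x_k}{\partial\theta_i}\frac{\partial}{\partial x_k}\right)\cdot\left(\sum_{j=1}^{d}\frac{\partial x_j}{\partial\theta_i}\frac{\partial}{\partial x_j}\right)} = \sqrt{\sum_{k=1}^{d}\frac{\partial x_k}{\partial\theta_i}\frac{\partial x_k}{\partial\theta_i}\,\frac{\partial}{\partial x_k}\cdot\frac{\partial}{\partial x_k}} = \sqrt{\sum_{k=1}^{d}\left(\frac{\partial x_k}{\partial\theta_i}\right)^2}$$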
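The first fundamental form gives us the normalization constants:
$$I = \begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
0 & r^2 & 0 & \cdots & 0 \\
0 & 0 & r^2\sin^2\theta_1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & r^2\prod_{i=1}^{d-2}\sin^2\theta_i
\end{bmatrix}$$
as the diagonal represents the inner product ⟨∂/∂θ_i, ∂/∂θ_i⟩, and we would like to renormalize each basis vector to have unit norm:
$$\left\|\frac{\partial}{\partial\theta_i}\right\| = \sqrt{\left\langle \frac{\partial}{\partial\theta_i}, \frac{\partial}{\partial\theta_i} \right\rangle}$$
Therefore, our normalized hyperspherical basis vectors ∂/∂r̂, ∂/∂θ̂_1, ... become:
$$\frac{\partial}{\partial\hat{r}} = \frac{\partial}{\partial r}, \qquad \frac{\partial}{\partial\hat{\theta}_i} = (I_{ii})^{-\frac{1}{2}}\frac{\partial}{\partial\theta_i}$$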
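Using our convention from earlier, we can now compute the transformation from Cartesian basis vectors to normalized hyperspherical basis vectors:
$$\frac{\partial}{\partial\hat{\theta}_i} = (I_{ii})^{-\frac{1}{2}}\sum_{k=1}^{d}\frac{\partial x_k}{\partial\theta_i}\frac{\partial}{\partial x_k}$$
to compose the normalized "Jacobian" Ĵ:
$$\begin{bmatrix} \frac{\partial}{\partial x_1} & \frac{\partial}{\partial x_2} & \cdots & \frac{\partial}{\partial x_d} \end{bmatrix}
\underbrace{\begin{bmatrix}
\frac{\partial x_1}{\partial\hat{r}} & \frac{\partial x_1}{\partial\hat{\theta}_1} & \cdots & \frac{\partial x_1}{\partial\hat{\theta}_{d-1}} \\
\frac{\partial x_2}{\partial\hat{r}} & \frac{\partial x_2}{\partial\hat{\theta}_1} & \cdots & \frac{\partial x_2}{\partial\hat{\theta}_{d-1}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial x_d}{\partial\hat{r}} & \frac{\partial x_d}{\partial\hat{\theta}_1} & \cdots & \frac{\partial x_d}{\partial\hat{\theta}_{d-1}}
\end{bmatrix}}_{\hat{J} \in SO(d)}
= \begin{bmatrix} \frac{\partial}{\partial\hat{r}} & \frac{\partial}{\partial\hat{\theta}_1} & \cdots & \frac{\partial}{\partial\hat{\theta}_{d-1}} \end{bmatrix} \tag{1}$$
Rescaling the hyperspherical basis vectors to have unit norm at all points turns the matrix J into Ĵ, an orthogonal matrix with determinant equal to one. This set of d × d matrices belongs to the group SO(d), which represents the set of d-dimensional rotations about the origin. Similarly, the backwards change-of-basis Ĵ^{-1} = Ĵ^T converts vectors in hyperspherical coordinates to Cartesian coordinates.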
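For three dimensions, these properties can be verified numerically. The sketch below hard-codes the Jacobian of `x_1 = r cos θ_1`, `x_2 = r sin θ_1 cos θ_2`, `x_3 = r sin θ_1 sin θ_2`, rescales its columns to unit norm, and exposes helpers to check orthogonality and the determinant. Function names and sample values are illustrative assumptions.

```python
import math


def spherical_jacobian(r, t1, t2):
    """Jacobian of (x1, x2, x3) with respect to (r, theta_1, theta_2); columns are basis vectors."""
    s1, c1 = math.sin(t1), math.cos(t1)
    s2, c2 = math.sin(t2), math.cos(t2)
    return [
        [c1,      -r * s1,      0.0],
        [s1 * c2,  r * c1 * c2, -r * s1 * s2],
        [s1 * s2,  r * c1 * s2,  r * s1 * c2],
    ]


def column_norms(J):
    """Euclidean norm of each column of a 3x3 matrix."""
    return [math.sqrt(sum(J[i][j] ** 2 for i in range(3))) for j in range(3)]


def normalize_columns(J):
    """Rescale every column to unit norm, giving the normalized Jacobian."""
    norms = column_norms(J)
    return [[J[i][j] / norms[j] for j in range(3)] for i in range(3)]


def gram(M):
    """M^T M for a 3x3 matrix; equals the identity when M is orthogonal."""
    return [[sum(M[k][i] * M[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))


r, t1, t2 = 2.5, 0.7, -1.1
J = spherical_jacobian(r, t1, t2)
J_hat = normalize_columns(J)
```

The column norms of `J` come out as `1`, `r`, and `r sin θ_1`, the square roots of the diagonal of the first fundamental form, while `J_hat` has an identity Gram matrix and unit determinant.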
As a result, vectors from the tangent space at p in Cartesian coordinates simply rotate to convert to the normalized tangent space at p in hyperspherical coordinates. Specifically, for a point p = (r, θ_1, θ_2, ..., θ_{d-1}) and a vector ṽ = c̃_1 r̂ + c̃_2 θ̂_1 + ... + c̃_d θ̂_{d-1}, converting v = c_1 x_1 + ... + c_d x_d from Cartesian to hyperspherical coordinates is the transformation:
$$\tilde{v} = \hat{J}^T v$$
where Ĵ operates on vector v (i.e. Ĵv) by first rotating by angle c̃_2 in the x_1-x_2 plane (i.e. the θ_1 axis of rotation), then by angle c̃_3 in the x_2-x_3 plane (i.e. the θ_2 axis of rotation), and so on until a final rotation by angle c̃_d in the x_{d-1}-x_d plane (i.e. the θ_{d-1} axis of rotation). Composing these rotations together leads to a rotation from p̃_0 = (1, 0, 0, ..., 0) to p̃:
$$\begin{aligned}
\hat{J} v &= (R_{\tilde{p}_0 \to \tilde{p}})\, v = \left(R^{x_{d-1}-x_d}_{\theta_{d-1}} \cdots R^{x_2-x_3}_{\theta_2} R^{x_1-x_2}_{\theta_1}\right) v \\
\hat{J}^{-1} v &= \hat{J}^T v = (R_{\tilde{p}_0 \to \tilde{p}})^T v = \left(R^{x_{d-1}-x_d}_{\theta_{d-1}} \cdots R^{x_2-x_3}_{\theta_2} R^{x_1-x_2}_{\theta_1}\right)^T v = R_{\tilde{p} \to \tilde{p}_0}\, v
\end{aligned}$$
where we define R_{ã→b̃} to be the rotation from ã to b̃ as described above and R^{x_i-x_j}_{θ_i} to be the rotation by angle θ_i in the x_i-x_j plane. Important for our later discussion on the rotation trick, this rotational characteristic causes moving a fixed vector along a curve in hyperspherical coordinates to rotate in Cartesian coordinates.
Remark 4. Using the renormalized transformation in Equation (1), a constant vector field ṽ in hyperspherical coordinates corresponds to a rotated vector field in Cartesian coordinates.
Proof. At Cartesian point p and corresponding hyperspherical point p̃:
$$\begin{aligned}
v_p^T \left[R^{x_{d-1}-x_d}_{\theta_{d-1}} \cdots R^{x_2-x_3}_{\theta_2} R^{x_1-x_2}_{\theta_1}\right] &= \tilde{v}_{\tilde{p}}^T \\
\left[R^{x_{d-1}-x_d}_{\theta_{d-1}} \cdots R^{x_2-x_3}_{\theta_2} R^{x_1-x_2}_{\theta_1}\right]^T v_p &= \tilde{v}_{\tilde{p}} \\
\left[R_{\tilde{p} \to \tilde{p}_0}\right] v_p &= \tilde{v}_{\tilde{p}}
\end{aligned}$$
so a constant vector field ṽ in hyperspherical coordinates will correspond to a Cartesian vector field where each vector at point p is rotated by the rotation that aligns p̃ to p̃_0.
Another important characteristic relates to the metric tensor with normalized hyperspherical basis vectors. We can explicitly compute the induced metric in hyperspherical coordinates in terms of our renormalized basis vectors:
$$\begin{aligned}
\hat{I} &= \operatorname{diag}\left(\frac{\partial}{\partial\hat{r}}\cdot\frac{\partial}{\partial\hat{r}},\; \frac{\partial}{\partial\hat{\theta}_1}\cdot\frac{\partial}{\partial\hat{\theta}_1},\; \frac{\partial}{\partial\hat{\theta}_2}\cdot\frac{\partial}{\partial\hat{\theta}_2},\; \dots,\; \frac{\partial}{\partial\hat{\theta}_{d-1}}\cdot\frac{\partial}{\partial\hat{\theta}_{d-1}}\right) \\
&= \operatorname{diag}\left((I_{11})^{-1}\frac{\partial}{\partial r}\cdot\frac{\partial}{\partial r},\; (I_{22})^{-1}\frac{\partial}{\partial\theta_1}\cdot\frac{\partial}{\partial\theta_1},\; (I_{33})^{-1}\frac{\partial}{\partial\theta_2}\cdot\frac{\partial}{\partial\theta_2},\; \dots,\; (I_{dd})^{-1}\frac{\partial}{\partial\theta_{d-1}}\cdot\frac{\partial}{\partial\theta_{d-1}}\right) \\
&= \operatorname{diag}\left((I_{11})^{-1}(I_{11}),\; (I_{22})^{-1}(I_{22}),\; (I_{33})^{-1}(I_{33}),\; \dots,\; (I_{dd})^{-1}(I_{dd})\right) \\
&= \operatorname{diag}(1, 1, 1, \dots, 1)
\end{aligned} \tag{2}$$
which yields the identity matrix. This is perhaps unsurprising: we normalize basis vectors so that ∂/∂θ̂_i · ∂/∂θ̂_i = 1.
Another way to view the renormalized tangent plane transformation is as a change-of-basis in the Cartesian coordinate system, with the basis vectors spanning each tangent space in Cartesian coordinates rotated to align with the directions of hyperspherical basis vectors. Rotating the tangent space at each point in Cartesian coordinates does not change the Euclidean metric tensor (R^T I R = I), so it remains the identity. These two formulations are equivalent: renormalizing the basis vectors in the hyperspherical tangent space to have unit norm corresponds to rotating the basis vectors in the Cartesian coordinate system to align with the hyperspherical basis vectors at all points.

Section: A.12.2 STE AS PARALLEL TRANSPORT
From the description of the STE in Bengio et al. (2013), a gradient vector $\nabla_q L$ is transported from $q$ to $e$ during the backward pass in such a way that its direction and magnitude are preserved. Critically, the curve along which $\nabla_q L$ is transported is not specified; the effect is to simply "copy-and-paste" the vector from $q$ to $e$.
To use the machinery of calculus, we assume that $\nabla_q L$ is transported from $q$ to $e$ along any smooth curve $\gamma(t)$ running from $q$ to $e$. Along this curve, we define the transport of $\nabla_q L$ at position $\gamma(t)$ simply as $\nabla_q L$, emulating how the STE would move $\nabla_q L$ from $q$ to $\gamma(t)$. Therefore, the direction and magnitude of $\nabla_q L$ do not change along the curve $\gamma(t)$. An example of this transport is visualized in Figure 20, and in Remark 5, we show this formulation is equivalent to the parallel transport of $\nabla_q L$ along any curve $\gamma(t)$ from $q$ to $e$ with the Levi-Civita connection.
Remark 5. The Straight Through Estimator (STE) is equivalent to the parallel transport of $\nabla_q L$ along any curve connecting $q$ to $e$ with the identity metric tensor in Cartesian coordinates using the Levi-Civita connection.
Proof. Parallel transport of a vector field $v$ along $\gamma(t)$ requires its covariant derivative along the curve to vanish, which accounts for how the basis vectors of the tangent plane change along $\gamma(t)$ to remain "parallel" along the curve:
$$\nabla_{\dot{\gamma}(t)} v = \vec{0} \qquad \text{(Parallel Transport Condition)}$$
Using the identity metric tensor:
$$g_{ij} = \delta_{ij} = \begin{cases} 0 & \text{if } i \neq j \\ 1 & \text{if } i = j \end{cases}$$
with the Levi-Civita connection will result in all zero Christoffel symbols:
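$$\Gamma^{m}_{ij} = \frac{1}{2} g^{mk} \left( \frac{\partial g_{jk}}{\partial x^i} + \frac{\partial g_{ik}}{\partial x^j} - \frac{\partial g_{ij}}{\partial x^k} \right) = 0$$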
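where $g^{mk}$ is the $(m, k)$ entry of the inverse metric tensor. Computing the covariant derivative for a general curve $\gamma(t)$:
$$\nabla_{\dot{\gamma}(t)} v = \nabla_{\dot{\gamma}^1 e_1 + \dot{\gamma}^2 e_2 + \ldots + \dot{\gamma}^d e_d} \left( v^1 e_1 + v^2 e_2 + \ldots + v^d e_d \right)$$
which must be equal to 0 for parallel transport. Considering the $i$-th term in this summation: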
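$$
\begin{aligned}
0 = \left( \nabla_{\dot{\gamma}(t)} v \right)_i &= \dot{\gamma}^i \frac{\partial}{\partial x^i} \left( v^1 e_1 + v^2 e_2 + \ldots + v^d e_d \right) \\
&= \dot{\gamma}^i \frac{\partial}{\partial x^i} \left( v^k e_k \right) \\
&= \dot{\gamma}^i \left[ \frac{\partial v^k}{\partial x^i} e_k + v^k \frac{\partial e_k}{\partial x^i} \right] \\
&= \dot{\gamma}^i \left[ \frac{\partial v^k}{\partial x^i} e_k + v^k \Gamma^{m}_{ik} e_m \right] \\
&= \dot{\gamma}^i \frac{\partial v^k}{\partial x^i} e_k
\end{aligned}
$$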
For this equation to hold for an arbitrary $\gamma(t)$, we need $\frac{\partial v^k}{\partial x^i} = 0$ for $1 \le k, i \le d$. Therefore, each $v^k$ must be constant, and vector fields along curves must be constant to satisfy the parallel transport criterion.
Pulling this back to the STE, holding $\nabla_q L$ constant along the curve $\gamma(t)$ from $q$ to $e$ results in a constant vector field along $\gamma(t)$. The covariant derivative of this vector field is zero, and therefore the STE parallel transports $\nabla_q L$ from $q$ to $e$.
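The vanishing of the Christoffel symbols for the identity metric can also be checked numerically. The sketch below evaluates the Levi-Civita formula for $\Gamma^{m}_{ij}$ with central finite differences; the helper names are hypothetical, and the unnormalized 2-dimensional polar metric $\operatorname{diag}(1, r^2)$ is included only as a contrasting case in which the symbols do not vanish.

```python
import numpy as np

def christoffel(metric_fn, x, eps=1e-5):
    """Return gamma[m, i, j] = Gamma^m_ij of the Levi-Civita connection at x.
    metric_fn maps a point to its (d, d) metric tensor; derivatives are
    approximated with central finite differences."""
    x = np.asarray(x, dtype=float)
    d = len(x)
    g_inv = np.linalg.inv(metric_fn(x))
    dg = np.zeros((d, d, d))  # dg[k, i, j] = d g_ij / d x^k
    for k in range(d):
        step = np.zeros(d)
        step[k] = eps
        dg[k] = (metric_fn(x + step) - metric_fn(x - step)) / (2 * eps)
    gamma = np.zeros((d, d, d))
    for m in range(d):
        for i in range(d):
            for j in range(d):
                gamma[m, i, j] = 0.5 * sum(
                    g_inv[m, k] * (dg[i, j, k] + dg[j, i, k] - dg[k, i, j])
                    for k in range(d)
                )
    return gamma

def identity_metric(x):
    """Euclidean metric in Cartesian coordinates."""
    return np.eye(len(x))

def polar_metric(x):
    """Unnormalized polar metric diag(1, r^2), with x = (r, theta)."""
    return np.diag([1.0, x[0] ** 2])

point = np.array([2.0, 0.5])
gamma_cartesian = christoffel(identity_metric, point)  # all zeros
gamma_polar = christoffel(polar_metric, point)         # Gamma^r_{theta theta} = -r
```

With all symbols equal to zero, the parallel transport equation reduces to holding the components of the vector fixed, which is the copy-and-paste behavior of the STE.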

Section: A.12.3 THE ROTATION TRICK AS PARALLEL TRANSPORT
In this section, we analyze the rotation trick through the lens of geometry. As in Appendix A.12.2, we extend the rotation trick to any smooth curve $\gamma(t)$ connecting $q$ to $e$ and define the transport of $\nabla_q L$ at $\gamma(t)$ as the rotation trick applied to move $\nabla_q L$ from $q$ to $\gamma(t)$. This definition allows us to use the structure of calculus without imposing any prohibitive restrictions on the path taken from $q$ to $e$.
To build visual intuition, Figure 21 illustrates how the rotation trick transforms an initial vector along three different curves $\gamma_1, \gamma_2, \gamma_3$ in both Cartesian coordinates and hyperspherical coordinates with normalized basis vectors. In Cartesian coordinates, the rotation trick changes the components of the vector during transport so that it follows a rotation; in normalized hyperspherical coordinates, however, the components of this vector are constant during transport because the basis vectors themselves rotate.
Remark 6. The rotation trick is equivalent to the parallel transport of $\nabla_q L$ along any curve connecting $q$ to $e$ with the induced metric in hyperspherical coordinates with the normalized transformation described in Equation (1) using the Levi-Civita connection.
Proof. From Equation (2), the metric tensor in hyperspherical coordinates with normalized basis vectors (equivalently, the Cartesian coordinate system with each tangent space rotated to align with the hyperspherical frame at every point) is the identity. Therefore, using the Levi-Civita connection leads to zero Christoffel symbols, and the parallel transport of a vector along any curve keeps the vector constant.
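We define $T_p C$ as the tangent space of the Cartesian coordinate system at point $p$ and $T_p H$ as the tangent space of the hyperspherical coordinate system with normalized basis vectors at point $p$. It remains to show that for a vector $\nabla_q L \in T_q C$ and corresponding $\tilde{\nabla}_q L \in T_q H$, the transformation of $\tilde{\nabla}_q L \in T_{\tilde{e}} H$ to $T_e C$ will yield $R_{q \to e} \nabla_q L$, where $R_{q \to e}$ is the rotation trick's transformation, i.e. the rotation that rotates $q$ to $e$.
For a vector $\tilde{\nabla}_q L$ in hyperspherical coordinates at point $q = (1, \theta_1, \theta_2, \ldots, \theta_{d-1})$ and using the normalized change of basis in Equation (1), the corresponding vector $\nabla_q L$ in Cartesian coordinates is:
$$
\begin{aligned}
\nabla_q L^T &= \tilde{\nabla}_q L^T \hat{J}_q^{-1} \\
\nabla_q L &= \hat{J}_q \tilde{\nabla}_q L \\
&= \left[ R_{p_0 \to q} \right] \tilde{\nabla}_q L \\
&= R_{\theta_{d-1}} R_{\theta_{d-2}} \cdots R_{\theta_1} \tilde{\nabla}_q L
\end{aligned}
$$
and the corresponding vector for $\tilde{\nabla}_q L$ at point $\tilde{e}$ is:
$$
\begin{aligned}
\nabla_e L^T &= \tilde{\nabla}_q L^T \hat{J}_e^{-1} \\
\nabla_e L &= \hat{J}_e \tilde{\nabla}_q L \\
&= \left[ R_{p_0 \to \tilde{e}} \right] \tilde{\nabla}_q L \\
&= \left[ R_{q \to \tilde{e}} R_{p_0 \to q} \right] \tilde{\nabla}_q L \\
&= R_{q \to \tilde{e}} \left( R_{p_0 \to q} \tilde{\nabla}_q L \right) \\
&= \left[ R_{q \to \tilde{e}} \right] \nabla_q L
\end{aligned}
$$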
which is exactly how the rotation trick transforms the vector. Informally, "copy-and-pasting" the vector $\tilde{\nabla}_q L$ from $q$ to $\tilde{e}$ in hyperspherical coordinates with normalized basis vectors corresponds to rotating $\nabla_q L$ by the rotation that aligns $q$ to $e$ in Cartesian coordinates.
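The correspondence established in the proof can be illustrated in two dimensions, where the normalized hyperspherical frame at a point is the polar frame. The sketch below copies the polar components of a gradient from $q$ to $e$ and compares the result with a direct rotation built as $I - 2rr^T + 2\hat{e}\hat{q}^T$, i.e. the closed form for $R$ with the roles of $\hat{q}$ and $\hat{e}$ exchanged so that it maps $q$ to $e$. The helper names and the numerical values of `q`, `e`, and `grad_q` are arbitrary illustrative choices.

```python
import numpy as np

def polar_frame(p):
    """Normalized polar basis at p; columns are the unit radial and angular directions."""
    theta = np.arctan2(p[1], p[0])
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

def rotation_between(a, b):
    """Rotation taking the direction of a to the direction of b,
    written as I - 2 r r^T + 2 b_hat a_hat^T with r the normalized bisector."""
    a_hat = a / np.linalg.norm(a)
    b_hat = b / np.linalg.norm(b)
    r = (a_hat + b_hat) / np.linalg.norm(a_hat + b_hat)
    return np.eye(len(a)) - 2 * np.outer(r, r) + 2 * np.outer(b_hat, a_hat)

q = np.array([1.5, 0.5])        # point where the gradient is known
e = np.array([-0.3, 1.1])       # point the gradient is transported to
grad_q = np.array([0.4, -0.9])  # gradient at q in Cartesian components

components = polar_frame(q).T @ grad_q     # express the gradient in the polar frame at q
transported = polar_frame(e) @ components  # copy the components, convert back at e
rotated = rotation_between(q, e) @ grad_q  # rotate directly by the q-to-e rotation
```

Both routes give the same Cartesian vector, and its norm equals that of `grad_q`, since only a rotation is applied.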
In summary, we consider a geometry where the tangent space is spanned by unit norm basis vectors $\frac{\partial}{\partial \hat{r}}, \frac{\partial}{\partial \hat{\theta}_1}, \ldots, \frac{\partial}{\partial \hat{\theta}_{d-1}}$ that match the direction of the typical hyperspherical basis vectors $\frac{\partial}{\partial r}, \frac{\partial}{\partial \theta_1}, \ldots, \frac{\partial}{\partial \theta_{d-1}}$. The induced metric tensor is the identity, so the parallel transport of a vector along any curve holds its components constant. Converting a vector $\nabla_q L$ to this tangent space via the normalized transformation in Equation (1), parallel transporting the resulting vector from $q$ to $\tilde{e}$, and then converting it back to Cartesian coordinates corresponds exactly to the rotation trick's transformation.
This is a remarkably simple result: the rotation trick and the STE can be viewed as the same operation. Both parallel transport the gradient $\nabla_q L$ from $q$ to $e$ in a path-independent manner with the Euclidean metric. The only difference is the coordinate system in which parallel transport occurs. The STE employs the Cartesian coordinate system, while the rotation trick uses the hyperspherical coordinate system with normalized basis vectors.

Section: ACKNOWLEDGMENTS
We extend our sincere gratitude to Henry Bosch, Benjamin Spector, Dan Biderman, Jordan Juravsky, Mayee Chen, Owen Dugan, Sabri Eyuboglu, and the entire Hazy Group for their invaluable feedback, insightful discussions, and dedicated assistance throughout the revision process of this work. We gratefully acknowledge the generous support from the National Institutes of Health (NIH) under Grant No. U54EB020405 (Mobilize); the National Science Foundation (NSF) under Grant Nos. CCF2247015 (Hardware-Aware), CCF1763315 (Beyond Sparsity), CCF1563078 (Volume to Velocity), and 1937301 (RTML); the U.S. DEVCOM Army Research Laboratory (ARL) under Grant Nos. W911NF-23-2-0184 (Long-context) and W911NF-21-2-0251 (Interactive Human-AI Teaming); the Office of Naval Research (ONR) under Grant Nos. N000142312633 (Deep Signal Processing); Stanford Human-Centered Artificial Intelligence (HAI) under Grant No. 247183; and corporate partners including NXP, Xilinx, LETI-CEA, Intel, IBM, Microsoft, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, Google Cloud, Salesforce, Total. We also thank the HAI-GCP Cloud Credits for Research program, the Stanford Data Science Initiative (SDSI), and the valued members of the Stanford DAWN project: Meta, Google, and VMWare. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of NIH, ONR, or the U.S. Government.

Section: A.10.2 VQGAN EVALUATION.
Table 9: Hyperparameters for the experiments in Table 2 and Table 3. ROT is an abbreviation for the Rotation Trick. We implement the Rotation Trick in the open-source https://github.com/CompVis/taming-transformers for the experiments in Table 2 and in https://github.com/CompVis/latent-diffusion for Table 3. In both settings, we use the default hyperparameters. †: 18 epochs for ImageNet and 50 epochs for FFHQ & CelebA-HQ.

Section: A.10.3 VIT-VQGAN EVALUATION.
Table 10: Hyperparameters for the experiments in Table 4.
Section: A.10.4 TIMESFORMER-VQGAN EVALUATION.
Table 11: Hyperparameters for the experiments in Table 5.
We also visualize the reconstructions for the TimeSformer-VQGAN trained with the Rotation Trick and the STE. Figure 17 shows the reconstructions for BAIR Robot Pushing, and Figure 18 shows the reconstructions for UCF101. In contrast to STE-trained models, which often exhibit training instability and codebook collapse, VQ-VAEs trained with the Rotation Trick do not manifest these issues. Instead, codebook usage is relatively high (43% for BAIR Robot Pushing and 30% for UCF101), and the reconstructions accurately match the input, even with relatively small video models for both encoder and decoder.

[Figure: video reconstruction panels. Columns: Original Video, Rotation Trick Reconstructions, STE Reconstructions.]

Section: A.10.5 CREATION OF VORONOI REGION FIGURE
In this section, we describe the creation of Figure 4 as well as the other figures that use this format. For the top-right and bottom partitions, we fix the codebook to a set of preset values and sample pre-quantized points from four different Gaussian distributions. For the pre-quantized points in the top-left partition, we manually set them to form a crescent shape around the codeword. We similarly fix constant gradient vectors for each partition and apply them to the pre-quantized points after transformation by the STE (i.e., simply moving the gradient to each pre-quantized point in the quantized region) or by the Rotation Trick (i.e., rotating the gradient based on the angle between the pre-quantized point and closest codebook vector and rescaling appropriately). We multiply the gradient by a small constant (the learning rate) and then apply the gradient to each pre-quantized point. We repeat the above 25 times, at each point re-computing the angle and magnitude between the pre-quantized point and the codebook vector for the Rotation Trick update. For simplicity, we do not update the codebook vectors themselves or recompute codebook regions throughout the numerical simulation.
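The procedure above can be sketched for a single codebook region as follows. This is a minimal illustration rather than the script used to produce Figure 4: the codebook vector, the Gaussian spread, the learning rate, the gradient, and the sign convention of the update (a descent step) are all assumed values, and the Rotation Trick gradient is taken to be $\frac{\|q\|}{\|e\|} R^T \nabla_q L$ with $R = I - 2rr^T + 2\hat{q}\hat{e}^T$ held constant.

```python
import numpy as np

def rotation_trick_grad(e, q, grad_q):
    """Gradient w.r.t. e when q is expressed as (|q|/|e|) R e, with the scale
    and the rotation R (which rotates e onto q) treated as constants."""
    e_hat = e / np.linalg.norm(e)
    q_hat = q / np.linalg.norm(q)
    r = (e_hat + q_hat) / np.linalg.norm(e_hat + q_hat)
    R = np.eye(len(e)) - 2 * np.outer(r, r) + 2 * np.outer(q_hat, e_hat)
    scale = np.linalg.norm(q) / np.linalg.norm(e)
    return scale * R.T @ grad_q

def simulate(points, q, grad_q, lr=0.02, steps=25, use_rotation=True):
    """Repeatedly update pre-quantized points in one codebook region.
    The codebook vector q and the gradient at q stay fixed throughout."""
    points = points.copy()
    for _ in range(steps):
        for i, e in enumerate(points):
            # The angle and magnitude relative to q are recomputed at every step.
            g = rotation_trick_grad(e, q, grad_q) if use_rotation else grad_q
            points[i] = e - lr * g
    return points

rng = np.random.default_rng(0)
q = np.array([2.0, 1.0])                         # fixed codebook vector (assumed)
start = q + 0.3 * rng.standard_normal((50, 2))   # Gaussian pre-quantized points
grad_q = np.array([1.0, 0.0])                    # constant gradient at q (assumed)

ste_points = simulate(start, q, grad_q, use_rotation=False)
rot_points = simulate(start, q, grad_q, use_rotation=True)
```

Under the STE every point receives the same displacement, whereas under the Rotation Trick the displacement depends on where each point lies relative to the codebook vector.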

[Figure: video reconstruction panels. Columns: Original Video, Rotation Trick Reconstructions, STE Reconstructions.]

Section: A.11 COMPARISON WITHIN GENERATIVE MODELING APPLICATIONS
Absent from our work is an analysis of the effect of VQ-VAEs trained with the Rotation Trick on downstream generative modeling applications. We see this comparison as outside the scope of this


References:
[b0] Alexei Baevski; Steffen Schneider; Michael Auli (2019). vq-wav2vec: Self-supervised learning of discrete speech representations. 
[b1] Jonathan Baxter (2000). A model of inductive bias learning. Journal of artificial intelligence research
[b2] Yoshua Bengio; Nicholas Léonard; Aaron Courville (2013). Estimating or propagating gradients through stochastic neurons for conditional computation. 
[b3] Gedas Bertasius; Heng Wang; Lorenzo Torresani (2021). Is space-time attention all you need for video understanding. ICML
[b4] Tim Brooks; Bill Peebles; Connor Holmes; Will Depue; Yufei Guo; Li Jing; David Schnurr; Joe Taylor; Troy Luhman; Eric Luhman; Clarence Ng; Ricky Wang; Aditya Ramesh (2024). Video generation models as world simulators. 
[b5] Huiwen Chang; Han Zhang; Lu Jiang; Ce Liu; William T Freeman (2022). Maskgit: Masked generative image transformer. 
[b6] Hang Chen; Sankepally Sainath Reddy; Ziwei Chen; Dianbo Liu (2024). Balance of number of embedding and their dimensions in vector quantization. 
[b7] Zhao Chen; Vijay Badrinarayanan; Chen-Yu Lee; Andrew Rabinovich (2018). Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. PMLR
[b8] Chung-Cheng Chiu; James Qin; Yu Zhang; Jiahui Yu; Yonghui Wu (2022). Self-supervised learning with random-projection quantizer for speech recognition. PMLR
[b9] Thomas M Cover (1999). Elements of information theory. John Wiley & Sons
[b10] Mathieu Dagréou; Pierre Ablin; Samuel Vaiter; Thomas Moreau (2024). How to compute Hessian-vector products? In ICLR Blogposts.
[b11] Jia Deng; Wei Dong; Richard Socher; Li-Jia Li; Kai Li; Li Fei-Fei (2009). Imagenet: A large-scale hierarchical image database. Ieee
[b12] Prafulla Dhariwal; Heewoo Jun; Christine Payne; Jong Wook Kim; Alec Radford; Ilya Sutskever (2020). Jukebox: A generative model for music. 
[b13] Xiaoyi Dong; Jianmin Bao; Ting Zhang; Dongdong Chen; Weiming Zhang; Lu Yuan; Dong Chen; Fang Wen; Nenghai Yu; Baining Guo (2023). Peco: Perceptual codebook for bert pre-training of vision transformers. 
[b14] Alexey Dosovitskiy (2020). An image is worth 16x16 words: Transformers for image recognition at scale. 
[b15] Frederik Ebert; Chelsea Finn; Alex X Lee; Sergey Levine (2017). Self-supervised visual planning with temporal skip connections. CoRL
[b16] Patrick Esser; Robin Rombach; Bjorn Ommer (2021). Taming transformers for high-resolution image synthesis. 
[b17] Tanmay Gautam; Reid Pryzant; Ziyi Yang; Chenguang Zhu; Somayeh Sojoudi (2023). Soft convex quantization: Revisiting vector quantization with convex optimization. 
[b18] Nabarun Goswami; Yusuke Mukuta; Tatsuya Harada (2024). Hypervq: Mlr-based vector quantization in hyperbolic space. 
[b19] Robert Gray (1984). Vector quantization. IEEE Assp Magazine
[b20] Martin Heusel; Hubert Ramsauer; Thomas Unterthiner; Bernhard Nessler; Sepp Hochreiter (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems
[b21] Geoffrey E Hinton; Richard Zemel (1993). Autoencoders, minimum description length and helmholtz free energy. Advances in neural information processing systems
[b22] Mengqi Huang; Zhendong Mao; Zhuowei Chen; Yongdong Zhang (2023). Towards accurate image coding: Improved autoregressive image generation with dynamic vector quantization. 
[b23] Minyoung Huh; Brian Cheung; Pulkit Agrawal; Phillip Isola (2023). Straightening out the straight-through estimator: Overcoming optimization challenges in vector quantized networks. PMLR
[b24] Phillip Isola; Jun-Yan Zhu; Tinghui Zhou; Alexei A Efros (2017). Image-to-image translation with conditional adversarial networks. 
[b25] Eric Jang; Shixiang Gu; Ben Poole (2016). Categorical reparameterization with gumbel-softmax. 
[b26] Justin Johnson; Alexandre Alahi; Li Fei-Fei (2016). Perceptual losses for real-time style transfer and super-resolution. Springer
[b27] Tero Karras (2017). Progressive growing of gans for improved quality, stability, and variation. 
[b28] Tero Karras; Samuli Laine; Timo Aila (2019). A style-based generator architecture for generative adversarial networks. 
[b29] Alex Kendall; Yarin Gal; Roberto Cipolla (2018). Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. 
[b30] Diederik P Kingma; Max Welling (2013). Auto-encoding variational bayes.
[b31] Alexander Kolesnikov; André Susano Pinto; Lucas Beyer; Xiaohua Zhai; Jeremiah Harmsen; Neil Houlsby (2022). Uvim: A unified modeling approach for vision with learned guiding codes. Advances in Neural Information Processing Systems
[b32] Adrian Łańcucki; Jan Chorowski; Guillaume Sanchez; Ricard Marxer; Nanxin Chen; Hans JGA Dolfing; Sameer Khurana; Tanel Alumäe; Antoine Laurent (2020). Robust training of vector quantized bottleneck models. IEEE
[b33] Doyup Lee; Chiheon Kim; Saehoon Kim; Minsu Cho; Wook-Shin Han (2022). Autoregressive image generation using residual quantization. 
[b34] Fabian Mentzer; David Minnen; Eirikur Agustsson; Michael Tschannen (2023). Finite scalar quantization: Vq-vae made simple. 
[b35] Robin Rombach; Andreas Blattmann; Dominik Lorenz; Patrick Esser; Björn Ommer (2022). High-resolution image synthesis with latent diffusion models.
[b36] Tim Salimans; Ian Goodfellow; Wojciech Zaremba; Vicki Cheung; Alec Radford; Xi Chen (2016). Improved techniques for training gans. Advances in neural information processing systems
[b37] Khurram Soomro; Amir Roshan Zamir; Mubarak Shah (2012). Ucf101: A dataset of 101 human actions classes from videos in the wild.
[b38] Yuhta Takida; Takashi Shibuya; Weihsiang Liao; Chieh-Hsin Lai; Junki Ohmura; Toshimitsu Uesaka; Naoki Murata; Shusuke Takahashi; Toshiyuki Kumakura; Yuki Mitsufuji (2022). Sq-vae: Variational bayes on discrete representation with self-annealed stochastic quantization. 
[b39] Thomas Unterthiner; Sjoerd Van Steenkiste; Karol Kurach; Raphael Marinier; Marcin Michalski; Sylvain Gelly (2018). Towards accurate generative models of video: A new metric & challenges. 
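[b40] Aaron Van Den Oord; Oriol Vinyals; Koray Kavukcuoglu (2017). Neural discrete representation learning. Advances in neural information processing systems
[b41] Ashish Vaswani; Noam Shazeer; Niki Parmar; Jakob Uszkoreit; Llion Jones; Aidan N Gomez; Łukasz Kaiser; Illia Polosukhin (2017). Attention is all you need.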
[b42] Wilson Yan; Yunzhi Zhang; Pieter Abbeel; Aravind Srinivas (2021). Videogpt: Video generation using vq-vae and transformers. 
[b43] Jiahui Yu; Xin Li; Jing Yu Koh; Han Zhang; Ruoming Pang; James Qin; Alexander Ku; Yuanzhong Xu; Jason Baldridge; Yonghui Wu (2021). Vector-quantized image modeling with improved vqgan. 
[b44] Lijun Yu; José Lezama; Nitesh B Gundavarapu; Luca Versari; Kihyuk Sohn; David Minnen; Yong Cheng; Agrim Gupta; Xiuye Gu; Alexander G Hauptmann (2023). Language model beats diffusion - tokenizer is key to visual generation.
[b45] Jiahui Zhang; Fangneng Zhan; Christian Theobalt; Shijian Lu (2023). Regularized vector quantization for tokenized image synthesis. 
[b46] Yue Zhao; Yuanjun Xiong; Philipp Krähenbühl (2024). Image and video tokenization with binary spherical quantization. 
[b47] Chuanxia Zheng; Andrea Vedaldi (2023). Online clustered codebook. 
[b48] Zixin Zhu; Xuelu Feng; Dongdong Chen; Jianmin Bao; Le Wang; Yinpeng Chen; Lu Yuan; Gang Hua (2023). Designing a better asymmetric vqgan for stablediffusion. 

Figures:
Figure fig_0: 2
Type: figure
Caption: Figure 2: Visualization of how the straight-through estimator (STE) transforms the gradient field for 16 codebook vectors for (top) $f(x, y) = x^2 + y^2$ and (bottom) $f(x, y) = \log\left|\tfrac{1}{2}x + \tanh(y)\right|$. The STE takes the gradient at the codebook vector $(q_x, q_y)$ and "copies-and-pastes" it to all other locations within the same codebook region, forming a "checker-board" pattern in the gradient field. function Q(•). We can break down the backward pass into three terms: $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial q} \frac{\partial q}{\partial e} \frac{\partial e}{\partial x}$
Data: 

Figure fig_1: 3
Type: figure
Caption: Figure 3: Illustration of how the gradient at q
Data: 

Figure fig_2: 4
Type: figure
Caption: Figure 4: Depiction of how points within the same codebook region change after a gradient update (red arrow) at the codebook vector (orange circle). The STE applies the same update to each point in the same region. The rotation trick modifies the update based on the location of each point with respect to the codebook vector.
Data: 

Figure fig_3: 5
Type: figure
Caption: Figure 5: With the STE, the distances among points within the same region do not change. With the rotation trick, however, the distances among points do change. When ϕ < π/2, points with large angular distance are pushed away (blue: increasing distance). When ϕ > π/2, points are pulled towards the codebook vector (green: decreasing distance).
Data: 

Figure fig_4: 7
Type: figure
Caption: Figure 7: Loss surface for Himmelblau's function. Himmelblau's function has four equal local minima: f (3.0, 2.0) = 0.0, f (-2.8.., 3.1...) = 0.0, f (-3.7.., -3.2..) = 0.0, and f (3.5.., -1.8..) = 0.0.
Data: 

Figure fig_5: 8
Type: figure
Caption: Figure 8 visualizes our results after 33, 66, and 100 gradient updates. The orange circles represent codebook vectors, the green dots the initial points, and the blue dots the updated points. Contour lines are drawn in each diagram to indicate regions of equal loss, with blue representing regions of low loss and red indicating regions of high loss. Similar to our findings in Section 5, we see that the rotation trick clusters points more tightly around each codebook vector when compared to the STE, resulting in lower distortion. Moreover, the codebook vectors more rapidly converge to the four equal local minima in Himmelblau's function, resulting in a lower objective function value when averaged across all points.
Data: 

Figure fig_6: 8
Type: figure
Caption: Figure 8: Synthetic experiment for minimizing Himmelblau's function with vector quantization using the STE gradient estimator (top row) and the rotation trick (bottom row). The rotation trick more quickly converges to these minima and achieves substantively lower distortion between codewords and pre-quantized points.
Data: 

Figure fig_7: 9
Type: figure
Caption: Figure 9: Examples of how the gradient can change due to the presence of negative curvature or an indefinite
Data: 

Figure fig_8: 10
Type: figure
Caption: Figure 10: Depiction of how points within the same codebook region change after a gradient update (red arrow)
Data: 

Figure fig_9: 1
Type: figure
Caption: Remark 1. Let $a, b \in \mathbb{R}^d$ define hyperplanes $a^\perp$ and $b^\perp$ respectively. Then a reflection across $a^\perp$ followed by a reflection across $b^\perp$ is a rotation of $2\theta$ in the plane spanned by $a, b$, where $\theta$ is the angle between $a$ and $b$. Remark 2. Let $a, b \in \mathbb{R}^d$ with $\|a\| = \|b\| = 1$. Define $c = \frac{a+b}{\|a+b\|}$ as the vector half-way between $a$ and $b$ so that $\angle(a, b) = \theta$ and $\angle(a, c) = \angle(b, c) = \frac{\theta}{2}$. From Definition 1, $(I - 2cc^T)$ encodes a reflection across $c^\perp$ and $(I - 2bb^T)$ encodes a reflection across $b^\perp$. From Remark 1, $(I - 2bb^T)(I - 2cc^T)$ then corresponds to a rotation of $2(\frac{\theta}{2}) = \theta$ in the plane spanned by $b$ and $c$. As $\operatorname{span}(b, c) = \operatorname{span}(a, b)$, $(I - 2bb^T)(I - 2cc^T)$ corresponds to a rotation of $\theta$ in the plane spanned by $a$ and $b$. Therefore, $(I - 2bb^T)(I - 2cc^T)a = b$.
Data: 

Figure fig_11: 13
Type: figure
Caption: Figure 13: Depiction of how points within the same codebook region change after a gradient update (red arrow) at the codebook vector (orange circle). The STE applies the same update to each point in the same region. The reflection trick (Appendix A.8) modifies the update based on the location of each point with respect to the codebook vector. Note the top-left region of the reflection trick update, where the points actually move in the opposite direction of the gradient update.
Data: 

Figure fig_12: 
Type: figure
Caption: $x = r\cos\theta$, $y = r\sin\theta\cos\phi$, $z = r\sin\theta\sin\phi$. More generally, hyperspherical coordinates are composed of a radial coordinate $r$ and $d-1$ angular coordinates $\theta_1, \ldots, \theta_{d-1}$, where $\theta_1, \ldots, \theta_{d-2}$ are supported over $[0, \pi]$ while $\theta_{d-1}$ ranges over $[0, 2\pi]$.
Data: 

Figure fig_13: 19
Type: figure
Caption: Figure 19: Visualization of basis vectors at different points under Cartesian (left) and spherical (right) coordinate systems. Notice that the Cartesian basis vectors do not change from point to point; however, the spherical basis vectors change in both direction and magnitude. Even at the same radius, the $\frac{\partial}{\partial \phi}$ basis vector changes based on the azimuth angle θ because the same infinitesimal change in ϕ will result in a longer (or shorter) change in arclength depending on the radius of the circle at latitude θ.
Data: 

Figure fig_14: 
Type: figure
Caption: Figure 20: (top) Visualization of vector transport in Cartesian coordinates and renormalized hyperspherical coordinates along curves $\gamma_1(t)$, $\gamma_2(t)$ and $\gamma_3(t)$. Notice the hyperspherical basis changes from point to point. (bottom) Depiction of the transported vector in terms of the basis vectors $\frac{\partial}{\partial x}$ and $\frac{\partial}{\partial y}$ for Cartesian coordinates and $\frac{\partial}{\partial \hat{r}}$ and $\frac{\partial}{\partial \hat{\theta}}$ for hyperspherical coordinates. Notice how the components of $\frac{\partial}{\partial \hat{r}}$ and $\frac{\partial}{\partial \hat{\theta}}$ change for a constant vector field in the Cartesian tangent space.
Data: 

Figure fig_15: 
Type: figure
Caption: $\nabla_{\dot{\gamma}(t)} v = \nabla_{\dot{\gamma}^1 e_1 + \dot{\gamma}^2 e_2 + \ldots + \dot{\gamma}^d e_d} \left( v^1 e_1 + v^2 e_2 + \ldots + v^d e_d \right)$ must be equal to 0 for parallel transport
Data: 

Figure fig_16: 
Type: figure
Caption: Figure 21: (top) Visualization of vector transport in hyperspherical coordinates with normalized basis vectors and Cartesian coordinates along curves $\gamma_1(t)$, $\gamma_2(t)$ and $\gamma_3(t)$. The vectors along each curve in hyperspherical coordinates rotate to stay constant with respect to the natural rotation of the basis vectors. This same rotation in Cartesian coordinates yields a non-constant vector as the Cartesian basis vectors do not change from point to point. (bottom) Depiction of the transported vector in terms of the basis vectors $\frac{\partial}{\partial \hat{r}}$ and $\frac{\partial}{\partial \hat{\theta}}$ for hyperspherical coordinates and $\frac{\partial}{\partial x}$ and $\frac{\partial}{\partial y}$ for Cartesian coordinates. In the former case, the transported vector remains constant with respect to the normalized basis vectors, while in Cartesian coordinates, the components change along $\gamma_3(t)$.
Data: 

Figure tab_0: 1
Type: table
Caption: Comparison of VQ-VAEs trained on ImageNet following Van Den Oord et al. (2017). We use the Vector Quantization layer from https://github.com/lucidrains/vector-quantize-pytorch.
Data: ApproachTraining MetricsValidation MetricsVQ-VAE100%0.1075.9e-30.115106.111.7VQ-VAE w/ Rotation Trick97%0.1165.1e-40.12285.717.0Codebook Lookup: Cosine & Latent Shape: 32 × 32 × 32 & Codebook Size: 1024VQ-VAE75%0.1072.9e-30.11484.317.7VQ-VAE w/ Rotation Trick91%0.1052.7e-30.11182.918.1Codebook Lookup: Euclidean & Latent Shape: 64 × 64 × 3 & Codebook Size: 8192VQ-VAE100%0.0281.0e-30.03019.097.3Gumbel VQ-VAE39%0.054-0.05828.674.9VQ-VAE w/ Hessian Approx.39%0.0826.9e-50.11235.665.1VQ-VAE w/ Exact Gradients84%0.0502.0e-30.05325.480.4VQ-VAE w/ Rotation Trick99%0.0281.4e-40.03016.5106.3Codebook Lookup: Cosine & Latent Shape: 64 × 64 × 3 & Codebook Size: 8192VQ-VAE31%0.0341.2e-40.03826.077.8VQ-VAE w/ Hessian Approx.37%0.0353.8e-50.03729.071.5VQ-VAE w/ Exact Gradients38%0.0353.6e-50.03728.275.0VQ-VAE w/ Rotation Trick38%0.0339.6e-50.03524.283.9

Figure tab_1: 2
Type: table
Caption: Results for VQGAN designed for autoregressive generation as implemented in https://github.com/CompVis/taming-transformers. Experiments on ImageNet and the combined dataset FFHQ (Karras et al., 2019) and CelebA-HQ (Karras, 2017) use a latent bottleneck of dimension 16 × 16 × 256 with 1024 codebook vectors.
Data:
Approach | Dataset | Codebook Usage | Quantization Error (↓) | Valid Loss (↓) | r-FID (↓) | r-IS (↑)
VQGAN (reported) | ImageNet | - | - | - | 7.9 | 114.4
VQGAN (our run) | ImageNet | 95% | 0.134 | 0.594 | 7.3 | 118.2
VQGAN w/ Rotation Trick | ImageNet | 98% | 0.002 | 0.422 | 4.6 | 146.5
VQGAN | FFHQ & CelebA-HQ | 27% | 0.233 | 0.565 | 4.7 | 5.0
VQGAN w/ Rotation Trick | FFHQ & CelebA-HQ | 99% | 0.002 | 0.313 | 3.7 | 5.2

Figure tab_2: 3
Type: table
Caption: Results for VQGAN designed for latent diffusion as implemented in https://github.com/CompVis/latent-diffusion. Both settings train on ImageNet.
Data: ApproachLatent Shape Codebook Size Codebook Usage Quantization Error (↓) Valid Loss (↓) r-FID (↓) r-IS (↑)VQGAN64 × 64 × 3819215%2.5e-30.1830.53220.6Gumbel VQGAN64 × 64 × 381924%-0.1970.60219.7VQGAN w/ Rotation Trick 64 × 64 × 3819286%1.7e-40.1420.27228.0VQGAN32 × 32 × 4163842%1.2e-20.3855.0141.5Gumbel VQGAN32 × 32 × 41638412%-0.30311.7189.5VQGAN w/ Rotation Trick 32 × 32 × 41638427%2.4e-40.2691.1200.2

Figure tab_3: 4
Type: table
Caption: Results for ViT-VQGAN (Yu et al., 2021) trained on ImageNet. The latent shape is 8 × 8 × 32 with 8192 codebook vectors. r-FID and r-IS are reported on the validation set.
Data:
Approach | Codebook Usage (↑) | Train Loss (↓) | Quantization Error (↓) | Valid Loss (↓) | r-FID (↓) | r-IS (↑)
ViT-VQGAN [reported] | - | - | - | - | 22.8 | 72.9
ViT-VQGAN [ours] | 0.3% | 0.124 | 6.7e-3 | 0.127 | 29.2 | 43.0
ViT-VQGAN w/ Rotation Trick | 2.2% | 0.113 | 8.3e-3 | 0.113 | 11.2 | 93.1

Figure tab_4: 
Type: table
Caption: rather than CNN to parameterize the encoder and decoder. The ViT-VQGAN uses factorized codes and $L_2$ normalization on the output and input to the vector quantization layer to improve performance and training stability. Additionally, the authors change the training objective, adding a logit-laplace loss and restoring the $L_2$ reconstruction error to $L_{\text{VQGAN}}$. Experimental Settings. We follow the open source implementation of https://github.com/thuanz123/enhancing-transformers and use the default model and hyperparameter settings for the small ViT-VQGAN. A complete description of the training settings can be found in Table 10 of the Appendix.
Data: 

Figure tab_5: 5
Type: table
Caption: Results for TimeSformer-VQGAN trained on BAIR and UCF-101 with 1024 codebook vectors. †: model suffers from codebook collapse and diverges. r-FVD is computed on the validation set.
Data: ApproachDatasetCodebook Usage Train Loss (↓) Quantization Error (↓) Valid Loss (↓) r-FVD (↓)TimeSformer  †BAIR0.4%0.2210.030.281661.1TimeSformer w/ Rotation TrickBAIR43%0.0743.0e-30.07421.4TimeSformer  †UCF-1010.1%0.1900.0060.1692878.1TimeSformer w/ Rotation Trick UCF-10130%0.1110.0200.109229.1

Figure tab_6: 
Type: table
Caption: video model. Due to compute limitations, both encoder and decoder follow a relatively small TimeSformer model: 8 layers, 256 hidden dimensions, 4 attention heads, and 768 MLP hidden dimensions. A complete description of the architecture, training settings, and hyperparameters is provided in Appendix A.10.4.
Data: 

Figure tab_8: 
Type: table
Caption: Rotation Trick Function Latent Shape Codebook Size Codebook Usage Quantization Error (↓) Valid Loss (↓) r-FID (↓) r-IS (↑)
Data: ∥q∥ ∥e∥ Re64 × 64 × 3819245%4.0e-40.1610.46225.0Re -(q -Re)64 × 64 × 3819228%1.5e-30.1830.6220.0∥q∥ ∥e∥ Re32 × 32 × 41638418%3.3e-40.2921.5196.1Re -(q -Re)32 × 32 × 41638413%9.4e-40.2921.5191.5


Formulas:
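Formula formula_0: $Q(q = i \mid e) = \begin{cases} 1 & \text{if } i = \arg\min_{1 \le j \le |C|} \|e - q_j\|_2 \\ 0 & \text{otherwise} \end{cases}$

Formula formula_1: $L(x) = \|x - \tilde{x}\|_2^2 + \|\mathrm{sg}(e) - q\|_2^2 + \beta \|e - \mathrm{sg}(q)\|_2^2$

Formula formula_2: $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial q} \, I \, \frac{\partial e}{\partial x}$

Formula formula_3: $L_e \approx L_q + (\nabla_q L)^T (e - q) + \frac{1}{2} (e - q)^T (\nabla_q^2 L)(e - q)$

Formula formula_4: $\frac{\partial L}{\partial e} \approx \frac{\partial}{\partial e} \left[ L_q + (\nabla_q L)^T (e - q) + \frac{1}{2} (e - q)^T (\nabla_q^2 L)(e - q) \right] = \nabla_q L + (\nabla_q^2 L)(e - q)$

Formula formula_5: $\frac{\partial \tilde{q}}{\partial e} = \frac{\|q\|}{\|e\|} R$

Formula formula_6: $\tilde{q} = \lambda R e = \lambda (I - 2rr^T + 2\hat{q}\hat{e}^T) e = \lambda [e - 2rr^T e + 2\hat{q}\hat{e}^T e]$

Formula formula_7: $L = \|x - \tilde{x}\|_2^2 + \|\mathrm{sg}(e) - q\|_2^2 + \beta \|e - \mathrm{sg}(q)\|_2^2$

Formula formula_8: $L_{\text{VQGAN}} = L_{\text{Per}} + \|\mathrm{sg}(e) - q\|_2^2 + \beta \|e - \mathrm{sg}(q)\|_2^2 + \lambda L_{\text{Adv}}$

Formula formula_9: $\{L_e\}_{\text{Hessian}} = L_q + (\nabla_q L)^T (e - q) + \frac{1}{2} (e - q)^T (\nabla_q^2 L)(e - q)$

Formula formula_10: $\frac{\partial L_e}{\partial e} - \left\{ \frac{\partial L_e}{\partial e} \right\}_{\text{Hessian}} = \frac{\partial}{\partial e} O(\|e - q\|^3)$

Formula formula_11: $\|q - e\|^2 = \|q\|^2 + \|e\|^2 - 2\|q\|\|e\| \cos(\theta)$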

Formula formula_12: ∥q -ê∥ 2 = ∥q -e∥ 2 = ∥q∥ 2 + ∥ê∥ 2 -2∥q∥∥ê∥ cos θ

Formula formula_13: $\cos\tilde{\theta} = \frac{\|q\|^2 + \|e\|^2 - 2\|q\|\|e\| \cos(\theta) - \|q + d\|^2 - \|e + d\|^2}{-2\|q + d\|\|e + d\|}$

Formula formula_14: $\cos\tilde{\theta} \approx \frac{-2\|d\|^2}{-2\|d\|^2} = 1$

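Formula formula_15:
$$
\begin{aligned}
R &= (I - 2\hat{q}\hat{q}^T)(I - 2rr^T) \\
&= I - 2\hat{q}\hat{q}^T - 2rr^T + 4\hat{q}\hat{q}^T rr^T \\
&= I - 2\hat{q}\hat{q}^T - 2rr^T + 4\hat{q} \left[ \hat{q}^T r \right] r^T \\
&= I - 2\hat{q}\hat{q}^T - 2rr^T + 4\hat{q} \left[ \hat{q}^T \frac{\hat{e} + \hat{q}}{\|\hat{e} + \hat{q}\|} \right] r^T \\
&= I - 2\hat{q}\hat{q}^T - 2rr^T + 4\hat{q} \left[ \frac{\hat{q}^T \hat{e} + \hat{q}^T \hat{q}}{\|\hat{e} + \hat{q}\|} \right] r^T \\
&= I - 2\hat{q}\hat{q}^T - 2rr^T + 4\hat{q} \left[ \frac{\|\hat{q}\|\|\hat{e}\| \cos\theta + \|\hat{q}\|\|\hat{q}\|}{\|\hat{e} + \hat{q}\|} \right] r^T \\
&= I - 2\hat{q}\hat{q}^T - 2rr^T + 4\hat{q} \left[ \frac{\cos\theta + 1}{\|\hat{e} + \hat{q}\|} \right] r^T \\
&= I - 2\hat{q}\hat{q}^T - 2rr^T + 4\hat{q} \left[ \frac{\|\hat{e} + \hat{q}\|^2}{2\|\hat{e} + \hat{q}\|} \right] r^T \\
&= I - 2\hat{q}\hat{q}^T - 2rr^T + \frac{4\|\hat{e} + \hat{q}\|^2}{2\|\hat{e} + \hat{q}\|} \hat{q} r^T \\
&= I - 2\hat{q}\hat{q}^T - 2rr^T + \frac{4\|\hat{e} + \hat{q}\|^2}{2\|\hat{e} + \hat{q}\|^2} \hat{q} (\hat{e} + \hat{q})^T \\
&= I - 2\hat{q}\hat{q}^T - 2rr^T + 2\hat{q}\hat{e}^T + 2\hat{q}\hat{q}^T \\
&= I - 2rr^T + 2\hat{q}\hat{e}^T
\end{aligned}
$$

Formula formula_16: $q = eR^T, \qquad \frac{\partial q}{\partial e} = R$

Formula formula_17: $\nabla_e L = \nabla_q L \frac{\partial q}{\partial e} = \nabla_q L \, [R]$

Formula formula_18: $\|\nabla_q L\| \cos\theta = q [\nabla_q L]^T = eR^T [\nabla_q L]^T = e [\nabla_q L R]^T = e [\nabla_e L]^T$

Formula formula_19: $\tilde{q} = \frac{\|q\|}{\|e\|} R e$

Formula formula_20: $f(e) = \frac{\|Q(e)\|}{\|e\|} \left[ I - 2 \left( \frac{e + Q(e)}{\|e + Q(e)\|} \right) \left( \frac{e + Q(e)}{\|e + Q(e)\|} \right)^T + 2Q(e)e^T \right] = \frac{\|q\|}{\|e\|} R$

Formula formula_21: $\tilde{q} = f(e)e$

Formula formula_22: $\frac{\partial \tilde{q}}{\partial e} = f'(e)e + f(e)$

Formula formula_23: $\tilde{q} = \underbrace{R}_{\text{constant}} e + \underbrace{(q - Re)}_{\text{constant}}$

Formula formula_24: $\|e - q\| = \sqrt{\|e\|^2 + \|q\|^2 - 2\|e\|\|q\| \cos\theta}$

Formula formula_25: $\tilde{q} = \gamma(e)Re + (q - \gamma(e)Re)$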

Formula formula_26: γ(e) = 1 8∥q -e∥ 2

Formula formula_27: $$\begin{aligned}
x^{d-1} &= r \sin(\theta_1) \cdots \sin(\theta_{d-2}) \cos(\theta_{d-1}) \\
x^{d} &= r \sin(\theta_1) \cdots \sin(\theta_{d-2}) \sin(\theta_{d-1})
\end{aligned}$$

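Formula formula_28: $$\begin{aligned}
r &= \sqrt{(x^1)^2 + (x^2)^2 + \dots + (x^d)^2} \\
\theta_1 &= \operatorname{arctan2}\!\left(\sqrt{(x^d)^2 + \dots + (x^2)^2},\; x^1\right) \\
\theta_2 &= \operatorname{arctan2}\!\left(\sqrt{(x^d)^2 + \dots + (x^3)^2},\; x^2\right)
\end{aligned}$$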

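Formula formula_29: $$\begin{aligned}
\theta_{d-2} &= \operatorname{arctan2}\!\left(\sqrt{(x^d)^2 + (x^{d-1})^2},\; x^{d-2}\right) \\
\theta_{d-1} &= \operatorname{arctan2}\!\left(\sqrt{(x^d)^2},\; x^{d-1}\right)
\end{aligned}$$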
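The coordinate change in formula_27 through formula_29 can be sketched as a pair of conversion functions. One assumption differs from formula_29 as printed: the last angle below is computed with the signed form atan2(x^d, x^{d-1}) rather than with the square root of (x^d)^2, so that a negative final coordinate survives the round trip. The function names and the example point are hypothetical, and d must be at least 2.

```python
import math

def to_hyperspherical(x):
    """Cartesian coordinates -> (r, [theta_1, ..., theta_{d-1}])."""
    d = len(x)
    r = math.sqrt(sum(v * v for v in x))
    thetas = []
    for i in range(d - 2):
        tail = math.sqrt(sum(v * v for v in x[i + 1:]))
        thetas.append(math.atan2(tail, x[i]))
    # Assumption: signed atan2 for the last angle (see the note above).
    thetas.append(math.atan2(x[d - 1], x[d - 2]))
    return r, thetas

def to_cartesian(r, thetas):
    """(r, angles) -> Cartesian coordinates, following formula_27."""
    x = []
    sin_prod = 1.0
    for theta in thetas:
        x.append(r * sin_prod * math.cos(theta))
        sin_prod *= math.sin(theta)
    x.append(r * sin_prod)   # last coordinate uses sines of every angle
    return x

# Hypothetical example point in four dimensions.
point = [1.0, -2.0, 0.5, -3.0]
radius, angles = to_hyperspherical(point)
recovered = to_cartesian(radius, angles)
```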

Formula formula_30: $$\frac{\partial}{\partial \theta_i} = \sum_{k=1}^{d} \frac{\partial x^k}{\partial \theta_i} \frac{\partial}{\partial x^k}$$

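Formula formula_31: $$\begin{bmatrix} \frac{\partial}{\partial x^1} & \frac{\partial}{\partial x^2} & \cdots & \frac{\partial}{\partial x^d} \end{bmatrix}
\underbrace{\begin{bmatrix}
\frac{\partial x^1}{\partial r} & \frac{\partial x^1}{\partial \theta_1} & \cdots & \frac{\partial x^1}{\partial \theta_{d-1}} \\
\frac{\partial x^2}{\partial r} & \frac{\partial x^2}{\partial \theta_1} & \cdots & \frac{\partial x^2}{\partial \theta_{d-1}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial x^d}{\partial r} & \frac{\partial x^d}{\partial \theta_1} & \cdots & \frac{\partial x^d}{\partial \theta_{d-1}}
\end{bmatrix}}_{\text{The Jacobian } J}
= \begin{bmatrix} \frac{\partial}{\partial r} & \frac{\partial}{\partial \theta_1} & \cdots & \frac{\partial}{\partial \theta_{d-1}} \end{bmatrix}$$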

Formula formula_32: $$ds^2 = (dx^1)^2 + (dx^2)^2 + \dots + (dx^d)^2$$

Formula formula_33: $$ds^2 = dr^2 + r^2 (d\theta_1)^2 + r^2 \sin^2\theta_1\, (d\theta_2)^2 + \dots + r^2 \prod_{i=1}^{d-2} \sin^2\theta_i\, (d\theta_{d-1})^2$$

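Formula formula_34: $$\left\|\frac{\partial}{\partial \theta_i}\right\| = \sqrt{\left\langle \frac{\partial}{\partial \theta_i}, \frac{\partial}{\partial \theta_i} \right\rangle} = \sqrt{\left(\sum_{k=1}^{d} \frac{\partial x^k}{\partial \theta_i} \frac{\partial}{\partial x^k}\right) \cdot \left(\sum_{j=1}^{d} \frac{\partial x^j}{\partial \theta_i} \frac{\partial}{\partial x^j}\right)} = \sqrt{\sum_{k=1}^{d} \frac{\partial x^k}{\partial \theta_i} \frac{\partial x^k}{\partial \theta_i} \left(\frac{\partial}{\partial x^k} \cdot \frac{\partial}{\partial x^k}\right)} = \sqrt{\sum_{k=1}^{d} \left(\frac{\partial x^k}{\partial \theta_i}\right)^2}$$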

Formula formula_35: $$I = \begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
0 & r^2 & 0 & \cdots & 0 \\
0 & 0 & r^2 \sin^2\theta_1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & r^2 \prod_{i=1}^{d-2} \sin^2\theta_i
\end{bmatrix}$$

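Formula formula_36: $\left\|\frac{\partial}{\partial \theta_i}\right\| = \sqrt{\left\langle \frac{\partial}{\partial \theta_i}, \frac{\partial}{\partial \theta_i} \right\rangle}$. Therefore, our normalized hyperspherical basis vectors $\frac{\partial}{\partial \hat{r}}, \frac{\partial}{\partial \hat{\theta}_1}$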

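Formula formula_37: $$\frac{\partial}{\partial \hat{r}} = \frac{\partial}{\partial r}, \qquad \frac{\partial}{\partial \hat{\theta}_i} = (I_{ii})^{-\frac{1}{2}} \frac{\partial}{\partial \theta_i}$$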

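Formula formula_38: $$\frac{\partial}{\partial \hat{\theta}_i} = (I_{ii})^{-\frac{1}{2}} \sum_{k=1}^{d} \frac{\partial x^k}{\partial \theta_i} \frac{\partial}{\partial x^k}$$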

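Formula formula_39: $$\begin{bmatrix} \frac{\partial}{\partial x^1} & \frac{\partial}{\partial x^2} & \cdots & \frac{\partial}{\partial x^d} \end{bmatrix}
\underbrace{\begin{bmatrix}
\frac{\partial x^1}{\partial \hat{r}} & \frac{\partial x^1}{\partial \hat{\theta}_1} & \cdots & \frac{\partial x^1}{\partial \hat{\theta}_{d-1}} \\
\frac{\partial x^2}{\partial \hat{r}} & \frac{\partial x^2}{\partial \hat{\theta}_1} & \cdots & \frac{\partial x^2}{\partial \hat{\theta}_{d-1}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial x^d}{\partial \hat{r}} & \frac{\partial x^d}{\partial \hat{\theta}_1} & \cdots & \frac{\partial x^d}{\partial \hat{\theta}_{d-1}}
\end{bmatrix}}_{\hat{J} \in SO(d)}
= \begin{bmatrix} \frac{\partial}{\partial \hat{r}} & \frac{\partial}{\partial \hat{\theta}_1} & \cdots & \frac{\partial}{\partial \hat{\theta}_{d-1}} \end{bmatrix} \tag{1}$$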

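Formula formula_40: $$\begin{aligned}
\hat{J} v &= (R_{p_0 \to \hat{p}})\, v = \left(R^{x^{d-1}-x^{d}}_{\theta_{d-1}} \cdots R^{x^2-x^3}_{\theta_2} R^{x^1-x^2}_{\theta_1}\right) v \\
\hat{J}^{-1} v &= \hat{J}^T v = (R_{p_0 \to \hat{p}})^T v = \left(R^{x^{d-1}-x^{d}}_{\theta_{d-1}} \cdots R^{x^2-x^3}_{\theta_2} R^{x^1-x^2}_{\theta_1}\right)^T v = R_{\hat{p} \to p_0}\, v
\end{aligned}$$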

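Formula formula_41: $$\begin{aligned}
v_p^T \left[R^{x^{d-1}-x^{d}}_{\theta_{d-1}} \cdots R^{x^2-x^3}_{\theta_2} R^{x^1-x^2}_{\theta_1}\right] &= \tilde{v}_p^T \\
\left[R^{x^{d-1}-x^{d}}_{\theta_{d-1}} \cdots R^{x^2-x^3}_{\theta_2} R^{x^1-x^2}_{\theta_1}\right]^T v_p &= \tilde{v}_p \\
\left[R_{\hat{p} \to p_0}\right] v_p &= \tilde{v}_p
\end{aligned}$$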

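Formula formula_42: $$\begin{aligned}
\hat{I} &= \begin{bmatrix}
\frac{\partial}{\partial \hat{r}} \cdot \frac{\partial}{\partial \hat{r}} & 0 & 0 & \cdots & 0 \\
0 & \frac{\partial}{\partial \hat{\theta}_1} \cdot \frac{\partial}{\partial \hat{\theta}_1} & 0 & \cdots & 0 \\
0 & 0 & \frac{\partial}{\partial \hat{\theta}_2} \cdot \frac{\partial}{\partial \hat{\theta}_2} & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & \frac{\partial}{\partial \hat{\theta}_{d-1}} \cdot \frac{\partial}{\partial \hat{\theta}_{d-1}}
\end{bmatrix} \\
&= \begin{bmatrix}
(I_{11})^{-1} \frac{\partial}{\partial r} \cdot \frac{\partial}{\partial r} & 0 & 0 & \cdots & 0 \\
0 & (I_{22})^{-1} \frac{\partial}{\partial \theta_1} \cdot \frac{\partial}{\partial \theta_1} & 0 & \cdots & 0 \\
0 & 0 & (I_{33})^{-1} \frac{\partial}{\partial \theta_2} \cdot \frac{\partial}{\partial \theta_2} & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & (I_{dd})^{-1} \frac{\partial}{\partial \theta_{d-1}} \cdot \frac{\partial}{\partial \theta_{d-1}}
\end{bmatrix} \\
&= \begin{bmatrix}
(I_{11})^{-1}(I_{11}) & 0 & 0 & \cdots & 0 \\
0 & (I_{22})^{-1}(I_{22}) & 0 & \cdots & 0 \\
0 & 0 & (I_{33})^{-1}(I_{33}) & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & (I_{dd})^{-1}(I_{dd})
\end{bmatrix}
= \begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1
\end{bmatrix}
\end{aligned} \tag{2}$$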

Formula formula_43: $$\frac{\partial}{\partial \hat{\theta}_i} \cdot \frac{\partial}{\partial \hat{\theta}_i} = 1.$$

Formula formula_44: $$\nabla_{\dot{\gamma}(t)}\, v = \vec{0} \qquad \text{(Parallel Transport Condition)}$$

Formula formula_45: $$g_{ij} = \delta_{ij} = \begin{cases} 0 & \text{if } i \neq j \\ 1 & \text{if } i = j \end{cases}$$

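Formula formula_46: $$\Gamma^{m}_{ij} = \frac{1}{2} g^{mk} \left( \frac{\partial g_{jk}}{\partial x^i} + \frac{\partial g_{ik}}{\partial x^j} - \frac{\partial g_{ij}}{\partial x^k} \right) = 0$$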

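Formula formula_47: $$\begin{aligned}
0 = (\nabla_{\dot{\gamma}(t)}\, v)_i &= \dot{\gamma}^i \frac{\partial}{\partial x^i} \left(v^1 e_1 + v^2 e_2 + \dots + v^d e_d\right) \\
&= \dot{\gamma}^i \frac{\partial}{\partial x^i} \left(v^k e_k\right) \\
&= \dot{\gamma}^i \left( \frac{\partial v^k}{\partial x^i} e_k + v^k \frac{\partial e_k}{\partial x^i} \right) \\
&= \dot{\gamma}^i \left( \frac{\partial v^k}{\partial x^i} e_k + v^k \Gamma^{m}_{ik} e_m \right) \\
&= \dot{\gamma}^i \frac{\partial v^k}{\partial x^i} e_k
\end{aligned}$$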

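Formula formula_48: $$\begin{aligned}
[\nabla_q \mathcal{L}]^T &= [\widetilde{\nabla_q \mathcal{L}}]^T \hat{J}_q^{-1} \\
\nabla_q \mathcal{L} &= \hat{J}_q\, \widetilde{\nabla_q \mathcal{L}} = [R_{p_0 \to q}]\, \widetilde{\nabla_q \mathcal{L}} = R_{\theta_{d-1}} R_{\theta_{d-2}} \cdots R_{\theta_1} \widetilde{\nabla_q \mathcal{L}}
\end{aligned}$$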

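Formula formula_49: $$\begin{aligned}
[\nabla_e \mathcal{L}]^T &= [\widetilde{\nabla_q \mathcal{L}}]^T \hat{J}_e^{-1} \\
\nabla_e \mathcal{L} &= \hat{J}_e\, \widetilde{\nabla_q \mathcal{L}} = [R_{p_0 \to \tilde{e}}]\, \widetilde{\nabla_q \mathcal{L}} = [R_{q \to \tilde{e}}\, R_{p_0 \to q}]\, \widetilde{\nabla_q \mathcal{L}} = R_{q \to \tilde{e}} \left( R_{p_0 \to q}\, \widetilde{\nabla_q \mathcal{L}} \right) = [R_{q \to \tilde{e}}]\, \nabla_q \mathcal{L}
\end{aligned}$$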
