Title: RESTRUCTURING VECTOR QUANTIZATION WITH THE ROTATION TRICK

Abstract: Vector Quantized Variational AutoEncoders (VQ-VAEs) are designed to compress a continuous input to a discrete latent space and reconstruct it with minimal distortion. They operate by maintaining a set of vectors, often referred to as the codebook, and quantizing each encoder output to the nearest vector in the codebook. However, as vector quantization is non-differentiable, the gradient to the encoder flows around the vector quantization layer rather than through it in a straight-through approximation. This approximation may be undesirable as all information from the vector quantization operation is lost. In this work, we propose a way to propagate gradients through the vector quantization layer of VQ-VAEs. We smoothly transform each encoder output into its corresponding codebook vector via a rotation and rescaling linear transformation that is treated as a constant during backpropagation. As a result, the relative magnitude and angle between encoder output and codebook vector become encoded into the gradient as it propagates through the vector quantization layer and back to the encoder. Across 11 different VQ-VAE training paradigms, we find this restructuring improves reconstruction metrics, codebook utilization, and quantization error. Our code is available at https://github.com/cfifty/rotation_trick.

Section: INTRODUCTION
Vector quantization (Gray, 1984) is an approach to discretize a continuous vector space. It defines a finite set of vectors, referred to as the codebook, and maps any vector in the continuous vector space to the closest vector in the codebook. However, deep learning paradigms that use vector quantization are often difficult to train because replacing a vector with its closest codebook counterpart is a non-differentiable operation (Huh et al., 2023). This characteristic was not an issue at its creation during the Renaissance of Information Theory for applications like noisy channel communication (Cover, 1999); however, in the era of deep learning, it presents a challenge as gradients cannot directly flow through layers that use vector quantization during backpropagation.
In deep learning, vector quantization is largely used in the eponymous Vector Quantized-Variational AutoEncoder (VQ-VAE) (Van Den Oord et al., 2017). A VQ-VAE is an AutoEncoder with a vector quantization layer between the encoder's output and decoder's input, thereby quantizing the learned representation at the bottleneck. While VQ-VAEs are ubiquitous in state-of-the-art generative modeling (Rombach et al., 2022;Dhariwal et al., 2020;Brooks et al., 2024), their gradients cannot flow from the decoder to the encoder uninterrupted as they must pass through a non-differentiable vector quantization layer.
A solution to the non-differentiability problem is to approximate gradients via a "straight-through estimator" (STE) (Bengio et al., 2013). During backpropagation, the STE copies and pastes the gradients from the decoder's input to the encoder's output, thereby skipping the quantization operation altogether. However, this approximation can lead to poor-performing models and codebook collapse: a phenomenon where a large percentage of the codebook vectors converge to zero norm and are unused by the model (Mentzer et al., 2023). Even if codebook collapse does not occur, the codebook is often under-utilized, thereby limiting the information capacity of the VQ-VAE's bottleneck (Dhariwal et al., 2020).
Many prior works have sought to address the training instabilities caused by the vector quantization layer. We partition these efforts into two categories: (1) methods that sidestep the STE and (2) methods that improve codebook-model interactions.
Sidestepping the STE. Several prior works have sought to fix the problems caused by the STE by avoiding deterministic vector quantization. Baevski et al. (2019) employ the Gumbel-Softmax trick (Jang et al., 2016) to fit a categorical distribution over codebook vectors that converges to a one-hot distribution towards the end of training, Gautam et al. (2023) quantize using a convex combination of codebook vectors, and Takida et al. (2022) employ stochastic quantization. Unlike the above approaches, which cast vector quantization as a distribution over codebook vectors, Huh et al. (2023) propose an alternating optimization where the encoder is optimized to output representations close to the codebook vectors while the decoder minimizes reconstruction loss from a fixed set of codebook vector inputs. While these approaches sidestep the training instabilities caused by the STE, they can introduce their own set of problems and complexities, such as low codebook utilization at inference and the tuning of a temperature schedule (Zhang et al., 2023). As a result, many applications and research papers continue to employ VQ-VAEs that are trained using the STE (Rombach et al., 2022; Chang et al., 2022; Huang et al., 2023; Zhu et al., 2023; Dong et al., 2023).
Codebook-Model Improvements. Another way to attack codebook collapse or under-utilization is to change the codebook lookup. Rather than use Euclidean distance, Yu et al. (2021) employ a cosine similarity measure, Goswami et al. (2024) a hyperbolic metric, and Lee et al. (2022) stochastically sample codes as a function of the distance between the encoder output and codebook vectors. Another perspective examines the learning of the codebook. Kolesnikov et al. (2022) split high-usage codebook vectors, Dhariwal et al. (2020); Łańcucki et al. (2020); Zheng & Vedaldi (2023) resurrect low-usage codebook vectors throughout training, Chen et al. (2024) dynamically select one of m codebooks for each datapoint, and Mentzer et al. (2023); Zhao et al. (2024); Yu et al. (2023); Chiu et al. (2022) fix the codebook vectors to an a priori geometry and train the model without learning the codebook at all. Other works propose loss penalties to encourage codebook utilization. Zhang et al. (2023) add a KL-divergence penalty between codebook utilization and a uniform distribution while Yu et al. (2023) add an entropy loss term to penalize low codebook utilization. While these methods are effective at targeting specific training difficulties, each of them continues to use the STE, so the training instability caused by this estimator persists. Most of our experiments in Section 5 implement a subset of these approaches, and we find that replacing the STE with the rotation trick further improves performance.

Section: STRAIGHT THROUGH ESTIMATOR (STE)
In this section, we review the Straight-Through Estimator (STE) and visualize its effect on the gradients. We then explore two STE alternatives that, at first glance, appear to correct the approximation made by the STE.
For notation, we define a sample space X over the input data with probability distribution p. For input x ∈ X, we define the encoder as a deterministic mapping that parameterizes a posterior distribution $p_E(e \mid x)$. The vector quantization layer, Q(•), is a function that selects the codebook vector q ∈ C nearest to the encoder output e. Under Euclidean distance, it has the form:
$$Q(q = i \mid e) = \begin{cases} 1 & \text{if } i = \arg\min_{1 \le j \le |C|} \|e - q_j\|_2 \\ 0 & \text{otherwise} \end{cases}$$
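The nearest-neighbor lookup defined above can be sketched in a few lines of NumPy. This is a minimal illustration only; the function name `quantize` and the toy codebook are assumptions made for the example and are not taken from the paper's code.

```python
import numpy as np

def quantize(e, codebook):
    """Map each encoder output (row of e) to its nearest codebook vector."""
    # Squared Euclidean distance between every encoder output and every codeword.
    d2 = ((e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)            # index i with Q(q = i | e) = 1
    return idx, codebook[idx]

# Hypothetical toy codebook with three 2-D codewords.
codebook = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
e = np.array([[0.9, 0.2], [-0.7, 0.1], [0.1, 0.8]])
idx, q = quantize(e, codebook)
```

The `argmin` is piecewise constant in `e`, which is why no useful derivative of `q` with respect to `e` exists.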
The decoder is similarly defined as a deterministic mapping that parameterizes the conditional distribution over reconstructions $p_D(x \mid q)$. As in the VAE (Kingma & Welling, 2013), the loss function follows from the ELBO with the KL-divergence term zeroing out as $p_E(e \mid x)$ is deterministic and the utilization over codebook vectors is assumed to be uniform. Van Den Oord et al. (2017) additionally add a "codebook loss" term $\|\text{sg}(e) - q\|_2^2$ to learn the codebook vectors and a "commitment loss" term $\beta\|e - \text{sg}(q)\|_2^2$ to pull the encoder's output towards the codebook vectors. Here sg stands for stop-gradient and β is a hyperparameter, typically set to a value in [0.25, 2]. For predicted reconstruction $\hat{x}$, the optimization objective becomes:
$$L(\hat{x}) = \|x - \hat{x}\|_2^2 + \|\text{sg}(e) - q\|_2^2 + \beta\|e - \text{sg}(q)\|_2^2$$
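In the subsequent analysis, we focus only on the $\|x - \hat{x}\|_2^2$ term as the other two are not functions of the decoder. During backpropagation, the model must differentiate through the vector quantization layer, and by the chain rule the backward pass factors into three terms:
$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial q} \frac{\partial q}{\partial e} \frac{\partial e}{\partial x}$$
where $\frac{\partial L}{\partial q}$ represents backpropagation through the decoder, $\frac{\partial q}{\partial e}$ represents backpropagation through the vector quantization layer, and $\frac{\partial e}{\partial x}$ represents backpropagation through the encoder. As vector quantization is not a smooth transformation, $\frac{\partial q}{\partial e}$ cannot be computed and gradients cannot flow through this term to update the encoder in backpropagation.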
To solve the issue of non-differentiability, the STE copies the gradients from q to e, bypassing vector quantization entirely. Simply, the STE sets $\frac{\partial q}{\partial e}$ to the identity matrix I in the backward pass:
$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial q} \, I \, \frac{\partial e}{\partial x}$$
The first two terms $\frac{\partial L}{\partial q} \frac{\partial q}{\partial e}$ combine to $\frac{\partial L}{\partial e}$ which, somewhat misleadingly, does not actually depend on e. As a consequence, the location of e within the Voronoi partition generated by codebook vector q, be it close to q or at the boundary of the region, has no impact on the gradient update to the encoder.
An example of this effect is visualized in Figure 2 for two example functions. In the STE approximation, the "exact" gradient at the encoder output is replaced by the gradient at the corresponding codebook vector for each Voronoi partition, irrespective of where in that region the encoder output e lies. As a result, the exact gradient field becomes "partitioned" into 16 different regions-all with the same gradient update to the encoder-for the 16 vectors in the codebook.
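The partitioning effect can be checked numerically. The sketch below assumes an illustrative loss $L(z) = \|z - t\|^2$ with a hypothetical target t; it only demonstrates that, under the STE, every encoder output in a region receives the gradient evaluated at q.

```python
import numpy as np

def loss_grad(z):
    # Gradient of the illustrative loss L(z) = ||z - target||^2 (an assumption).
    target = np.array([2.0, -1.0])
    return 2.0 * (z - target)

q = np.array([1.0, 0.0])                        # codebook vector of this region
points = np.array([[0.95, 0.05], [0.6, 0.3]])   # encoder outputs quantized to q

grad_q = loss_grad(q)                           # dL/dq coming from the decoder side
# STE: dq/de = I, so each point receives grad_q unchanged.
ste_grads = np.stack([np.eye(2) @ grad_q for _ in points])
# Gradients that would be obtained without quantization, shown for comparison.
exact_grads = np.stack([loss_grad(p) for p in points])
```

Both rows of `ste_grads` are identical, while the rows of `exact_grads` differ with the position of each point.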
Returning to our question, is there a better way to propagate gradients through the vector quantization layer? At first glance, one may be tempted to estimate the curvature at q and use this information to transform $\frac{\partial q}{\partial e}$ as q moves to e. This is accomplished by taking a second-order expansion around q to approximate the value of the loss at e:
$$L_e \approx L_q + (\nabla_q L)^T (e - q) + \frac{1}{2}(e - q)^T (\nabla_q^2 L)(e - q)$$
Then we can compute the gradient at the point e instead of q, up to a second-order approximation, with:
$$\frac{\partial L}{\partial e} \approx \frac{\partial}{\partial e}\left[ L_q + (\nabla_q L)^T (e - q) + \frac{1}{2}(e - q)^T (\nabla_q^2 L)(e - q) \right] = \nabla_q L + (\nabla_q^2 L)(e - q)$$
While computing Hessians with respect to model parameters is typically prohibitive in modern deep learning architectures, computing them with respect to only the codebook is feasible. Moreover, as we must only compute $(\nabla_q^2 L)(e - q)$, one may take advantage of efficient Hessian-vector product implementations in deep learning frameworks (Dagréou et al., 2024) and avoid computing the full Hessian matrix.
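As a rough illustration of the second-order correction $\nabla_q L + (\nabla_q^2 L)(e - q)$, the sketch below uses a central finite difference of the gradient as a stand-in for a framework's autodiff Hessian-vector product. The quadratic loss, the matrix `A`, and the helper `hvp` are assumptions made for the example.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # hypothetical symmetric curvature matrix
b = np.array([-1.0, 0.5])

def grad_loss(z):
    # Gradient of the illustrative quadratic loss L(z) = 0.5 z^T A z + b^T z.
    return A @ z + b

def hvp(grad_fn, q, v, eps=1e-4):
    # Finite-difference Hessian-vector product H(q) v; the Hessian is never formed.
    return (grad_fn(q + eps * v) - grad_fn(q - eps * v)) / (2.0 * eps)

q = np.array([1.0, 0.0])
e = np.array([0.8, 0.3])

# Second-order estimate of the gradient at e: grad_q L + H (e - q).
approx_grad_e = grad_loss(q) + hvp(grad_loss, q, e - q)
exact_grad_e = grad_loss(e)
```

Because this toy loss is exactly quadratic, the estimate coincides with the gradient at e; for a general loss the two differ by higher-order terms.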
Extending this idea a step further, we can compute the exact gradient $\frac{\partial L}{\partial e}$ at e by making two passes through the network. Let $L_q$ be the loss with the vector quantization layer and $L_e$ be the loss without vector quantization, i.e. q = e rather than q = Q(e). Then one may form the total loss $L = L_q + \lambda L_e$, where λ is a small constant like $10^{-6}$, to scale down the effect of $L_e$ on the decoder's parameters, and use a gradient scaling multiplier of $\lambda^{-1}$ to reweight the effect of $L_e$ on the encoder's parameters to 1. As $\frac{\partial q}{\partial e}$ is non-differentiable, gradients from $L_q$ will not flow to the encoder.
While seeming to correct the encoder's gradients, replacing the STE with either approach will likely result in worse performance. This is because computing the exact gradient with respect to e is actually the AutoEncoder (Hinton & Zemel, 1993) gradient, the model that VAEs (Kingma & Welling, 2013) and VQ-VAEs (Van Den Oord et al., 2017) were designed to replace given the AutoEncoder's propensity to overfit and difficulty generalizing. Accordingly, using either the Hessian approximation or exact gradients via a double forward pass will cause the encoder to be trained like an AutoEncoder and the decoder to be trained like a VQ-VAE. This mismatch in optimization objectives is likely another contributing factor to the poor performance we observe for both methods in Table 1, and a deeper analysis into these characteristics is presented in Appendix A.3.

Section: THE ROTATION TRICK
As discussed in Section 3, updating the encoder's parameters by approximating, or exactly computing, the gradient at the encoder's output is undesirable. Similarly, the STE appears to lose information: the location of e within the quantized region, be it close to q or far away at the boundary, has no impact on the gradient update to the encoder. Capturing this information, i.e. using the location of e in relation to q to transform the gradients through $\frac{\partial q}{\partial e}$, could be beneficial to the encoder's gradient updates and an improvement over the STE.
Viewed geometrically, we ask how to move the gradient $\nabla_q L$ from q to e, and what characteristics of $\nabla_q L$ and q should be preserved during this movement. The STE offers one possible answer: move the gradient from q to e so that its direction and magnitude are preserved. However, this paper supplies a different answer: move the gradient so that the angle between $\nabla_q L$ and q is preserved as $\nabla_q L$ moves to e. We term this approach "the rotation trick", and in Section 4.3 we show that preserving the angle between q and $\nabla_q L$ conveys desirable properties to how points move within the same quantized region.

Section: THE ROTATION TRICK PRESERVES ANGLES
In this section, we formally define the rotation trick. For encoder output e, let q = Q(e) represent the corresponding codebook vector. Q(•) is non-differentiable, so gradients cannot flow through this layer during the backward pass. The STE solves this problem, maintaining the direction and magnitude of the gradient $\nabla_q L$ as it moves from q to e, with some clever hacking of the backpropagation function in deep learning frameworks:
$$q = e + \underbrace{(q - e)}_{\text{constant}}$$
which is a parameterization of vector quantization that sets the gradient at the encoder output to the gradient at the decoder's input. The rotation trick offers a different parameterization: casting the forward pass as a rotation and rescaling that aligns e with q:
$$q = \underbrace{\left[\frac{\|q\|}{\|e\|} R\right]}_{\text{constant}} e$$
R is the rotation transformation that aligns e with q and $\frac{\|q\|}{\|e\|}$ rescales e to have the same magnitude as q. Note that both R and $\frac{\|q\|}{\|e\|}$ are functions of e. To avoid differentiating through this dependency, we treat them as fixed constants (or detached from the computational graph in deep learning frameworks) when differentiating. This choice is explained in Appendix A.7.
Figure 3: The gradient at q moves to e via the STE (middle) and rotation trick (right). The STE "copies-and-pastes" the gradient to preserve its direction while the rotation trick moves the gradient so the angle between q and $\nabla_q L$ is preserved (proved in Appendix A.6).
While the rotation trick does not change the output of the forward pass, the backward pass changes. Rather than set $\frac{\partial q}{\partial e} = I$ as in the STE, the rotation trick sets $\frac{\partial q}{\partial e}$ to be a rotation and rescaling transformation:
$$\frac{\partial q}{\partial e} = \frac{\|q\|}{\|e\|} R$$
As a result, $\frac{\partial q}{\partial e}$ changes based on the position of e in the codebook partition of q, and notably, the angle between $\nabla_q L$ and q is preserved as $\nabla_q L$ moves to e. This effect is visualized in Figure 3. While the STE translates the gradient from q to e, the rotation trick rotates it so that the angle between $\nabla_q L$ and q is preserved. In a sense, the rotation trick and the STE are siblings. They choose different characteristics of the gradient as desiderata and then preserve those characteristics as the gradient flows around the non-differentiable vector quantization operation to the encoder.
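The angle-preservation property can be verified numerically. The sketch below builds the Jacobian explicitly from the closed form for R given in Section 4.2 and applies its transpose to a stand-in gradient; the vectors `e`, `q`, and `grad_q` are arbitrary values chosen for the example, and the explicit matrix is used only for clarity.

```python
import numpy as np

def rotation_jacobian(e, q):
    """Return (||q|| / ||e||) R, the rotate-and-rescale matrix that maps e to q."""
    e_hat, q_hat = e / np.linalg.norm(e), q / np.linalg.norm(q)
    r = (e_hat + q_hat) / np.linalg.norm(e_hat + q_hat)
    R = np.eye(len(e)) - 2.0 * np.outer(r, r) + 2.0 * np.outer(q_hat, e_hat)
    return (np.linalg.norm(q) / np.linalg.norm(e)) * R

def angle(u, v):
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

e = np.array([0.6, 0.9, -0.2])
q = np.array([1.0, 0.5, 0.1])
grad_q = np.array([0.3, -1.2, 0.7])   # stand-in for dL/dq from the decoder

J = rotation_jacobian(e, q)           # held constant during the backward pass
grad_e_rot = J.T @ grad_q             # rotation trick: (dL/dq)(dq/de) as a column vector
grad_e_ste = grad_q                   # STE: dq/de = I
```

Here `J @ e` reproduces `q`, and the angle between `grad_e_rot` and `e` equals the angle between `grad_q` and `q`, whereas `grad_e_ste` keeps the direction of `grad_q` itself.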

Section: EFFICIENT ROTATION COMPUTATION
The rotation transformation R that rotates e to q can be efficiently computed with Householder matrix reflections. We define $\hat{e} = \frac{e}{\|e\|}$, $\hat{q} = \frac{q}{\|q\|}$, $\lambda = \frac{\|q\|}{\|e\|}$, and $r = \frac{\hat{e} + \hat{q}}{\|\hat{e} + \hat{q}\|}$. Then the rotation and rescaling that aligns e to q is simply:
$$q = \lambda R e = \lambda (I - 2rr^T + 2\hat{q}\hat{e}^T) e = \lambda [e - 2rr^T e + 2\hat{q}\hat{e}^T e]$$
Due to space constraints, we leave the derivation of this formula to Appendix A.5. Parameterizing the rotation in this fashion avoids computing outer products and therefore consumes minimal GPU VRAM. Further, we did not detect a difference in wall-clock time between VQ-VAEs trained with the STE and VQ-VAEs trained with the rotation trick for our experiments in Section 5.
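A batched sketch of the matrix-free form is given below. It evaluates $\lambda[e - 2rr^Te + 2\hat{q}\hat{e}^Te]$ with dot products only. The function name `rotate_to` and the small `eps` guard against division by zero are assumptions of this example; in an autodiff framework, the quantities `r`, `q_hat`, `e_hat`, and the scale would additionally be detached from the graph, as described in Section 4.1.

```python
import numpy as np

def rotate_to(e, q, eps=1e-8):
    """Compute lambda * R e row-wise for e, q of shape (n, d) without building R."""
    e_norm = np.linalg.norm(e, axis=-1, keepdims=True)
    q_norm = np.linalg.norm(q, axis=-1, keepdims=True)
    e_hat, q_hat = e / (e_norm + eps), q / (q_norm + eps)
    r = e_hat + q_hat
    r = r / (np.linalg.norm(r, axis=-1, keepdims=True) + eps)
    # r r^T e and q_hat e_hat^T e reduce to a dot product followed by a scaling.
    r_dot_e = (r * e).sum(axis=-1, keepdims=True)
    e_hat_dot_e = (e_hat * e).sum(axis=-1, keepdims=True)
    rotated = e - 2.0 * r * r_dot_e + 2.0 * q_hat * e_hat_dot_e
    return (q_norm / (e_norm + eps)) * rotated

rng = np.random.default_rng(0)
e = rng.normal(size=(4, 8))   # hypothetical encoder outputs
q = rng.normal(size=(4, 8))   # hypothetical matched codebook vectors
out = rotate_to(e, q)
```

Up to the `eps` guard, `out` matches `q`, and the largest intermediate has the same shape as `e` rather than d × d per vector.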

Section: VORONOI PARTITION ANALYSIS
In the context of lossy compression, vector quantization works well when the distortion, or equivalently quantization error $\|e - q\|_2^2$, is low and the information capacity (equivalently, codebook utilization) is high (Cover, 1999). Later in Section 5, we will see that VQ-VAEs trained with the rotation trick have these desiderata, often reducing quantization error by an order of magnitude and substantially increasing codebook usage, when compared to VQ-VAEs trained with the STE. However, the underlying reason why this occurs is less clear.
[Figure 5, plot title: Change in Distance Between and After an Update]
In this section, we analyze the effect of the rotation trick by looking at how encoder outputs that are mapped to the same Voronoi region are updated. While the STE applies the same update to all points within the same partition, the rotation trick changes the update based on the location of points within the Voronoi region. It can push points within the same region farther apart or pull them closer together depending on the direction of the gradient vector. The former capability can correspond to increased codebook usage while the latter can correspond to lower quantization error.
[Figure 4, panel titles: Voronoi Partition | STE Updates | Rotation Trick Updates]
Let θ be the angle between e and q and ϕ be the angle between q and $\nabla_q L$. When $\nabla_q L$ and q point in the same direction, i.e. $-\pi/2 < \phi < \pi/2$, encoder outputs with large angular distance to q are pushed farther away than they would otherwise be moved by the STE update. Figure 5 illustrates this effect. The points with large angular distance (blue regions) move further away from q than the points with low angular distance (ivory regions).
The top right partitions of Figure 4 present an example of this effect. The two clusters of points at the boundary, with relatively large angle to the codebook vector, are pushed away while the cluster of points with small angle to the codebook vector moves with it. The ability to push points at the boundary out of a quantized region and into another is desirable for increasing codebook utilization. Specifically, codebook utilization improves when points are pushed into the Voronoi regions of previously unused codebook vectors. This capability is not shared by the STE, which moves all points in the same region by the same amount.
When $\nabla_q L$ and q point in opposite directions, i.e. $\pi/2 < \phi < 3\pi/2$, the distance among points within the same Voronoi region decreases as they are pulled towards the location of the updated codebook vector. This effect is visualized in Figure 5 (green regions) and the bottom partitions of Figure 4 show an example. Unlike the STE update, which maintains the distances among points, the rotation trick pulls points with high angular distances closer towards the post-update codebook vector. This capability is desirable for reducing the quantization error and enabling the encoder to lock on (Van Den Oord et al., 2017) to a target codebook vector.
Published as a conference paper at ICLR 2025
Taken together, both capabilities can form a push-pull effect that achieves two desiderata of vector quantization: increasing information capacity and reducing distortion. Encoder outputs that have large angular distance to the chosen codebook vector are "pushed" to other, possibly unused, codebook regions by outwards-pointing gradients, thereby increasing codebook utilization. Concurrent with this effect, center-pointing gradients will "pull" points loosely clustered around the codebook vector closer together, locking on to the chosen codebook vector and reducing quantization error.
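The push-pull behavior can be illustrated with a small 2-D sketch that compares the distance between two encoder outputs in the same region before and after one update. The sign convention is an assumption of this sketch: the transported vector `u` is treated as the direction in which points are moved, and the points, step size, and helper names are hypothetical.

```python
import numpy as np

def rotation_jacobian(e, q):
    e_hat, q_hat = e / np.linalg.norm(e), q / np.linalg.norm(q)
    r = (e_hat + q_hat) / np.linalg.norm(e_hat + q_hat)
    R = np.eye(len(e)) - 2.0 * np.outer(r, r) + 2.0 * np.outer(q_hat, e_hat)
    return (np.linalg.norm(q) / np.linalg.norm(e)) * R

def pairwise_distance_after(points, q, u, step, use_rotation):
    moved = []
    for e in points:
        # Transport the update vector u from q to e with either estimator.
        u_e = rotation_jacobian(e, q).T @ u if use_rotation else u
        moved.append(e + step * u_e)
    return np.linalg.norm(moved[0] - moved[1])

q = np.array([1.0, 0.0])
points = [np.array([0.9, 0.4]), np.array([0.9, -0.4])]  # same region, opposite sides of q
before = np.linalg.norm(points[0] - points[1])

outward, inward = q.copy(), -q.copy()   # update aligned with q, and opposed to q
ste_out = pairwise_distance_after(points, q, outward, 0.1, use_rotation=False)
rot_out = pairwise_distance_after(points, q, outward, 0.1, use_rotation=True)
rot_in = pairwise_distance_after(points, q, inward, 0.1, use_rotation=True)
```

With the STE the pairwise distance is unchanged, since both points receive the same translation. With the rotation trick, the outward update spreads the two points apart and the inward update draws them together.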

Section: FURTHER ANALYSIS
The Appendix contains several supplementary analyses. Appendix A.2 compares the rotation trick with the STE for a non-convex synthetic example; Appendix A.4 looks at the behavior far away from the origin; and Appendix A.8 analyzes the effect of using a reflection rather than a rotation. Finally, Appendix A.9 examines scaling the gradient's norm by $\frac{\|q\|}{\|e\|}$ and explores alternatives.

Section: EXPERIMENTS
In Section 4.3, we showed the rotation trick enables behavior that would increase codebook utilization and reduce quantization error by changing how points within the same Voronoi region are updated. However, the extent to which these changes will affect applications is unclear. In this section, we evaluate the effect of the rotation trick across many different VQ-VAE paradigms.
We begin with image reconstruction: training a VQ-VAE with the reconstruction objective of Van Den Oord et al. (2017) and later extend our evaluation to the more complex VQGANs (Esser et al., 2021), the VQGANs designed for latent diffusion (Rombach et al., 2022), and then the ViT-VQGAN (Yu et al., 2021). Finally, we evaluate VQ-VAE reconstructions on videos using a TimeSformer (Bertasius et al., 2021) encoder and decoder. Due to space constraints, the video results are presented in Appendix A.1. In total, our empirical analysis spans 11 different VQ-VAE configurations. For all experiments, aside from handling $\frac{\partial q}{\partial e}$ differently, the models, hyperparameters, and training settings are identical and described in Appendix A.10.

Section: VQ-VAE EVALUATION
We begin with a straightforward evaluation: training a VQ-VAE to reconstruct examples from ImageNet (Deng et al., 2009). Following Van Den Oord et al. (2017), our training objective is a linear combination of the reconstruction, codebook, and commitment loss:
$$L = \|x - \hat{x}\|_2^2 + \|\text{sg}(e) - q\|_2^2 + \beta\|e - \text{sg}(q)\|_2^2$$
where β is a hyperparameter scaling constant. Following convention, we drop the codebook loss term from the objective and instead use an exponential moving average to update the codebook vectors.
Evaluation Settings. For 256 × 256 × 3 input images, we evaluate two different settings: (1) compressing to a latent space of dimension 32 × 32 × 32 with a codebook size of 1024 following Yu et al. (2021) and (2) compressing to 64 × 64 × 3 with a codebook size of 8192 following Rombach et al. (2022). In both settings, we compare with a Euclidean and cosine similarity codebook lookup.
Evaluation Metrics. We log both training and validation set reconstruction metrics. Of note, we compute reconstruction FID (Heusel et al., 2017) and reconstruction IS (Salimans et al., 2016) on reconstructions from the full ImageNet validation set as a measure of reconstruction quality. We also compute codebook usage, or the percentage of codebook vectors that are used in each batch of data, as a measure of the information capacity of the vector quantization layer and quantization error $\|e - q\|_2^2$ as a measure of distortion.
Baselines. Our comparison spans the STE (VQ-VAE), stochastic quantization with Gumbel-Softmax (Baevski et al., 2019) (Gumbel VQ-VAE), the Hessian approximation described in Section 3 (VQ-VAE w/ Hessian Approx), the exact gradient backward pass described in Section 3 (VQ-VAE w/ Exact Gradients), and the rotation trick (VQ-VAE w/ Rotation Trick). All methods share the same architecture, hyperparameters, and training settings, and these settings are summarized in Table 8 of the Appendix. There is no functional difference among methods in the forward pass; the only difference relates to how gradients are propagated through $\frac{\partial q}{\partial e}$ during backpropagation.
Results. Table 1 displays our findings. We find that using the rotation trick reduces the quantization error, sometimes by an order of magnitude, and improves low codebook utilization.
Both results are expected given the Voronoi partition analysis in Section 4.3: points at the boundary of quantized regions are likely pushed to under-utilized codebook vectors while points loosely grouped around the codebook vector are condensed towards it. These two features appear to have a meaningful effect on reconstruction metrics: training a VQ-VAE with the rotation trick substantially improves r-FID and r-IS.
We also see that the Hessian Approximation or using Exact Gradients results in poor reconstruction performance. While the gradients to the encoder are, in a sense, "more accurate", training the encoder like an AutoEncoder (Hinton & Zemel, 1993) likely introduces overfitting and poor generalization. Moreover, the mismatch in training objectives between the encoder and decoder is likely an aggravating factor and partly responsible for both models' poor performance.
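The two vector quantization metrics used in this evaluation, codebook usage and quantization error, can be sketched as follows. Averaging the quantization error over the batch is an assumption of this example, as are the function names and toy data.

```python
import numpy as np

def codebook_usage(indices, codebook_size):
    """Percentage of codebook vectors selected at least once in a batch."""
    return 100.0 * np.unique(indices).size / codebook_size

def quantization_error(e, q):
    """Squared L2 distance between encoder outputs and codewords, averaged over the batch."""
    return float(((e - q) ** 2).sum(axis=-1).mean())

# Hypothetical batch: three encoder outputs and a codebook of four codewords.
codebook = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
e = np.array([[0.9, 0.1], [0.8, -0.1], [0.1, 1.2]])
idx = ((e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1).argmin(axis=1)

usage = codebook_usage(idx, len(codebook))
error = quantization_error(e, codebook[idx])
```

In this toy batch only two of the four codewords are selected, so the usage is 50 percent.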

Section: VQGAN EVALUATION
Moving to the next level of complexity, we evaluate the effect of the rotation trick on VQGANs (Esser et al., 2021). The VQGAN training objective is:
$$L_{\text{VQGAN}} = L_{\text{Per}} + \|\text{sg}(e) - q\|_2^2 + \beta\|e - \text{sg}(q)\|_2^2 + \lambda L_{\text{Adv}}$$
where $L_{\text{Per}}$ is the perceptual loss from Johnson et al. (2016) and replaces the $L_2$ loss used to train VQ-VAEs. $L_{\text{Adv}}$ is a patch-based adversarial loss similar to the adversarial loss in Conditional GAN (Isola et al., 2017). β is a constant that weights the commitment loss while λ is an adaptive weight based on the ratio of $\nabla L_{\text{Per}}$ to $\nabla L_{\text{Adv}}$ with respect to the last layer of the decoder.
Experimental Settings. We evaluate VQGANs under two settings: (1) the paradigm amenable to autoregressive modeling with Transformers as described in Esser et al. (2021) and (2) the paradigm suitable to latent diffusion models as described in Rombach et al. (2022). The first setting follows the convolutional neural network and default hyperparameters described in Esser et al. (2021) while the second follows those from Rombach et al. (2022). A full description of both training settings is provided in Table 9 of the Appendix.
Results. Our results are listed in Table 2 for the first setting and Table 3 for the second. Similar to our findings in Section 5.1, we find that training a VQGAN with the rotation trick substantially decreases quantization error and improves codebook usage. Moreover, reconstruction performance as measured on the validation set by the total loss, r-FID, and r-IS is improved across both modeling paradigms.

Section: VIT-VQGAN EVALUATION
Improving upon the VQGAN model, Yu et al. (2021) propose using a ViT (Dosovitskiy, 2020)
Results. Table 4 summarizes our findings. Similar to our previous results for VQ-VAEs in Section 5.1 and VQGANs in Section 5.2, codebook utilization and reconstruction metrics are significantly improved; however, in this case, the quantization error is roughly the same.

Section: LIMITATIONS


Figure 6 (panels: Rotation Trick, STE): Illustration of the rotation trick "over-rotating" vectors when the angle between $e_1$ and q is obtuse.
A limitation of the rotation trick can arise when the encoder outputs or codebook vectors are forced to be close to 0 norm (i.e., ∥e∥ ≈ 0 or ∥q∥ ≈ 0). In this case, the angle between e and q may be obtuse. When this happens, the rotation trick will "over-rotate" the gradient $\nabla_q L$ as it is transported from q to e so that $\nabla_q L$ and $\nabla_e L$ now point in different directions (i.e. the cosine of the angle between $\nabla_e L$ and $\nabla_q L$ will be negative). An example is visualized in Figure 6. This is undesirable because, when the angle between e and q is obtuse, the rotation trick will violate the assumption that $\nabla_q L \approx \nabla_e L$ when e ≈ q, and it will likely result in worse performance than VQ-VAEs trained with the STE. Obtuse angles between e and q are very unlikely because, by design, the codebook vectors should be "angularly close" to the vectors that are mapped to them. However, if there is a restriction that forces codewords to have near 0 norm, then the rotation trick will likely perform worse than the STE.
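The over-rotation effect can be reproduced in two dimensions, where the cosine between the transported gradient and the original gradient equals the cosine of the angle between e and q. The vectors below are hypothetical values chosen so that one encoder output forms an acute angle with q and the other, which has near-zero norm, forms an obtuse angle.

```python
import numpy as np

def rotation_jacobian(e, q):
    e_hat, q_hat = e / np.linalg.norm(e), q / np.linalg.norm(q)
    r = (e_hat + q_hat) / np.linalg.norm(e_hat + q_hat)
    R = np.eye(len(e)) - 2.0 * np.outer(r, r) + 2.0 * np.outer(q_hat, e_hat)
    return (np.linalg.norm(q) / np.linalg.norm(e)) * R

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

q = np.array([1.0, 0.0])
e_acute = np.array([0.8, 0.3])
e_obtuse = np.array([-0.05, 0.02])   # near-zero norm, obtuse angle to q
grad_q = np.array([1.0, 0.5])        # stand-in for dL/dq

# Cosine between the gradient at q and the gradient transported to e.
cos_acute = cosine(rotation_jacobian(e_acute, q).T @ grad_q, grad_q)
cos_obtuse = cosine(rotation_jacobian(e_obtuse, q).T @ grad_q, grad_q)
```

For the acute case the transported gradient stays broadly aligned with `grad_q`; for the obtuse case the cosine is negative, which is the "over-rotation" described above.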

Section: CONCLUSION
In this work, we explore different ways to propagate gradients through the vector quantization layer of VQ-VAEs and find that preserving the angle, rather than the direction, between the codebook vector and gradient induces desirable effects for how points within the same codebook region are updated. These effects cause a substantial improvement in model performance. Across 11 different settings, we find that training VQ-VAEs with the rotation trick improves their reconstructions. For example, training one of the VQGANs used in latent diffusion with the rotation trick improves r-FID from 5.0 to 1.1 and r-IS from 141.5 to 200.2, reduces quantization error by two orders of magnitude, and increases codebook usage by 13.5x.

Section: A APPENDIX

Section: A.1 VIDEO EVALUATION
Expanding our analysis beyond the image modality, we evaluate the effect of the rotation trick on video reconstructions from the BAIR Robot dataset (Ebert et al., 2017) and from the UCF101 action recognition dataset (Soomro, 2012). We follow the quantization paradigm used by ViT-VQGAN, but replace the ViT with a TimeSformer (Bertasius et al., 2021)

Section: A.2
To supplement our analysis in Section 4.3, we include a numerical simulation of vector quantization for minimizing Himmelblau's function (Figure 7) across 100 gradient updates for the STE and rotation trick gradient estimators to highlight the differences in their behaviors. Our simulation uses an EMA with a decay rate of 0.8 as described in Van Den Oord et al. (2017) to update the codebook vectors and a learning rate of 1e-3 to update the pre-quantized points.
Both the STE and the rotation trick simulations use the same random initialization for the codewords and the pre-quantized vectors. The only difference is whether the STE or the rotation trick is used as the gradient estimator through the vector quantization operation.
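A simulation of this kind can be sketched as below. This is not the authors' script: the number of points and codewords, the sampling range, the `eps` guard, and the simplified EMA (each used codeword moves toward the mean of its assigned points) are all assumptions of the example. Only the objective, the 100 updates, the decay rate of 0.8, and the learning rate of 1e-3 come from the description above.

```python
import numpy as np

def himmelblau_grad(z):
    """Gradient of f(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2."""
    x, y = z[..., 0], z[..., 1]
    a, b = x**2 + y - 11.0, x + y**2 - 7.0
    return np.stack([4.0 * x * a + 2.0 * b, 2.0 * a + 4.0 * y * b], axis=-1)

def transport(g, e, q, use_rotation, eps=1e-8):
    """Move per-point gradients g from q to e with the STE or the rotation trick."""
    if not use_rotation:
        return g
    e_norm = np.linalg.norm(e, axis=-1, keepdims=True) + eps
    q_norm = np.linalg.norm(q, axis=-1, keepdims=True) + eps
    e_hat, q_hat = e / e_norm, q / q_norm
    r = e_hat + q_hat
    r = r / (np.linalg.norm(r, axis=-1, keepdims=True) + eps)
    # Apply (lambda R)^T = lambda (I - 2 r r^T + 2 e_hat q_hat^T) using dot products.
    rotated = (g - 2.0 * r * (r * g).sum(-1, keepdims=True)
               + 2.0 * e_hat * (q_hat * g).sum(-1, keepdims=True))
    return (q_norm / e_norm) * rotated

def simulate(points, codebook, use_rotation, steps=100, lr=1e-3, decay=0.8):
    points, codebook = points.copy(), codebook.copy()
    for _ in range(steps):
        d2 = ((points[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = d2.argmin(axis=1)
        q = codebook[idx]
        new_points = points - lr * transport(himmelblau_grad(q), points, q, use_rotation)
        # Simplified EMA update of every codeword that was selected in this step.
        for k in np.unique(idx):
            codebook[k] = decay * codebook[k] + (1.0 - decay) * points[idx == k].mean(axis=0)
        points = new_points
    return points, codebook

rng = np.random.default_rng(0)
init_points = rng.uniform(-4.0, 4.0, size=(64, 2))
init_codebook = rng.uniform(-4.0, 4.0, size=(16, 2))
ste_points, ste_codebook = simulate(init_points, init_codebook, use_rotation=False)
rot_points, rot_codebook = simulate(init_points, init_codebook, use_rotation=True)
```

Both runs start from the same `init_points` and `init_codebook`, so any difference between the final configurations is attributable to the gradient estimator alone.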

Section: A.3 HESSIAN APPROXIMATION AND EXACT GRADIENT ANALYSIS
In this section, we expand our analysis in Section 3 and offer some intuition for why using exact gradients, or a Hessian approximation of the exact gradients, may convey undesirable characteristics. We begin by showing that the Hessian approximates the exact gradient up to the second-order term with a Taylor series expansion. We can write the loss $L_e$ exactly as an infinite series around q:
$$L_e = L_q + (\nabla_q L)^T (e - q) + \frac{1}{2}(e - q)^T (\nabla_q^2 L)(e - q) + \frac{1}{6}(e - q)^T \nabla_q^3 L (e - q, e - q) + \ldots$$
so that the loss computed by the Hessian approximation differs from the loss computed with the exact gradients method by the remainder term from truncating the Taylor series expansion after the second-order term:
$$\{L_e\}_{\text{Hessian}} = L_q + (\nabla_q L)^T (e - q) + \frac{1}{2}(e - q)^T (\nabla_q^2 L)(e - q)$$
When differentiating both of these losses to compute the gradients, the difference between the exact gradient update and the Hessian update is:
$$\frac{\partial L_e}{\partial e} - \left\{\frac{\partial L_e}{\partial e}\right\}_{\text{Hessian}} = \frac{\partial}{\partial e} O(\|e - q\|^3)$$
where $O(\|e - q\|^3) = \frac{1}{6}(e - q)^T \nabla_q^3 L (e - q, e - q) + \ldots$
Figure 9: [...] As the loss in each partition is quadratic, the exact gradient will equal the Hessian approximation. Notice that when q ≈ e, $\nabla_q L \approx \nabla_e L$ for both the STE and the rotation trick. As the Hessian approximation and exact gradients use the curvature of the loss surface to move $\nabla_q L$ from q to e, the direction of the gradient can change substantively, even when q ≈ e.
The Hessian idea described in Section 3 approximates the exact gradients to the encoder as if quantization did not occur, i.e. it approximates the gradient used to update the encoder in the original AutoEncoder (Hinton & Zemel, 1993) model.
We now explore some instances where the exact gradients, or their Hessian approximation, may produce undesirable behavior in vector quantization. An inductive bias (Baxter, 2000) for vector quantization to work well is that when e is "close" to q, their gradients are also "close", i.e. if e ≈ q then $\nabla_e L \approx \nabla_q L$. Intuitively, if the distortion between e and q is small (i.e. q is a very good codeword for e), then these points should move together during a gradient update. If they do not, the distortion would increase.
This assumption holds for both the STE and Rotation Trick gradients; however, it can be violated by the Hessian approximation or the exact gradient approaches, especially when the curvature around q is negative or the Hessian is indefinite and forms a saddle point.
Figure 9 illustrates three such cases. As both the STE and Rotation Trick do not use the loss surface to move $\nabla_q L$ from q to e, when q ≈ e, $\nabla_q L \approx \nabla_e L$. However, approaches that use the curvature around q, such as the Hessian approximation or exact gradients, to either find or approximate the loss at e can have $\nabla_e L$ point in a very different direction from $\nabla_q L$, even when q is close to e. The top-left and bottom partitions of Figure 9 scatter the gradients as they move from q to the points in these partitions due to negative curvature. A similar effect occurs in the top-right partition of Figure 9 due to the presence of a saddle point.

Section: A.4 BEHAVIOR AWAY FROM THE ORIGIN
Unlike the STE, the rotation trick is not invariant to the location of the origin. In this section, we explore this characteristic and its effect on how points within the same Voronoi region are updated. For example, suppose each codebook vector and encoder output in Figure 4 were shifted by some constant vector so that each now has all positive components. How would this affect the rotation trick's gradient estimator?
... at the codebook vector (orange circle) when all points are far from the origin. The STE is invariant to this; however, as the angle between e and q decreases when these vectors are translated away from the origin, the effect of the rotation trick will decrease. In the limit, the rotation trick reduces to the STE.
Figure 11: Illustration of codebook and encoder output shifted away from the origin by a constant vector d. The angle after the shift is smaller than the angle before the shift: $\hat{\theta} < \theta$.
Consider one codebook vector q and one encoder output e separated by angle θ. We define $\hat{q} = q + d$ and $\hat{e} = e + d$ where d is some large displacement vector. Let $\hat{\theta}$ be the angle between $\hat{q}$ and $\hat{e}$. We visualize this example in Figure 11. From the law of cosines:
$$\|q - e\|^2 = \|q\|^2 + \|e\|^2 - 2\|q\|\|e\|\cos\theta$$
and
$$\|\hat{q} - \hat{e}\|^2 = \|q - e\|^2 = \|\hat{q}\|^2 + \|\hat{e}\|^2 - 2\|\hat{q}\|\|\hat{e}\|\cos\hat{\theta}$$
Substituting, we find that
$$\cos\hat{\theta} = \frac{\|q\|^2 + \|e\|^2 - 2\|q\|\|e\|\cos\theta - \|q + d\|^2 - \|e + d\|^2}{-2\|q + d\|\|e + d\|}$$
and consider the case when $\hat{q}$ and $\hat{e}$ are far from the origin, i.e., $\|d\| \gg \|q\|, \|e\|$. Then we have:
$$\cos\hat{\theta} \approx \frac{-2\|d\|^2}{-2\|d\|^2} = 1$$
So as $\|d\| \to \infty$, $\hat{\theta} \to 0$. This implies that $\frac{\|\hat{q}\|}{\|\hat{e}\|} \to 1$ and $R \to I$, which is exactly the STE update. As points move away from the origin, the rotation trick smoothly transforms into the STE.
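The limiting behavior derived above can be illustrated numerically. This is a small sketch under stated assumptions: the two-dimensional vectors `e` and `q` and the list of shift magnitudes are arbitrary choices for illustration, and `rotation` uses the closed form $I - 2rr^T + 2qe^T$ from Appendix A.5 applied to normalized inputs.

```python
import numpy as np

def angle(a, b):
    """Angle in radians between vectors a and b."""
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def rotation(e, q):
    """Rotation aligning e/|e| with q/|q|: I - 2 r r^T + 2 q_unit e_unit^T."""
    e_unit = e / np.linalg.norm(e)
    q_unit = q / np.linalg.norm(q)
    r = (e_unit + q_unit) / np.linalg.norm(e_unit + q_unit)
    return np.eye(len(e)) - 2.0 * np.outer(r, r) + 2.0 * np.outer(q_unit, e_unit)

e = np.array([1.0, 0.2])   # illustrative encoder output
q = np.array([0.3, 1.0])   # illustrative codebook vector

angles, deviations, norm_ratios = [], [], []
for shift in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    d = np.full(2, shift)                  # constant displacement vector
    e_shifted, q_shifted = e + d, q + d
    R = rotation(e_shifted, q_shifted)
    angles.append(angle(e_shifted, q_shifted))
    deviations.append(np.linalg.norm(R - np.eye(2)))   # distance of R from I
    norm_ratios.append(np.linalg.norm(q_shifted) / np.linalg.norm(e_shifted))

for a, dev, ratio in zip(angles, deviations, norm_ratios):
    print(f"angle={a:.5f}  |R - I|={dev:.5f}  |q|/|e|={ratio:.5f}")
```

As the shift grows, the printed angle and the distance of R from the identity both shrink while the norm ratio approaches one, which is the STE limit.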
We visualize an example of this effect in Figure 10, where each point from Figure 4 is translated by positive ten along each dimension. As illustrated above, the effect for the "push" gradient in the top-right quadrant remains but its effect is reduced, i.e., more similar to the STE update. The top-left partition becomes a "pull" because the gradient now points towards the origin, so points within this region move closer together. Finally, the gradient in the bottom region no longer points towards the origin, but is now more orthogonal to the codebook vector. As a result, we see more of a rotation applied to the points in this region than the contraction that is depicted in Figure 4.

Section: A.5 HOUSEHOLDER REFLECTION TRANSFORMATION
For any given e and q, the rotation R that aligns e with q in the plane spanned by both vectors can be efficiently computed with Householder matrix reflections.
Definition 1 (Householder Reflection Matrix). For a unit norm vector $a \in \mathbb{R}^d$, $I - 2aa^T \in \mathbb{R}^{d \times d}$ is a reflection matrix across the subspace (hyperplane) orthogonal to a.
Returning to vector quantization with $\tilde{q} = \left[\frac{\|q\|}{\|e\|} R\right] e$, we can write R as the product of two Householder reflection matrices that rotate e to q in the plane spanned by them. Without loss of generality, assume e and q are unit norm, and let θ be the angle between e and q. Setting $r = \frac{e + q}{\|e + q\|}$ and simplifying yields:
$$\begin{aligned}
R &= (I - 2qq^T)(I - 2rr^T) \\
&= I - 2qq^T - 2rr^T + 4qq^T rr^T \\
&= I - 2qq^T - 2rr^T + 4q\left[q^T r\right] r^T \\
&= I - 2qq^T - 2rr^T + 4q\left[q^T \frac{e + q}{\|e + q\|}\right] r^T \\
&= I - 2qq^T - 2rr^T + 4q\left[\frac{q^T e + q^T q}{\|e + q\|}\right] r^T \\
&= I - 2qq^T - 2rr^T + 4q\left[\frac{\|q\|\|e\|\cos\theta + \|q\|\|q\|}{\|e + q\|}\right] r^T \\
&= I - 2qq^T - 2rr^T + 4q\left[\frac{\cos\theta + 1}{\|e + q\|}\right] r^T \\
&= I - 2qq^T - 2rr^T + 4q\left[\frac{\|e + q\|^2}{2\|e + q\|}\right] r^T \\
&= I - 2qq^T - 2rr^T + \frac{4\|e + q\|^2}{2\|e + q\|}\, q r^T \\
&= I - 2qq^T - 2rr^T + \frac{4\|e + q\|^2}{2\|e + q\|^2}\, q(e + q)^T \\
&= I - 2qq^T - 2rr^T + 2qe^T + 2qq^T \\
&= I - 2rr^T + 2qe^T
\end{aligned}$$
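The simplification above can be sanity-checked numerically. The sketch below assumes random unit-norm vectors in five dimensions (an arbitrary illustrative choice) and compares the product of the two Householder reflections with the closed form. The helper `apply_rotation` is a hypothetical name showing how the closed form can be applied to a vector without materializing the d × d matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Random unit-norm encoder output e and codebook vector q (illustration only).
e = rng.normal(size=d)
e /= np.linalg.norm(e)
q = rng.normal(size=d)
q /= np.linalg.norm(q)

I = np.eye(d)
r = (e + q) / np.linalg.norm(e + q)

# Product of two Householder reflections versus the simplified closed form.
R_two_reflections = (I - 2 * np.outer(q, q)) @ (I - 2 * np.outer(r, r))
R_closed_form = I - 2 * np.outer(r, r) + 2 * np.outer(q, e)

def apply_rotation(x, e, q):
    """Compute R x in O(d) time using R = I - 2 r r^T + 2 q e^T (unit e, q)."""
    r = (e + q) / np.linalg.norm(e + q)
    return x - 2 * r * (r @ x) + 2 * q * (e @ x)

x = rng.normal(size=d)
print(np.allclose(R_two_reflections, R_closed_form))
print(np.allclose(R_closed_form @ e, q))
print(np.allclose(apply_rotation(x, e, q), R_closed_form @ x))
```

The construction is undefined when e = -q, since e + q then has zero norm; the sketch does not handle that case.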

Section: A.6 PROOF THE ROTATION TRICK PRESERVES ANGLES
For encoder output e and corresponding codebook vector q, we provide a formal proof that the rotation trick preserves the angle between ∇ q L and q as ∇ q L moves to e. Unlike the notation in the main text, which assumes q ∈ R d×1 , we use batch notation in the following proof to illustrate how the rotation trick works when training neural networks. Specifically, q ∈ R b×d and R ∈ R b×d×d where b is the number of examples in a batch and d is the dimension of the codebook vector.
Remark 3. The angle between q and ∇ q L is preserved as ∇ q L moves to e.
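Proof. Without loss of generality, suppose ∥e∥ = ∥q∥ = 1. Then we have
$$\tilde{q} = eR^T, \qquad \frac{\partial \tilde{q}}{\partial e} = R$$
The gradient at e will then equal:
$$\nabla_e L = \nabla_q L \frac{\partial \tilde{q}}{\partial e} = \nabla_q L\,[R]$$
Let θ be the angle between q and $\nabla_q L$ and ϕ be the angle between e and $\nabla_e L$. Via the Euclidean inner product, we have:
$$\|\nabla_q L\|\cos\theta = q\,[\nabla_q L]^T = eR^T[\nabla_q L]^T = e\,[\nabla_q L\,R]^T = e\,[\nabla_e L]^T = \|\nabla_e L\|\cos\phi = \|\nabla_q L\|\cos\phi$$
where the final equality holds because R is orthogonal and therefore preserves norms. So θ = ϕ and the angle between q and $\nabla_q L$ is preserved as $\nabla_q L$ moves to e.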
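Remark 3 can also be checked numerically in the batch (row-vector) notation of the proof. This is a sketch under stated assumptions: the batch size, dimension, and random inputs are illustrative, and the per-example rotation uses the closed form from Appendix A.5.

```python
import numpy as np

rng = np.random.default_rng(1)
b, d = 8, 6   # illustrative batch size and codebook dimension

def unit_rows(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

e = unit_rows(rng.normal(size=(b, d)))   # encoder outputs, shape (b, d)
q = unit_rows(rng.normal(size=(b, d)))   # matched codebook vectors, shape (b, d)
grad_q = rng.normal(size=(b, d))         # gradient at q, one row per example

# Per-example rotation R[i] = I - 2 r_i r_i^T + 2 q_i e_i^T, shape (b, d, d).
r = unit_rows(e + q)
R = np.eye(d) - 2 * np.einsum("bi,bj->bij", r, r) + 2 * np.einsum("bi,bj->bij", q, e)

q_tilde = np.einsum("bj,bij->bi", e, R)      # e R^T, which should equal q
grad_e = np.einsum("bi,bij->bj", grad_q, R)  # grad_q R, the gradient moved to e

def cos_angle(u, v):
    """Row-wise cosine of the angle between u and v."""
    return np.sum(u * v, axis=1) / (np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))

print(np.allclose(q_tilde, q))
print(np.allclose(cos_angle(q, grad_q), cos_angle(e, grad_e)))
```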

Section: A.7 TREATING R AND ∥q∥/∥e∥ AS CONSTANTS
In the rotation trick, we treat R and $\frac{\|q\|}{\|e\|}$ as constants that are detached from the computational graph during the forward pass. In this section, we explain why this is the case.
The rotation trick computes the input to the decoder $\tilde{q}$ after performing a non-differentiable codebook lookup on e to find q. It is defined as:
$$\tilde{q} = \frac{\|q\|}{\|e\|} R e$$
As shown in Section 4, R is a function of both e and q. However, using the quantization function Q(e) = q, we can rewrite both $\frac{\|q\|}{\|e\|}$ and R as a single function of e:
$$f(e) = \frac{\|Q(e)\|}{\|e\|}\left[I - 2\left[\frac{e + Q(e)}{\|e + Q(e)\|}\right]\left[\frac{e + Q(e)}{\|e + Q(e)\|}\right]^T + 2Q(e)e^T\right] = \frac{\|q\|}{\|e\|} R$$
The rotation trick then becomes
$$\tilde{q} = f(e)e$$
and differentiating $\tilde{q}$ with respect to e gives us:
$$\frac{\partial \tilde{q}}{\partial e} = f'(e)e + f(e)$$
However, $f'(e)$ cannot be computed as it would require differentiating through Q(e), which is a nondifferentiable codebook lookup. We therefore drop this term and use only f(e) as our approximation of the gradient through the vector quantization layer: $\frac{\partial \tilde{q}}{\partial e} = f(e)$. This approximation conveys more information about the vector quantization operation than the STE, which sets $\frac{\partial \tilde{q}}{\partial e} = I$.
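To make the role of f(e) concrete, the following sketch computes it for a single encoder output and uses it as a frozen Jacobian in a hand-written backward pass. The three-entry codebook, the encoder output, and the upstream gradient are made-up values, and the function names `quantize` and `f` are hypothetical. The vectors are normalized inside the rotation so that the unit-norm assumption of Appendix A.5 is not required. This is not the released implementation, only an illustration of the approximation.

```python
import numpy as np

def quantize(e, codebook):
    """Q(e): nearest codebook vector under Euclidean distance."""
    idx = np.argmin(np.linalg.norm(codebook - e, axis=1))
    return codebook[idx]

def f(e, codebook):
    """(|q|/|e|) R, where R rotates e/|e| onto q/|q| and q = Q(e)."""
    q = quantize(e, codebook)
    e_unit = e / np.linalg.norm(e)
    q_unit = q / np.linalg.norm(q)
    r = (e_unit + q_unit) / np.linalg.norm(e_unit + q_unit)
    R = np.eye(len(e)) - 2 * np.outer(r, r) + 2 * np.outer(q_unit, e_unit)
    return (np.linalg.norm(q) / np.linalg.norm(e)) * R

# Illustrative values only.
codebook = np.array([[2.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
e = np.array([1.2, 0.5])
q = quantize(e, codebook)

F = f(e, codebook)     # treated as a constant: no gradient flows through it
q_tilde = F @ e        # forward pass, equals the codebook vector q

grad_q = np.array([0.3, -0.7])      # upstream gradient at q_tilde
grad_e_rotation = F.T @ grad_q      # column-vector form of grad_q @ F
grad_e_ste = grad_q                 # the STE uses the identity instead

print(q_tilde, q)
print(grad_e_rotation, grad_e_ste)
```

In an automatic differentiation framework, the same effect would typically be obtained by detaching `F` from the graph before multiplying by `e`.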

[Figure 12 panel labels: Gradient at q, Rotation Trick, Reflection Trick, STE]
Figure 12: Illustration of how the gradient at q moves to e via the STE, the rotation trick, and the reflection trick. The reflection trick matches the behavior of the rotation trick when the gradient ∇qL is parallel to q. However, it will reverse the components of the gradients orthogonal to q for points in q's partition. This effect is illustrated in the bottom two rows of the rightmost column.  

Section: A.8 THE REFLECTION TRICK
One may also use a single reflection to align e to q, rather than a rotation. For instance, using the notation from Appendix A.5, setting $r = \frac{e - q}{\|e - q\|}$ and reflecting across the plane orthogonal to this vector via the Householder reflection $(I - 2rr^T)$ will reflect e to q. We denote this reflection as R so that $\tilde{q} = \frac{\|q\|}{\|e\|} R e$. We call this approach "the reflection trick."
The reflection trick can result in undesirable behavior during the backward pass. While it replicates the rotation trick when ∇ q L is parallel to q, as illustrated in the top two rows of Figure 12 and the top-right and bottom regions of Figure 13, it reflects orthogonal components of the gradient across the hyperplane orthogonal to e -q so that these components are reversed. Simply, if the quantized gradient points "left" then the reflected gradient will point "right", and vice-versa. This behavior is undesirable for points with low distortion, e ≈ q, because it will cause e to move away from q along the components of the gradient orthogonal to q, thereby increasing distortion for two points that are a "good match". The top-left partition of Figure 13 illustrates one such example. In this case, the gradient pushes the codebook vector "left" while the points in this region are pushed in the opposite direction of the gradient.
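The sign reversal described above can be seen in a two-dimensional example. This is a sketch under stated assumptions: e and q are unit vectors separated by an arbitrary angle of 0.4 radians, and gradients are moved from q to e by multiplying with the transpose of each transformation (the column-vector form of $\nabla_q L\, R$).

```python
import numpy as np

theta = 0.4                                    # illustrative angle between e and q
e = np.array([1.0, 0.0])
q = np.array([np.cos(theta), np.sin(theta)])
I = np.eye(2)

# Rotation trick: R = I - 2 r r^T + 2 q e^T with r = (e + q) / |e + q|.
r_rot = (e + q) / np.linalg.norm(e + q)
R = I - 2 * np.outer(r_rot, r_rot) + 2 * np.outer(q, e)

# Reflection trick: a single Householder reflection with r = (e - q) / |e - q|.
r_ref = (e - q) / np.linalg.norm(e - q)
H = I - 2 * np.outer(r_ref, r_ref)

g_parallel = q.copy()                                     # gradient parallel to q
g_orthogonal = np.array([-np.sin(theta), np.cos(theta)])  # gradient orthogonal to q

rot_par, ref_par = R.T @ g_parallel, H.T @ g_parallel
rot_orth, ref_orth = R.T @ g_orthogonal, H.T @ g_orthogonal

print("parallel:  ", rot_par, ref_par)
print("orthogonal:", rot_orth, ref_orth)
```

Both transformations map e to q and agree on the parallel gradient, while the orthogonal gradient comes out with opposite signs.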
We evaluate this effect experimentally following the VQ-VAE evaluation paradigm from Table 1 and the VQGAN evaluation paradigm from Table 3. While we did not train these models to completion due to GPU resource limitations, both paradigms exhibited poor convergence when trained with the reflection trick. Specifically, after one epoch, the validation loss was approximately 3x higher than with the rotation trick for both the 8192 and 16384 codebook VQGANs in Table 3. For the Euclidean codebook model with latent shape 64 × 64 × 3 in Table 1, the validation loss was approximately 2x higher than with the rotation trick after 15 epochs.

Section: A.9 GRADIENT NORM SCALING IN THE ROTATION TRICK
In this section, we analyze the effect of the $\frac{\|q\|}{\|e\|}$ term in the rotation trick. While this norm rescaling is necessary to transform e into q during the forward pass, one could avoid the multiplicative factor by instead formulating the rotation trick as:
$$\tilde{q} = \underbrace{R}_{\text{constant}} e + \underbrace{(q - Re)}_{\text{constant}}$$
A possible benefit of this latter formulation is that $\frac{\partial \tilde{q}}{\partial e} = R$, an orthogonal transformation with determinant one that does not shrink or expand space by a factor of $\frac{\|q\|}{\|e\|}$. In this section, we analyze the differences between these two approaches and formulate both as specific instantiations of a more general family of rotation-based gradient approximations.
A.9.1 COMPARISON BETWEEN ∥q∥/∥e∥ AND (q − Re)
An inductive bias of vector quantization is that when e ≈ q, then ∇ e L ≈ ∇ q L. Simply, when the distortion between e and q is small, the gradient for both e and q should be approximately the same. However, when ∥e∥ ≈ 0 and a Euclidean metric is used to determine the closest codebook vector, the angle between e and q can be obtuse as illustrated in Figure 6. In this instance, the rotation trick will cause the gradient ∇ e L to "over-rotate" and point away from ∇ q L.
Using a gradient scaling of $\frac{\|q\|}{\|e\|}$ can fix this. When ∥e∥ ≈ 0 and ∥e∥ < ∥q∥, the norm of the gradient will be scaled up to push e away from the origin. Pushing e away from the origin makes the angle between e and q more of a factor when computing the Euclidean distance:
$$\|e - q\| = \sqrt{\|e\|^2 + \|q\|^2 - 2\|e\|\|q\|\cos\theta}$$
so e is more likely to map to a different q that forms an acute angle with it as ∥e∥ increases.
Now consider if ∥q∥ ≈ 0 and ∥e∥ > ∥q∥. When this occurs, the update to e will vanish because $\frac{\|q\|}{\|e\|} \approx 0$. This behavior may also be desirable because when q is close to the origin, there's a higher likelihood the angle between e and q would be obtuse.
We also explore this factor in ablation experiments for VQ-VAEs and VQGANs. Table 6 mirrors Table 1 and summarizes our findings for VQ-VAEs while Table 7 mirrors Table 3 and summarizes our findings for the VQGANs used in latent diffusion. In Table 6, we do not observe a difference between using $\tilde{q} = \frac{\|q\|}{\|e\|} Re$ and $\tilde{q} = Re + (q - Re)$. However, for the VQGAN results in Table 7, we find that using the gradient scaling factor modestly improves performance.
Table 6: Comparison of the rotation trick using $\tilde{q} = \frac{\|q\|}{\|e\|} Re$ with using $\tilde{q} = Re + (q - Re)$ for VQ-VAE models. The experimental setting follows Table 1.
Table 7: Comparison of the rotation trick using $\tilde{q} = \frac{\|q\|}{\|e\|} Re$ with using $\tilde{q} = Re + (q - Re)$ for VQGAN models. The models with codebook size were stopped after 2 epochs while the models with codebook size 16384 were stopped after 3 epochs.
A.9.2 GENERAL FAMILY OF ROTATION-BASED GRADIENT ESTIMATORS
Generalizing the additive and multiplicative formulations of the rotation trick, we formulate both as specific instantiations of a more general family:
$$\tilde{q} = \gamma(e)Re + (q - \gamma(e)Re)$$
where γ(e) determines the multiplicative scaling factor. For $\tilde{q} = \frac{\|q\|}{\|e\|} Re$, $\gamma(e) = \frac{\|q\|}{\|e\|}$ and for $\tilde{q} = Re + (q - Re)$, γ(e) = 1. However, one can explore other scaling factors such as
γ(e) = 1 8∥q -e∥ 2
We visualize the gradient fields for different formulations of γ(e) in Figure 14.
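The two named members of this family can be compared directly. The sketch below uses hypothetical helper names `forward` and `backward` and made-up three-dimensional inputs; γ is passed in as a plain number because it is treated as a constant during backpropagation. It only covers γ = ∥q∥/∥e∥ and γ = 1.

```python
import numpy as np

def rotation(e, q):
    """Rotation aligning e/|e| with q/|q| (closed form from Appendix A.5)."""
    e_unit = e / np.linalg.norm(e)
    q_unit = q / np.linalg.norm(q)
    r = (e_unit + q_unit) / np.linalg.norm(e_unit + q_unit)
    return np.eye(len(e)) - 2.0 * np.outer(r, r) + 2.0 * np.outer(q_unit, e_unit)

def forward(e, q, gamma):
    """q_tilde = gamma R e + (q - gamma R e); the value is always q."""
    Re = rotation(e, q) @ e
    return gamma * Re + (q - gamma * Re)

def backward(grad_q, e, q, gamma):
    """Gradient at e when gamma, R, and the residual are held constant."""
    return gamma * (rotation(e, q).T @ grad_q)

# Illustrative values only.
e = np.array([0.5, 0.1, -0.2])
q = np.array([1.0, 0.4, 0.3])
grad_q = np.array([0.2, -0.1, 0.6])

gamma_mult = np.linalg.norm(q) / np.linalg.norm(e)   # multiplicative formulation
gamma_add = 1.0                                      # additive formulation

grad_mult = backward(grad_q, e, q, gamma_mult)
grad_add = backward(grad_q, e, q, gamma_add)

print(forward(e, q, gamma_mult), forward(e, q, gamma_add))
print(np.linalg.norm(grad_mult), np.linalg.norm(grad_add), np.linalg.norm(grad_q))
```

Both choices produce the same forward value and the same gradient direction; they differ only in the gradient norm, which is what γ controls.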
It is almost certain that other formulations of γ(e) from the ones we explore in this work would improve the training dynamics or performance of VQ-VAEs. In particular, a priori fixing γ(e) to satisfy an inductive bias or developing an adaptive scaling factor that dynamically sets γ(e) similar to the functions that adapt task weights in multi-task learning throughout training (Kendall et al., 2018;Chen et al., 2018) are exciting directions for future work.

Section: A.10 TRAINING SETTINGS
We detail the training settings used in our experimental analysis in Section 5. While a text description can be helpful for understanding the experimental settings, our released code should be referenced to fully reproduce the results presented in this work.
A.10.1 VQ-VAE EVALUATION.
Table 8 summarizes the hyperparameters used for the experiments in Section 5.1. For the encoder and decoder architectures, we use the Convolutional Neural Network described by Esser et al. (2021).
The hyperparameters for the cosine similarity codebook lookup follow from Yu et al. (2021) and the hyperparameters for the Euclidean distance codebook lookup follow from the default values set in the Vector Quantization library from https://github.com/lucidrains/vector-quantize-pytorch. All models replace the codebook loss with the exponential moving average described in Van Den Oord et al. (2017) with decay = 0.8. The notation for both encoder and decoder architectures is adapted from Esser et al. (2021).
For the Gumbel VQ-VAE baseline, we follow the implementation of https://github.com/karpathy/deep-vector-quantization and use the suggested schedule to attenuate the softmax temperature from 1.0 to 1/16 over the course of training. Aside from the difference in quantization, i.e. deterministic ...
... work and do not claim that improving reconstruction, codebook usage, or quantization error in "Stage 1" VQ-VAE training will lead to improvements in "Stage 2" generative modeling applications.
While poor reconstruction performance will clearly lead to poor generative modeling, recent work (Yu et al., 2023) suggests that, at least for autoregressive modeling of codebook sequences with MaskGit (Chang et al., 2022), the connection between VQ-VAE reconstruction performance and downstream generative modeling performance is non-linear. Specifically, increasing the size of the codebook past a certain amount will improve VQ-VAE reconstruction performance but make downstream likelihood-based generative modeling of codebook vectors more difficult.
We believe this nuance may extend beyond MaskGit, and that the desiderata for likelihood-based generative models will likely be different than that for score-based generative models like diffusion.
It is even possible that different preferences appear within the same class. For example, left-to-right autoregressive modeling of codebook elements with Transformers (Vaswani, 2017) may exhibit different preferences for Stage 1 VQ-VAE models than those of MaskGit.
These topics deserve a deep, and rich, analysis that we would find difficult to include within this work as our focus is on propagating gradients through vector quantization layers. As a result, we entrust the exploration of these questions to future work.
A.12 GRADIENT ESTIMATORS AS PARALLEL TRANSPORT
In this section, we analyze the STE and the rotation trick through the lens of differential geometry, specifically as the parallel transport of the gradient vector $\nabla_q L$ from the codeword q to the encoder output e. For the analysis in this section, we only consider the rotational component $R_\theta$ of the rotation trick, not the rescaling by $\frac{\|q\|}{\|e\|}$.
A.12.1 BACKGROUND ON HYPERSPHERICAL COORDINATES
Hyperspherical coordinate systems are ubiquitous in applications of math and physics, where certain formulas become greatly simplified by parameterizing the location of points by the radius and angles to coordinate axes. A familiar instantiation of the hyperspherical coordinate system may be polar coordinates with radial component r and polar angle θ:
$$x = r\cos\theta, \qquad y = r\sin\theta$$
or the instantiation of the hyperspherical coordinate system for three dimensions, otherwise known as spherical coordinates, with radial component r, polar angle θ and azimuthal angle ϕ:
We outline one common conversion from hyperspherical coordinates to Cartesian coordinates below, and other conversions are equivalent up to permutation of the coordinate axes:
$$\begin{aligned}
x^1 &= r\cos(\theta_1) \\
x^2 &= r\sin(\theta_1)\cos(\theta_2) \\
x^3 &= r\sin(\theta_1)\sin(\theta_2)\cos(\theta_3) \\
&\;\;\vdots \\
x^{d-1} &= r\sin(\theta_1)\cdots\sin(\theta_{d-2})\cos(\theta_{d-1}) \\
x^d &= r\sin(\theta_1)\cdots\sin(\theta_{d-2})\sin(\theta_{d-1})
\end{aligned}$$
[Figure panels: Cartesian Coordinate System, Spherical Coordinate System]
and the reverse transform from Cartesian coordinates to hyperspherical coordinates:
$$\begin{aligned}
r &= \sqrt{(x^1)^2 + (x^2)^2 + \dots + (x^d)^2} \\
\theta_1 &= \operatorname{arctan2}\left(\sqrt{(x^d)^2 + \dots + (x^2)^2},\; x^1\right) \\
\theta_2 &= \operatorname{arctan2}\left(\sqrt{(x^d)^2 + \dots + (x^3)^2},\; x^2\right) \\
&\;\;\vdots \\
\theta_{d-2} &= \operatorname{arctan2}\left(\sqrt{(x^d)^2 + (x^{d-1})^2},\; x^{d-2}\right) \\
\theta_{d-1} &= \operatorname{arctan2}\left(\sqrt{(x^d)^2},\; x^{d-1}\right)
\end{aligned}$$
where arctan2(y, x) returns the angle, measured in radians over the support (−π, π], between the positive x-axis and the point (x, y).
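The two conversions above can be sketched as a pair of functions. One assumption is labeled in the code: the final angle is computed with the signed last coordinate, `arctan2(x[d-1], x[d-2])`, rather than with its absolute value, so that the round trip recovers points whose last coordinate is negative. The function names are hypothetical.

```python
import numpy as np

def to_cartesian(r, thetas):
    """Hyperspherical (r, theta_1..theta_{d-1}) -> Cartesian (x^1..x^d)."""
    thetas = np.asarray(thetas, dtype=float)
    d = len(thetas) + 1
    x = np.empty(d)
    sin_prod = 1.0                      # running product sin(theta_1)...sin(theta_k)
    for k in range(d - 1):
        x[k] = r * sin_prod * np.cos(thetas[k])
        sin_prod *= np.sin(thetas[k])
    x[d - 1] = r * sin_prod
    return x

def to_hyperspherical(x):
    """Cartesian (x^1..x^d) -> hyperspherical (r, thetas), for d >= 2."""
    x = np.asarray(x, dtype=float)
    d = len(x)
    r = np.linalg.norm(x)
    thetas = np.empty(d - 1)
    for k in range(d - 2):
        # Angle between the x^{k+1} axis and the remaining coordinates.
        thetas[k] = np.arctan2(np.linalg.norm(x[k + 1:]), x[k])
    # Assumption: use the signed last coordinate so the sign of x^d is kept.
    thetas[d - 2] = np.arctan2(x[d - 1], x[d - 2])
    return r, thetas

x = np.array([0.5, -1.0, 2.0, -0.3])
r, thetas = to_hyperspherical(x)
x_roundtrip = to_cartesian(r, thetas)
print(r, thetas)
print(x_roundtrip)
```

For d = 2 the pair reduces to the usual polar coordinate conversion.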
Unlike the Cartesian coordinate system, the hyperspherical basis vectors are not identical over the entire space; they change with position. For instance, moving outwards along r will increase the length of ∂ ∂θ i as an infinitesimal change in θ i will now cover a larger arclength distance-i.e. the line segment traveled by changing the angle θ i -than that same infinitesimal change with a smaller r. This effect is visualized for three dimensions in Figure 19.
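At any given point in hyperspherical coordinates $\tilde{p}$, the transformation from Cartesian basis vectors to hyperspherical basis vectors is:
$$\frac{\partial}{\partial \theta_i} = \sum_{k=1}^{d} \frac{\partial x^k}{\partial \theta_i} \frac{\partial}{\partial x^k}$$
where $\frac{\partial x^k}{\partial \theta_i}$ can be computed from the coordinate transform functions, i.e. $x^1 = r\cos(\theta_1)$. It is typical to express these relationships in a matrix that transforms an arbitrary vector v in Cartesian coordinates at point p to its counterpart in hyperspherical coordinates $\tilde{v}$ at $\tilde{p}$:
$$\begin{bmatrix} \frac{\partial}{\partial x^1} & \frac{\partial}{\partial x^2} & \cdots & \frac{\partial}{\partial x^d} \end{bmatrix}
\underbrace{\begin{bmatrix}
\frac{\partial x^1}{\partial r} & \frac{\partial x^1}{\partial \theta_1} & \cdots & \frac{\partial x^1}{\partial \theta_{d-1}} \\
\frac{\partial x^2}{\partial r} & \frac{\partial x^2}{\partial \theta_1} & \cdots & \frac{\partial x^2}{\partial \theta_{d-1}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial x^d}{\partial r} & \frac{\partial x^d}{\partial \theta_1} & \cdots & \frac{\partial x^d}{\partial \theta_{d-1}}
\end{bmatrix}}_{\text{The Jacobian } J}
= \begin{bmatrix} \frac{\partial}{\partial r} & \frac{\partial}{\partial \theta_1} & \cdots & \frac{\partial}{\partial \theta_{d-1}} \end{bmatrix}$$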
As illustrated in Figure 19, J does not necessarily have determinant equal to one and changes as a function of position, so the norms of the basis vectors spanning the hyperspherical tangent space change based on position. More generally, this notion of distance is captured by the line element: the length of a line segment resulting from an infinitesimal change along the coordinate axes. The Cartesian line element is given by:
$$ds^2 = (dx^1)^2 + (dx^2)^2 + \dots + (dx^d)^2$$
while the hyperspherical line element is:
$$ds^2 = dr^2 + r^2 (d\theta_1)^2 + r^2 \sin^2\theta_1\, (d\theta_2)^2 + \dots + r^2 \prod_{i=1}^{d-2} \sin^2\theta_i\, (d\theta_{d-1})^2$$
which reflects that distance traveled by small changes in the hyperspherical coordinates "increases" with increasing radius and "decreases" with distance from the equator. To ensure that the norm of the basis vectors does not change during conversion, it is common to renormalize hyperspherical basis vectors to have unit norm for all points. However, a notion of norm is not defined a priori for hyperspherical vectors; the metric tensor imposed on this space defines the inner product which in turn defines a sense of arclength.
Using the induced metric from Cartesian coordinates, we can inherit the inner product from Cartesian coordinates on the hyperspherical coordinate system by expressing hyperspherical basis vectors as a linear combination of Cartesian basis vectors and then computing the norm of this resulting vector in the Cartesian tangent space:
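$$\left\|\frac{\partial}{\partial \theta_i}\right\| = \sqrt{\left\langle \frac{\partial}{\partial \theta_i}, \frac{\partial}{\partial \theta_i} \right\rangle} = \sqrt{\left(\sum_{k=1}^{d} \frac{\partial x^k}{\partial \theta_i}\frac{\partial}{\partial x^k}\right) \cdot \left(\sum_{j=1}^{d} \frac{\partial x^j}{\partial \theta_i}\frac{\partial}{\partial x^j}\right)} = \sqrt{\sum_{k=1}^{d} \frac{\partial x^k}{\partial \theta_i}\frac{\partial x^k}{\partial \theta_i}\, \frac{\partial}{\partial x^k} \cdot \frac{\partial}{\partial x^k}} = \sqrt{\sum_{k=1}^{d} \left(\frac{\partial x^k}{\partial \theta_i}\right)^2}$$
The first fundamental form gives us the normalization constants:
$$I = \begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
0 & r^2 & 0 & \cdots & 0 \\
0 & 0 & r^2\sin^2\theta_1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & r^2 \prod_{i=1}^{d-2}\sin^2\theta_i
\end{bmatrix}$$
as the diagonal represents the inner product $\langle \frac{\partial}{\partial \theta_i}, \frac{\partial}{\partial \theta_i} \rangle$, and we would like to renormalize each basis vector to have unit norm: $\left\|\frac{\partial}{\partial \theta_i}\right\| = \sqrt{\langle \frac{\partial}{\partial \theta_i}, \frac{\partial}{\partial \theta_i} \rangle}$. Therefore, our normalized hyperspherical basis vectors $\frac{\partial}{\partial \hat{r}}, \frac{\partial}{\partial \hat{\theta}_1}, \dots$ become:
$$\frac{\partial}{\partial \hat{r}} = \frac{\partial}{\partial r}, \qquad \frac{\partial}{\partial \hat{\theta}_i} = (I_{i+1,\,i+1})^{-\frac{1}{2}} \frac{\partial}{\partial \theta_i}$$
where the first row and column of I correspond to r. Using our convention from earlier, we can now compute the transformation from Cartesian basis vectors to normalized hyperspherical basis vectors:
$$\frac{\partial}{\partial \hat{\theta}_i} = (I_{i+1,\,i+1})^{-\frac{1}{2}} \sum_{k=1}^{d} \frac{\partial x^k}{\partial \theta_i}\frac{\partial}{\partial x^k}$$
to compose the normalized "Jacobian" $\hat{J}$:
$$\begin{bmatrix} \frac{\partial}{\partial x^1} & \frac{\partial}{\partial x^2} & \cdots & \frac{\partial}{\partial x^d} \end{bmatrix}
\underbrace{\begin{bmatrix}
\frac{\partial x^1}{\partial \hat{r}} & \frac{\partial x^1}{\partial \hat{\theta}_1} & \cdots & \frac{\partial x^1}{\partial \hat{\theta}_{d-1}} \\
\frac{\partial x^2}{\partial \hat{r}} & \frac{\partial x^2}{\partial \hat{\theta}_1} & \cdots & \frac{\partial x^2}{\partial \hat{\theta}_{d-1}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial x^d}{\partial \hat{r}} & \frac{\partial x^d}{\partial \hat{\theta}_1} & \cdots & \frac{\partial x^d}{\partial \hat{\theta}_{d-1}}
\end{bmatrix}}_{\hat{J} \in SO(d)}
= \begin{bmatrix} \frac{\partial}{\partial \hat{r}} & \frac{\partial}{\partial \hat{\theta}_1} & \cdots & \frac{\partial}{\partial \hat{\theta}_{d-1}} \end{bmatrix} \tag{1}$$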
Rescaling the hyperspherical basis vectors to have unit norm at all points turns the matrix J into $\hat{J}$, an orthogonal matrix with determinant equal to one. Such d × d matrices form the group SO(d), which represents the set of d-dimensional rotations about the origin. Similarly, the backwards change-of-basis $\hat{J}^{-1} = \hat{J}^T$ converts vectors in hyperspherical coordinates to Cartesian coordinates.
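The orthogonality of the normalized Jacobian can be verified for the three-dimensional case. The sketch below assumes the coordinate convention $x = (r\cos\theta_1,\ r\sin\theta_1\cos\theta_2,\ r\sin\theta_1\sin\theta_2)$ from the conversion outlined earlier and an arbitrary evaluation point with $\sin\theta_1 > 0$. The function name `jacobian` is hypothetical.

```python
import numpy as np

def jacobian(r, t1, t2):
    """Columns are dx/dr, dx/dtheta_1, dx/dtheta_2 for
    x = (r cos t1, r sin t1 cos t2, r sin t1 sin t2)."""
    return np.array([
        [np.cos(t1),              -r * np.sin(t1),               0.0],
        [np.sin(t1) * np.cos(t2),  r * np.cos(t1) * np.cos(t2), -r * np.sin(t1) * np.sin(t2)],
        [np.sin(t1) * np.sin(t2),  r * np.cos(t1) * np.sin(t2),  r * np.sin(t1) * np.cos(t2)],
    ])

r, t1, t2 = 2.5, 0.9, -1.3          # illustrative point with sin(t1) > 0
J = jacobian(r, t1, t2)

# First fundamental form: inner products of the (unnormalized) basis vectors.
first_fundamental_form = J.T @ J

# Normalized Jacobian: divide each column by the norm of its basis vector.
J_hat = J / np.sqrt(np.diag(first_fundamental_form))

print(np.round(first_fundamental_form, 6))
print(np.round(J_hat.T @ J_hat, 6), np.linalg.det(J_hat))
```

The first printed matrix should match diag(1, r², r² sin²θ₁), and the second should be the identity with determinant one.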
As a result, vectors from the tangent space at p in Cartesian coordinates simply rotate to convert to the normalized tangent space at $\tilde{p}$ in hyperspherical coordinates. Specifically, for a point $\tilde{p} = (r, \theta_1, \theta_2, \dots, \theta_{d-1})$ and a vector $\tilde{v} = \tilde{c}_1 \hat{r} + \tilde{c}_2 \hat{\theta}_1 + \dots + \tilde{c}_d \hat{\theta}_{d-1}$, converting $v = c_1 x^1 + \dots + c_d x^d$ from Cartesian to hyperspherical coordinates is the transformation $\tilde{v} = \hat{J}^T v$, where $\hat{J}$ operates on vector v, i.e. $\hat{J}v$, by first rotating by angle $\tilde{c}_2$ in the $x^1$-$x^2$ plane (i.e. the $\theta_1$ axis of rotation), then by angle $\tilde{c}_3$ in the $x^2$-$x^3$ plane (i.e. the $\theta_2$ axis of rotation), and so on until a final rotation by angle $\tilde{c}_d$ in the $x^{d-1}$-$x^d$ plane (i.e. the $\theta_{d-1}$ axis of rotation). Composing these rotations together leads to a rotation from $\tilde{p}_0 = (1, 0, 0, \dots, 0)$ to $\tilde{p}$:
$$\hat{J}v = (R_{\tilde{p}_0 \to \tilde{p}})\,v = \left(R^{x^{d-1}\text{-}x^{d}}_{\theta_{d-1}} \cdots R^{x^2\text{-}x^3}_{\theta_2} R^{x^1\text{-}x^2}_{\theta_1}\right) v$$
$$\hat{J}^{-1}v = \hat{J}^T v = (R_{\tilde{p}_0 \to \tilde{p}})^T v = \left(R^{x^{d-1}\text{-}x^{d}}_{\theta_{d-1}} \cdots R^{x^2\text{-}x^3}_{\theta_2} R^{x^1\text{-}x^2}_{\theta_1}\right)^T v = R_{\tilde{p} \to \tilde{p}_0}\, v$$
where we define $R_{\tilde{a} \to \tilde{b}}$ to be the rotation from $\tilde{a}$ to $\tilde{b}$ as described above and $R^{x^i\text{-}x^j}_{\theta_i}$ to be the rotation by angle $\theta_i$ in the $x^i$-$x^j$ plane. Important for our later discussion on the rotation trick, this rotational characteristic causes a fixed vector moved along a curve in hyperspherical coordinates to rotate in Cartesian coordinates.
Remark 4. Using the renormalized transformation in Equation (1), a constant vector field $\tilde{v}$ in hyperspherical coordinates corresponds to a rotated vector field in Cartesian coordinates.
Proof. At Cartesian point p and corresponding hyperspherical point p:
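$$v_p^T \left[R^{x^{d-1}\text{-}x^{d}}_{\theta_{d-1}} \cdots R^{x^2\text{-}x^3}_{\theta_2} R^{x^1\text{-}x^2}_{\theta_1}\right] = \tilde{v}_{\tilde{p}}^T$$
$$\left[R^{x^{d-1}\text{-}x^{d}}_{\theta_{d-1}} \cdots R^{x^2\text{-}x^3}_{\theta_2} R^{x^1\text{-}x^2}_{\theta_1}\right]^T v_p = \tilde{v}_{\tilde{p}}$$
$$\left[R_{\tilde{p} \to \tilde{p}_0}\right] v_p = \tilde{v}_{\tilde{p}}$$
so a constant vector field $\tilde{v}$ in hyperspherical coordinates will correspond to a Cartesian vector field where each vector at point p is rotated by the rotation that aligns $\tilde{p}$ to $\tilde{p}_0$.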
Another important characteristic relates to the metric tensor with normalized hyperspherical basis vectors. We can explicitly compute the induced metric in hyperspherical coordinates in terms of our renormalized basis vectors. Writing diag(·) for the d × d matrix with the listed diagonal entries and zeros elsewhere:
$$\begin{aligned}
\hat{I} &= \operatorname{diag}\left(\frac{\partial}{\partial \hat{r}} \cdot \frac{\partial}{\partial \hat{r}},\; \frac{\partial}{\partial \hat{\theta}_1} \cdot \frac{\partial}{\partial \hat{\theta}_1},\; \frac{\partial}{\partial \hat{\theta}_2} \cdot \frac{\partial}{\partial \hat{\theta}_2},\; \dots,\; \frac{\partial}{\partial \hat{\theta}_{d-1}} \cdot \frac{\partial}{\partial \hat{\theta}_{d-1}}\right) \\
&= \operatorname{diag}\left((I_{11})^{-1} \frac{\partial}{\partial r} \cdot \frac{\partial}{\partial r},\; (I_{22})^{-1} \frac{\partial}{\partial \theta_1} \cdot \frac{\partial}{\partial \theta_1},\; (I_{33})^{-1} \frac{\partial}{\partial \theta_2} \cdot \frac{\partial}{\partial \theta_2},\; \dots,\; (I_{dd})^{-1} \frac{\partial}{\partial \theta_{d-1}} \cdot \frac{\partial}{\partial \theta_{d-1}}\right) \\
&= \operatorname{diag}\left((I_{11})^{-1}(I_{11}),\; (I_{22})^{-1}(I_{22}),\; (I_{33})^{-1}(I_{33}),\; \dots,\; (I_{dd})^{-1}(I_{dd})\right) \\
&= \operatorname{diag}(1, 1, 1, \dots, 1)
\end{aligned} \tag{2}$$
which yields the identity matrix. This is perhaps unsurprising: we normalize basis vectors so that $\frac{\partial}{\partial \hat{\theta}_i} \cdot \frac{\partial}{\partial \hat{\theta}_i} = 1$.
Another way to view the renormalized tangent plane transformation is as a change-of-basis in the Cartesian coordinate system, with the basis vectors spanning each tangent space in Cartesian coordinates rotated to align with the directions of hyperspherical basis vectors. Rotating the tangent space at each point in Cartesian coordinates does not change the Euclidean metric tensor, since $R^T I R = I$, so it remains the identity. These two formulations are equivalent: renormalizing the basis vectors in the hyperspherical tangent space to have unit norm corresponds to rotating the basis vectors in the Cartesian coordinate system to align with the hyperspherical basis vectors at all points.
A.12.2 STE AS PARALLEL TRANSPORT
From the description of the STE in Bengio et al. (2013), a gradient vector $\nabla_q L$ is transported from q to e during the backwards pass in such a way that its direction and magnitude are preserved. Critically, the curve along which $\nabla_q L$ is transported is not specified; the effect is to simply "copy-and-paste" the vector from q to e.
To use the machinery of calculus, we assume that $\nabla_q L$ is transported from q to e along any smooth curve γ(t) running from q to e. Along this curve, we define the transport of $\nabla_q L$ at position γ(t) simply as $\nabla_q L$ to emulate how the STE would move $\nabla_q L$ from q to γ(t). Therefore, the direction and magnitude of $\nabla_q L$ do not change along the curve γ(t). An example of this transport is visualized in Figure 20, and in Remark 5, we show this formulation is equivalent to the parallel transport of $\nabla_q L$ along any curve γ(t) from q to e with the Levi-Civita connection.
Remark 5. The Straight Through Estimator (STE) is equivalent to the parallel transport of $\nabla_q L$ along any curve connecting q to e with the identity metric tensor in Cartesian coordinates using the Levi-Civita connection.
... basis vectors of the tangent plane change along γ(t) to remain "parallel" along the curve:
$$\nabla_{\dot{\gamma}(t)} v = \vec{0} \qquad \text{(Parallel Transport Condition)}$$
Using the identity metric tensor:
$$g_{ij} = \delta_{ij} = \begin{cases} 0 & \text{if } i \neq j \\ 1 & \text{if } i = j \end{cases}$$
with the Levi-Civita connection will result in all zero Christoffel symbols:
$$\Gamma^m_{ij} = \frac{1}{2} g^{mk}\left(\frac{\partial g_{jk}}{\partial x^i} + \frac{\partial g_{ik}}{\partial x^j} - \frac{\partial g_{ij}}{\partial x^k}\right) = 0$$
where $g^{mk}$ is the (m, k) entry of the inverse metric tensor. Computing the covariant derivative for a general curve γ(t) and considering the i-th term in the resulting summation:
$$\begin{aligned}
0 = (\nabla_{\dot{\gamma}(t)} v)_i &= \dot{\gamma}^i \frac{\partial}{\partial x^i}\left(v^1 e_1 + v^2 e_2 + \dots + v^d e_d\right) \\
&= \dot{\gamma}^i \frac{\partial}{\partial x^i}\left(v^k e_k\right) \\
&= \dot{\gamma}^i \left(\frac{\partial v^k}{\partial x^i} e_k + v^k \frac{\partial e_k}{\partial x^i}\right) \\
&= \dot{\gamma}^i \left(\frac{\partial v^k}{\partial x^i} e_k + v^k \Gamma^m_{ik} e_m\right) \\
&= \dot{\gamma}^i \frac{\partial v^k}{\partial x^i} e_k
\end{aligned}$$
For this equation to hold for an arbitrary γ(t), $\frac{\partial v^k}{\partial x^i} = 0$ for $1 \le k, i \le d$. Therefore, $v^k$ must be a constant, and vector fields along curves must be constant to satisfy the parallel transport criteria.
Pulling this back to the STE, holding ∇ q L constant along the curve γ(t) from q to e results in a constant vector field along γ(t). The covariant derivative of this vector field is zero, and therefore the STE parallel transports ∇ q L from q to e.

Section: A.12.3 THE ROTATION TRICK AS PARALLEL TRANSPORT
In this section, we analyze the rotation trick through the lens of geometry. As in Appendix A.12.2, we extend the rotation trick to any smooth curve γ(t) connecting q to e and define the transport of ∇ q L at γ(t) as the rotation trick applied to move ∇ q L from q to γ(t). This definition allows us to use the structure of calculus, without imposing any prohibitive restrictions on the path taken from q to e.
To build visual intuition, Figure 21 illustrates how the rotation trick transforms an initial vector along three different curves $\gamma_1, \gamma_2, \gamma_3$ in both Cartesian coordinates and hyperspherical coordinates with normalized basis vectors. In Cartesian coordinates, the rotation trick changes the components of the basis vectors during transport to follow a rotation; however in normalized hyperspherical coordinates, the components of this vector during transport are constant because the basis vectors themselves rotate.
Remark 6. The rotation trick is equivalent to the parallel transport of $\nabla_q L$ along any curve connecting q to e with the induced metric in hyperspherical coordinates with the normalized transformation described in Equation (1) using the Levi-Civita connection.
Proof. From Equation (2), the metric tensor in hyperspherical coordinates with normalized basis vectors (equivalently, the Cartesian coordinate system with each tangent space rotated to align with the hyperspherical frame at every point) is the identity. Therefore, using the Levi-Civita connection leads to zero Christoffel symbols, and the parallel transport of a vector along any curve keeps the vector constant.
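We define $T_p C$ as the tangent space of the Cartesian coordinate system at point p and $T_{\tilde{p}} H$ as the tangent space of the hyperspherical coordinate system with normalized basis vectors at point $\tilde{p}$. It remains to show that for a vector $\nabla_q L \in T_q C$ and corresponding $\widetilde{\nabla_q L} \in T_{\tilde{q}} H$, the transformation of $\widetilde{\nabla_q L} \in T_{\tilde{e}} H$ to $T_e C$ will yield $R_{q \to e} \nabla_q L$ where $R_{q \to e}$ is the rotation trick's transformation, i.e. the rotation that rotates q to e.
For a vector $\widetilde{\nabla_q L}$ in hyperspherical coordinates at point $\tilde{q} = (1, \theta_1, \theta_2, \dots, \theta_{d-1})$ and using the normalized change-of-basis in Equation (1), the corresponding vector $\nabla_q L$ in Cartesian coordinates is:
$$\begin{aligned}
\nabla_q L^T &= \widetilde{\nabla_q L}^T \hat{J}_{\tilde{q}}^{-1} \\
\nabla_q L &= \hat{J}_{\tilde{q}}\, \widetilde{\nabla_q L} = \left[R_{\tilde{p}_0 \to \tilde{q}}\right] \widetilde{\nabla_q L} = R_{\theta_{d-1}} R_{\theta_{d-2}} \cdots R_{\theta_1} \widetilde{\nabla_q L}
\end{aligned}$$
and the corresponding vector $\widetilde{\nabla_q L}$ at point $\tilde{e}$ is:
$$\begin{aligned}
\nabla_e L^T &= \widetilde{\nabla_q L}^T \hat{J}_{\tilde{e}}^{-1} \\
\nabla_e L &= \hat{J}_{\tilde{e}}\, \widetilde{\nabla_q L} = \left[R_{\tilde{p}_0 \to \tilde{e}}\right] \widetilde{\nabla_q L} = \left[R_{\tilde{q} \to \tilde{e}} R_{\tilde{p}_0 \to \tilde{q}}\right] \widetilde{\nabla_q L} = R_{\tilde{q} \to \tilde{e}} \left(R_{\tilde{p}_0 \to \tilde{q}} \widetilde{\nabla_q L}\right) = \left[R_{\tilde{q} \to \tilde{e}}\right] \nabla_q L
\end{aligned}$$
which is exactly how the rotation trick transforms the vector. Informally, "copy-and-pasting" the vector $\widetilde{\nabla_q L}$ from $\tilde{q}$ to $\tilde{e}$ in hyperspherical coordinates with normalized basis vectors corresponds to rotating $\nabla_q L$ by the rotation that aligns q to e in Cartesian coordinates.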
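The correspondence in Remark 6 can be illustrated in two dimensions, where hyperspherical coordinates reduce to polar coordinates and the normalized Jacobian at angle θ is the rotation by θ. The sketch below is only an illustration: the angles, radii, and gradient are arbitrary, `polar_frame` and `rotation` are hypothetical helper names, and only the rotational component of the rotation trick is compared (no rescaling).

```python
import numpy as np

def polar_frame(theta):
    """Normalized Jacobian in 2D: columns are r_hat and theta_hat at angle theta."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

def rotation(e, q):
    """Rotation aligning e/|e| with q/|q| (closed form from Appendix A.5)."""
    e_unit = e / np.linalg.norm(e)
    q_unit = q / np.linalg.norm(q)
    r = (e_unit + q_unit) / np.linalg.norm(e_unit + q_unit)
    return np.eye(2) - 2.0 * np.outer(r, r) + 2.0 * np.outer(q_unit, e_unit)

theta_q, theta_e = 1.1, 0.3     # illustrative polar angles of q and e
q = 1.0 * np.array([np.cos(theta_q), np.sin(theta_q)])
e = 0.7 * np.array([np.cos(theta_e), np.sin(theta_e)])
grad_q = np.array([0.4, -0.9])  # gradient at q in Cartesian coordinates

# Express the gradient in the normalized polar frame at q, keep its components
# fixed (parallel transport with the identity metric), and read it back in the
# frame at e.
components = polar_frame(theta_q).T @ grad_q
transported = polar_frame(theta_e) @ components

# Rotational part of the rotation trick: gradient at e is R^T grad_q.
rotation_trick = rotation(e, q).T @ grad_q

print(transported, rotation_trick)
```

The two printed vectors should agree, since holding the polar components fixed amounts to rotating the Cartesian vector by the angle from q to e.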
In summary, we consider a geometry where the tangent space is spanned by unit norm basis vectors $\frac{\partial}{\partial \hat{r}}, \frac{\partial}{\partial \hat{\theta}_1}, \dots, \frac{\partial}{\partial \hat{\theta}_{d-1}}$ that match the direction of the typical hyperspherical basis vectors $\frac{\partial}{\partial r}, \frac{\partial}{\partial \theta_1}, \dots, \frac{\partial}{\partial \theta_{d-1}}$. The induced metric tensor is the identity, so the parallel transport of a vector along any curve holds its components constant. Converting a vector $\nabla_q L$ to this tangent space via the normalized transformation in Equation (1), parallel transporting the resulting vector from $\tilde{q}$ to $\tilde{e}$, and then converting it back to Cartesian coordinates corresponds exactly to the rotation trick's transformation. This is a remarkably simple result; the rotation trick and the STE can be viewed as the same operation. Both parallel transport the gradient $\nabla_q L$ from q to e in a path-independent manner with the Euclidean metric. The only difference is the coordinate system where parallel transport occurs. The STE employs the Cartesian coordinate system while the rotation trick uses the hyperspherical coordinate system with normalized basis vectors.

Section: ACKNOWLEDGMENTS
We thank Henry Bosch, Benjamin Spector, Dan Biderman, Jordan Juravsky, Mayee Chen, Owen Dugan, Sabri Eyuboglu, and the Hazy Group as a whole for their invaluable feedback and help during revisions of this work. We gratefully acknowledge the support of NIH under No. U54EB020405 (Mobilize), NSF under Nos. CCF2247015 (Hardware-Aware), CCF1763315 (Beyond Sparsity), CCF1563078 (Volume to Velocity), and 1937301 (RTML); US DEVCOM ARL under Nos. W911NF-23-2-0184 (Long-context) and W911NF-21-2-0251 (Interactive Human-AI Teaming); ONR under Nos. N000142312633 (Deep Signal Processing); Stanford HAI under No. 247183; NXP, Xilinx, LETI-CEA, Intel, IBM, Microsoft, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, Google Cloud, Salesforce, Total, the HAI-GCP Cloud Credits for Research program, the Stanford Data Science Initiative (SDSI), and members of the Stanford DAWN project: Meta, Google, and VMWare. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of NIH, ONR, or the U.S. Government.

Section:  
… Table 2 and Table 3. ROT is an abbreviation for the rotation trick.
Table 9: Hyperparameters for the experiments in Table 2 and Table 3. We implement the rotation trick in the open source https://github.com/CompVis/taming-transformers for the experiments in Table 2 and implement the rotation trick in https://github.com/CompVis/latent-diffusion for Table 3. In both settings, we use the default hyperparameters. †: 18 epochs for ImageNet and 50 epochs for FFHQ & CelebA-HQ.

We also visualize the reconstructions for the TimeSformer-VQGAN trained with the rotation trick and the STE. Figure 17 shows the reconstructions for BAIR Robot Pushing, and Figure 18 shows the …

In contrast, VQ-VAEs trained with the rotation trick do not manifest this training instability. Instead, codebook usage is relatively high (43% for BAIR Robot Pushing and 30% for UCF101), and the reconstructions accurately match the input, even though both encoder and decoder are very small video models.

Figure panel titles: Original Video, Rotation Trick Reconstructions, STE Reconstructions


Section: A.10.5 CREATION OF VORONOI REGION FIGURE
In this section, we describe the creation of Figure 4 as well as the other figures that use this format. For the top-right and bottom partitions, we fix the codebook to a set of preset values and sample pre-quantized points from four different Gaussian distributions. For the pre-quantized points in the top-left partition, we manually set them to form a crescent shape around the codeword. We similarly fix constant gradient vectors for each partition and apply them to the pre-quantized points after transformation by the STE, i.e. simply moving the gradient to each pre-quantized point in the quantized region, or by the rotation trick, i.e. rotating the gradient based on the angle between the pre-quantized point and closest codebook vector and rescaling appropriately. We multiply the gradient by a small constant (the learning rate) and then apply the gradient to each pre-quantized point. We repeat the above 25 times, at each step re-computing the angle and magnitude between the pre-quantized point and the codebook vector for the rotation trick update. For simplicity, we do not update the codebook vectors themselves or recompute codebook regions throughout the numerical simulation.
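The procedure described above can be sketched for a single codebook region in two dimensions. This is an illustrative sketch under stated assumptions rather than the code used to produce the figures: the codebook vector, the Gaussian spread, the gradient, the learning rate of 0.02, and the descent sign convention are all hypothetical choices, and `rotation_trick_gradient` and `simulate` are names introduced here.

```python
import numpy as np

def rotation_trick_gradient(grad_q, e, q):
    """Rotate grad_q by the angle from q to e, then rescale by ||q|| / ||e|| (2D only)."""
    phi = np.arctan2(e[1], e[0]) - np.arctan2(q[1], q[0])
    rot = np.array([[np.cos(phi), -np.sin(phi)],
                    [np.sin(phi),  np.cos(phi)]])
    return (np.linalg.norm(q) / np.linalg.norm(e)) * (rot @ grad_q)

def simulate(points, q, grad_q, lr=0.02, steps=25, use_rotation=True):
    """Apply `steps` updates to every point; the codebook vector q stays fixed throughout."""
    points = points.copy()
    for _ in range(steps):
        for i in range(len(points)):
            # The angle and magnitude are recomputed from the current point at every step.
            g = rotation_trick_gradient(grad_q, points[i], q) if use_rotation else grad_q
            points[i] = points[i] - lr * g  # descent step (sign convention is an assumption)
    return points

rng = np.random.default_rng(0)
q = np.array([1.0, 1.0])                         # hypothetical codebook vector
points = q + 0.2 * rng.standard_normal((16, 2))  # pre-quantized points sampled near q
grad_q = np.array([-1.0, 0.5])                   # hypothetical constant gradient at q

ste_points = simulate(points, q, grad_q, use_rotation=False)
rot_points = simulate(points, q, grad_q, use_rotation=True)
```

In this sketch every point receives the same displacement under the STE, while under the rotation trick the displacement depends on where each point sits relative to the codebook vector.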

Section: A.11 COMPARISON WITHIN GENERATIVE MODELING APPLICATIONS
Absent from our work is an analysis of the effect of VQ-VAEs trained with the rotation trick on downstream generative modeling applications. We see this comparison as outside the scope of this


References:
[b0] Alexei Baevski; Steffen Schneider; Michael Auli (2019). vq-wav2vec: Self-supervised learning of discrete speech representations. 
[b1] Jonathan Baxter (2000). A model of inductive bias learning. Journal of artificial intelligence research
[b2] Yoshua Bengio; Nicholas Léonard; Aaron Courville (2013). Estimating or propagating gradients through stochastic neurons for conditional computation. 
[b3] Gedas Bertasius; Heng Wang; Lorenzo Torresani (2021). Is space-time attention all you need for video understanding. ICML
[b4] Tim Brooks; Bill Peebles; Connor Holmes; Will Depue; Yufei Guo; Li Jing; David Schnurr; Joe Taylor; Troy Luhman; Eric Luhman; Clarence Ng; Ricky Wang; Aditya Ramesh (2024). Video generation models as world simulators. 
[b5] Huiwen Chang; Han Zhang; Lu Jiang; Ce Liu; William T Freeman (2022). Maskgit: Masked generative image transformer. 
[b6] Hang Chen; Sankepally Sainath Reddy; Ziwei Chen; Dianbo Liu (2024). Balance of number of embedding and their dimensions in vector quantization. 
[b7] Zhao Chen; Vijay Badrinarayanan; Chen-Yu Lee; Andrew Rabinovich (2018). Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. PMLR
[b8] Chung-Cheng Chiu; James Qin; Yu Zhang; Jiahui Yu; Yonghui Wu (2022). Self-supervised learning with random-projection quantizer for speech recognition. PMLR
[b9] Thomas M Cover (1999). Elements of information theory. John Wiley & Sons
[b10] Mathieu Dagréou; Pierre Ablin; Samuel Vaiter; Thomas Moreau (2024). How to compute Hessian-vector products? In ICLR Blogposts. 
[b11] Jia Deng; Wei Dong; Richard Socher; Li-Jia Li; Kai Li; Li Fei-Fei (2009). Imagenet: A large-scale hierarchical image database. IEEE
[b12] Prafulla Dhariwal; Heewoo Jun; Christine Payne; Jong Wook Kim; Alec Radford; Ilya Sutskever (2020). Jukebox: A generative model for music. 
[b13] Xiaoyi Dong; Jianmin Bao; Ting Zhang; Dongdong Chen; Weiming Zhang; Lu Yuan; Dong Chen; Fang Wen; Nenghai Yu; Baining Guo (2023). Peco: Perceptual codebook for bert pre-training of vision transformers. 
[b14] Alexey Dosovitskiy (2020). An image is worth 16x16 words: Transformers for image recognition at scale. 
[b15] Frederik Ebert; Chelsea Finn; Alex X Lee; Sergey Levine (2017). Self-supervised visual planning with temporal skip connections. CoRL
[b16] Patrick Esser; Robin Rombach; Bjorn Ommer (2021). Taming transformers for high-resolution image synthesis. 
[b17] Tanmay Gautam; Reid Pryzant; Ziyi Yang; Chenguang Zhu; Somayeh Sojoudi (2023). Soft convex quantization: Revisiting vector quantization with convex optimization. 
[b18] Nabarun Goswami; Yusuke Mukuta; Tatsuya Harada (2024). Hypervq: Mlr-based vector quantization in hyperbolic space. 
[b19] Robert Gray (1984). Vector quantization. IEEE ASSP Magazine
[b20] Martin Heusel; Hubert Ramsauer; Thomas Unterthiner; Bernhard Nessler; Sepp Hochreiter (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems
[b21] Geoffrey E Hinton; Richard Zemel (1993). Autoencoders, minimum description length and helmholtz free energy. Advances in neural information processing systems
[b22] Mengqi Huang; Zhendong Mao; Zhuowei Chen; Yongdong Zhang (2023). Towards accurate image coding: Improved autoregressive image generation with dynamic vector quantization. 
[b23] Minyoung Huh; Brian Cheung; Pulkit Agrawal; Phillip Isola (2023). Straightening out the straight-through estimator: Overcoming optimization challenges in vector quantized networks. PMLR
[b24] Phillip Isola; Jun-Yan Zhu; Tinghui Zhou; Alexei A Efros (2017). Image-to-image translation with conditional adversarial networks. 
[b25] Eric Jang; Shixiang Gu; Ben Poole (2016). Categorical reparameterization with gumbel-softmax. 
[b26] Justin Johnson; Alexandre Alahi; Li Fei-Fei (2016). Perceptual losses for real-time style transfer and super-resolution. Springer
[b27] Tero Karras (2017). Progressive growing of gans for improved quality, stability, and variation. 
[b28] Tero Karras; Samuli Laine; Timo Aila (2019). A style-based generator architecture for generative adversarial networks. 
[b29] Alex Kendall; Yarin Gal; Roberto Cipolla (2018). Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. 
[b30] Diederik P Kingma; Max Welling (2013). Auto-encoding variational bayes. 
[b31] Alexander Kolesnikov; André Susano Pinto; Lucas Beyer; Xiaohua Zhai; Jeremiah Harmsen; Neil Houlsby (2022). Uvim: A unified modeling approach for vision with learned guiding codes. Advances in Neural Information Processing Systems
[b32] Adrian Łańcucki; Jan Chorowski; Guillaume Sanchez; Ricard Marxer; Nanxin Chen; Hans JGA Dolfing; Sameer Khurana; Tanel Alumäe; Antoine Laurent (2020). Robust training of vector quantized bottleneck models. IEEE
[b33] Doyup Lee; Chiheon Kim; Saehoon Kim; Minsu Cho; Wook-Shin Han (2022). Autoregressive image generation using residual quantization. 
[b34] Fabian Mentzer; David Minnen; Eirikur Agustsson; Michael Tschannen (2023). Finite scalar quantization: Vq-vae made simple. 
[b35] Robin Rombach; Andreas Blattmann; Dominik Lorenz; Patrick Esser; Björn Ommer (2022). High-resolution image synthesis with latent diffusion models. 
[b36] Tim Salimans; Ian Goodfellow; Wojciech Zaremba; Vicki Cheung; Alec Radford; Xi Chen (2016). Improved techniques for training gans. Advances in neural information processing systems
[b37] Khurram Soomro (2012). Ucf101: A dataset of 101 human actions classes from videos in the wild. 
[b38] Yuhta Takida; Takashi Shibuya; Weihsiang Liao; Chieh-Hsin Lai; Junki Ohmura; Toshimitsu Uesaka; Naoki Murata; Shusuke Takahashi; Toshiyuki Kumakura; Yuki Mitsufuji (2022). Sq-vae: Variational bayes on discrete representation with self-annealed stochastic quantization. 
[b39] Thomas Unterthiner; Sjoerd Van Steenkiste; Karol Kurach; Raphael Marinier; Marcin Michalski; Sylvain Gelly (2018). Towards accurate generative models of video: A new metric & challenges. 
[b40] Aaron Van Den Oord; Oriol Vinyals (2017). Neural discrete representation learning. Advances in neural information processing systems
[b41] Ashish Vaswani (2017). Attention is all you need. 
[b42] Wilson Yan; Yunzhi Zhang; Pieter Abbeel; Aravind Srinivas (2021). Videogpt: Video generation using vq-vae and transformers. 
[b43] Jiahui Yu; Xin Li; Jing Yu Koh; Han Zhang; Ruoming Pang; James Qin; Alexander Ku; Yuanzhong Xu; Jason Baldridge; Yonghui Wu (2021). Vector-quantized image modeling with improved vqgan. 
[b44] Lijun Yu; José Lezama; Nitesh B Gundavarapu; Luca Versari; Kihyuk Sohn; David Minnen; Yong Cheng; Agrim Gupta; Xiuye Gu; Alexander G Hauptmann (2023). Language model beats diffusion - tokenizer is key to visual generation. 
[b45] Jiahui Zhang; Fangneng Zhan; Christian Theobalt; Shijian Lu (2023). Regularized vector quantization for tokenized image synthesis. 
[b46] Yue Zhao; Yuanjun Xiong; Philipp Krähenbühl (2024). Image and video tokenization with binary spherical quantization. 
[b47] Chuanxia Zheng; Andrea Vedaldi (2023). Online clustered codebook. 
[b48] Zixin Zhu; Xuelu Feng; Dongdong Chen; Jianmin Bao; Le Wang; Yinpeng Chen; Lu Yuan; Gang Hua (2023). Designing a better asymmetric vqgan for stablediffusion. 

Figures:
Figure fig_0: 2
Type: figure
Caption: Figure 2: Visualization of how the straight-through estimator (STE) transforms the gradient field for 16 codebook vectors for (top) $f(x, y) = x^2 + y^2$ and (bottom) $f(x, y) = \log\left|\frac{1}{2}x + \tanh(y)\right|$. The STE takes the gradient at the codebook vector $(q_x, q_y)$ and "copies-and-pastes" it to all other locations within the same codebook region, forming a "checker-board" pattern in the gradient field. … function $Q(\cdot)$. We can break down the backward pass into three terms: $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial q} \frac{\partial q}{\partial e} \frac{\partial e}{\partial x}$
Data: 

Figure fig_1: 3
Type: figure
Caption: Figure 3: Illustration of how the gradient at q
Data: 

Figure fig_2: 4
Type: figure
Caption: Figure 4: Depiction of how points within the same codebook region change after a gradient update (red arrow) at the codebook vector (orange circle). The STE applies the same update to each point in the same region. The rotation trick modifies the update based on the location of each point with respect to the codebook vector.
Data: 

Figure fig_3: 5
Type: figure
Caption: Figure 5: With the STE, the distances among points within the same region do not change. However, with the rotation trick, the distances among points do change. When ϕ < π/2, points with large angular distance are pushed away (blue: increasing distance). When ϕ > π/2, points are pulled towards the codebook vector (green: decreasing distance).
Data: 

Figure fig_4: 7
Type: figure
Caption: Figure 7: Loss surface for Himmelblau's function. Himmelblau's function has four equal local minima: f(3.0, 2.0) = 0.0, f(-2.8..., 3.1...) = 0.0, f(-3.7..., -3.2...) = 0.0, and f(3.5..., -1.8...) = 0.0.
Data: 
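The four minima listed in the Figure 7 caption can be checked against the standard textbook definition of Himmelblau's function, which the caption does not restate. The sketch below assumes that definition and uses the commonly published six-decimal approximations of the three non-integer minima; those decimals are not taken from this paper.

```python
def himmelblau(x, y):
    """Standard definition of Himmelblau's function (assumed; not restated in the caption)."""
    return (x**2 + y - 11.0) ** 2 + (x + y**2 - 7.0) ** 2

# Approximate locations of the four minima (commonly published values).
minima = [
    (3.0, 2.0),
    (-2.805118, 3.131312),
    (-3.779310, -3.283186),
    (3.584428, -1.848126),
]

values = [himmelblau(x, y) for x, y in minima]
for (x, y), value in zip(minima, values):
    print(f"f({x:.6f}, {y:.6f}) = {value:.2e}")
```

Each printed value is zero up to the rounding of the coordinates, in line with the caption's statement that the four local minima share the same objective value.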

Figure fig_5: 8
Type: figure
Caption: Figure 8 visualizes our results after 33, 66, and 100 gradient updates. The orange circles represent codebook vectors, the green dots the initial points, and the blue dots the updated points. Contour lines are drawn in each diagram to indicate regions of equal loss, with blue representing regions of low loss and red indicating regions of high loss. Similar to our findings in Section 5, we see that the rotation trick clusters points more tightly around each codebook vector when compared to the STE, resulting in lower distortion. Moreover, the codebook vectors more rapidly converge to the four equal local minima in Himmelblau's function, resulting in a lower objective function value when averaged across all points.
Data: 

Figure fig_6: 8
Type: figure
Caption: Figure 8: Synthetic experiment for minimizing Himmelblau's function with vector quantization using the STE gradient estimator (top row) and the rotation trick (bottom row). The rotation trick more quickly converges to these minima and achieves substantively lower distortion between codewords and pre-quantized points.
Data: 

Figure fig_7: 9
Type: figure
Caption: Figure 9: Examples of how the gradient can change due to the presence of negative curvature or an indefinite
Data: 

Figure fig_8: 10
Type: figure
Caption: Figure 10: Depiction of how points within the same codebook region change after a gradient update (red arrow)
Data: 

Figure fig_9: 1
Type: figure
Caption: Remark 1. Let $a, b \in \mathbb{R}^d$ define hyperplanes $a^\perp$ and $b^\perp$ respectively. Then a reflection across $a^\perp$ followed by a reflection across $b^\perp$ is a rotation of $2\theta$ in the plane spanned by $a, b$, where $\theta$ is the angle between $a$ and $b$. Remark 2. Let $a, b \in \mathbb{R}^d$ with $\lVert a \rVert = \lVert b \rVert = 1$. Define $c = \frac{a + b}{\lVert a + b \rVert}$ as the vector half-way between $a$ and $b$ so that $\angle(a, b) = \theta$ and $\angle(a, c) = \angle(b, c) = \frac{\theta}{2}$. From Definition 1, $(I - 2cc^T)$ encodes a reflection across $c^\perp$ and $(I - 2bb^T)$ encodes a reflection across $b^\perp$. From Remark 1, $(I - 2bb^T)(I - 2cc^T)$ then corresponds to a rotation of $2(\frac{\theta}{2}) = \theta$ in the plane spanned by $b$ and $c$. As $\mathrm{span}(b, c) = \mathrm{span}(a, b)$, $(I - 2bb^T)(I - 2cc^T)$ corresponds to a rotation of $\theta$ in the plane spanned by $a$ and $b$. Therefore, $(I - 2bb^T)(I - 2cc^T)a = b$.
Data: 
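Remark 2 lends itself to a quick numerical check. The sketch below builds the two Householder reflections for randomly drawn unit vectors and confirms that their product maps a onto b; the dimension, the random seed, and the helper name `reflection` are arbitrary choices made for illustration.

```python
import numpy as np

def reflection(v):
    """Householder matrix I - 2 v v^T: reflection across the hyperplane orthogonal to v."""
    v = v / np.linalg.norm(v)
    return np.eye(v.size) - 2.0 * np.outer(v, v)

rng = np.random.default_rng(0)
a = rng.standard_normal(5)
a /= np.linalg.norm(a)
b = rng.standard_normal(5)
b /= np.linalg.norm(b)
c = (a + b) / np.linalg.norm(a + b)  # unit vector half-way between a and b

# Reflect across c-perp first, then across b-perp.
R = reflection(b) @ reflection(c)

print("R a - b:", R @ a - b)          # numerically zero
print("det(R): ", np.linalg.det(R))   # +1, i.e. a proper rotation
```

The product is orthogonal with determinant +1, which is what one expects from composing two reflections. The construction is undefined when a = -b, since c then has zero norm.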

Figure fig_11: 13
Type: figure
Caption: Figure 13: Depiction of how points within the same codebook region change after a gradient update (red arrow) at the codebook vector (orange circle). The STE applies the same update to each point in the same region. The reflection trick (Appendix A.8) modifies the update based on the location of each point with respect to the codebook vector. Note the top-left region of the reflection trick update, where the points actually move in the opposite direction of the gradient update.
Data: 

Figure fig_12: 
Type: figure
Caption: $x = r \cos\theta$, $y = r \sin\theta \cos\phi$, $z = r \sin\theta \sin\phi$. More generally, hyperspherical coordinates are composed of a radial coordinate $r$ and $d - 1$ angular coordinates $\theta_1, \ldots, \theta_{d-1}$, where $\theta_1, \ldots, \theta_{d-2}$ are supported over $[0, \pi]$ while $\theta_{d-1}$ ranges over $[0, 2\pi]$.
Data: 

Figure fig_13: 19
Type: figure
Caption: Figure 19: Visualization of basis vectors at different points under Cartesian (left) and spherical (right) coordinate systems. Notice that the Cartesian basis vectors do not change from point to point; however, the spherical basis vectors change in both direction and magnitude. Even at the same radius, the $\frac{\partial}{\partial \phi}$ coordinate changes based on the azimuth angle θ because the same infinitesimal change in ϕ will result in a longer (or shorter) change in arclength depending on the radius of the circle at latitude θ.
Data: 

Figure fig_14: 
Type: figure
Caption: Figure 20: (top) Visualization of vector transport in Cartesian coordinates and renormalized hyperspherical coordinates along curves γ1(t), γ2(t) and γ3(t). Notice the hyperspherical basis changes from point to point. (bottom) Depiction of the transported vector in terms of the basis vectors $\frac{\partial}{\partial x}$ and $\frac{\partial}{\partial y}$ for Cartesian coordinates and $\frac{\partial}{\partial \hat{r}}$ and $\frac{\partial}{\partial \hat{\theta}}$ for hyperspherical coordinates. Notice how the components of $\frac{\partial}{\partial \hat{r}}$ and $\frac{\partial}{\partial \hat{\theta}}$ change for a constant vector field in the Cartesian tangent space.
Data: 

Figure fig_15: 
Type: figure
Caption: $\nabla_{\dot{\gamma}(t)} v = \nabla_{\dot{\gamma}_1 e_1 + \dot{\gamma}_2 e_2 + \ldots + \dot{\gamma}_d e_d} (v_1 e_1 + v_2 e_2 + \ldots + v_d e_d)$ must be equal to 0 for parallel transport
Data: 

Figure fig_16: 
Type: figure
Caption: Figure 21: (top) Visualization of vector transport in hyperspherical coordinates with normalized basis vectors and Cartesian coordinates along curves γ1(t), γ2(t) and γ3(t). The vectors along each curve in hyperspherical coordinates rotate to stay constant with respect to the natural rotation of the basis vectors. This same rotation in Cartesian coordinates yields a non-constant vector as the Cartesian basis vectors do not change from point to point. (bottom) Depiction of the transported vector in terms of the basis vectors $\frac{\partial}{\partial \hat{r}}$ and $\frac{\partial}{\partial \hat{\theta}}$ for hyperspherical coordinates and $\frac{\partial}{\partial x}$ and $\frac{\partial}{\partial y}$ for Cartesian coordinates. In the former case, the transported vector remains constant with respect to the normalized basis vectors, while in Cartesian coordinates, the components change along γ3(t).
Data: 

Figure tab_0: 1
Type: table
Caption: Comparison of VQ-VAEs trained on ImageNet following Van Den Oord et al. (2017). We use the Vector Quantization layer from https://github.com/lucidrains/vector-quantize-pytorch.
Data:
Approach | Training Metrics | Validation Metrics
VQ-VAE | 100% | 0.107 | 5.9e-3 | 0.115 | 106.1 | 11.7
VQ-VAE w/ Rotation Trick | 97% | 0.116 | 5.1e-4 | 0.122 | 85.7 | 17.0
Codebook Lookup: Cosine & Latent Shape: 32 × 32 × 32 & Codebook Size: 1024
VQ-VAE | 75% | 0.107 | 2.9e-3 | 0.114 | 84.3 | 17.7
VQ-VAE w/ Rotation Trick | 91% | 0.105 | 2.7e-3 | 0.111 | 82.9 | 18.1
Codebook Lookup: Euclidean & Latent Shape: 64 × 64 × 3 & Codebook Size: 8192
VQ-VAE | 100% | 0.028 | 1.0e-3 | 0.030 | 19.0 | 97.3
Gumbel VQ-VAE | 39% | 0.054 | - | 0.058 | 28.6 | 74.9
VQ-VAE w/ Hessian Approx. | 39% | 0.082 | 6.9e-5 | 0.112 | 35.6 | 65.1
VQ-VAE w/ Exact Gradients | 84% | 0.050 | 2.0e-3 | 0.053 | 25.4 | 80.4
VQ-VAE w/ Rotation Trick | 99% | 0.028 | 1.4e-4 | 0.030 | 16.5 | 106.3
Codebook Lookup: Cosine & Latent Shape: 64 × 64 × 3 & Codebook Size: 8192
VQ-VAE | 31% | 0.034 | 1.2e-4 | 0.038 | 26.0 | 77.8
VQ-VAE w/ Hessian Approx. | 37% | 0.035 | 3.8e-5 | 0.037 | 29.0 | 71.5
VQ-VAE w/ Exact Gradients | 38% | 0.035 | 3.6e-5 | 0.037 | 28.2 | 75.0
VQ-VAE w/ Rotation Trick | 38% | 0.033 | 9.6e-5 | 0.035 | 24.2 | 83.9

Figure tab_1: 2
Type: table
Caption: Results for VQGAN designed for autoregressive generation as implemented in https://github.com/CompVis/taming-transformers. Experiments on ImageNet and the combined dataset FFHQ (Karras et al., 2019) and CelebA-HQ (Karras, 2017) use a latent bottleneck of dimension 16 × 16 × 256 with 1024 codebook vectors.
Data:
Approach | Dataset | Codebook Usage | Quantization Error (↓) | Valid Loss (↓) | r-FID (↓) | r-IS (↑)
VQGAN (reported) | ImageNet | - | - | - | 7.9 | 114.4
VQGAN (our run) | ImageNet | 95% | 0.134 | 0.594 | 7.3 | 118.2
VQGAN w/ Rotation Trick | ImageNet | 98% | 0.002 | 0.422 | 4.6 | 146.5
VQGAN | FFHQ & CelebA-HQ | 27% | 0.233 | 0.565 | 4.7 | 5.0
VQGAN w/ Rotation Trick | FFHQ & CelebA-HQ | 99% | 0.002 | 0.313 | 3.7 | 5.2

Figure tab_2: 3
Type: table
Caption: Results for VQGAN designed for latent diffusion as implemented in https://github.com/CompVis/latent-diffusion. Both settings train on ImageNet.
Data:
Approach | Latent Shape | Codebook Size | Codebook Usage | Quantization Error (↓) | Valid Loss (↓) | r-FID (↓) | r-IS (↑)
VQGAN | 64 × 64 × 3 | 8192 | 15% | 2.5e-3 | 0.183 | 0.53 | 220.6
Gumbel VQGAN | 64 × 64 × 3 | 8192 | 4% | - | 0.197 | 0.60 | 219.7
VQGAN w/ Rotation Trick | 64 × 64 × 3 | 8192 | 86% | 1.7e-4 | 0.142 | 0.27 | 228.0
VQGAN | 32 × 32 × 4 | 16384 | 2% | 1.2e-2 | 0.385 | 5.0 | 141.5
Gumbel VQGAN | 32 × 32 × 4 | 16384 | 12% | - | 0.303 | 11.7 | 189.5
VQGAN w/ Rotation Trick | 32 × 32 × 4 | 16384 | 27% | 2.4e-4 | 0.269 | 1.1 | 200.2

Figure tab_3: 4
Type: table
Caption: Results for ViT-VQGAN (Yu et al., 2021) trained on ImageNet. The latent shape is 8 × 8 × 32 with 8192 codebook vectors. r-FID and r-IS are reported on the validation set.
Data:
Approach | Codebook Usage (↑) | Train Loss (↓) | Quantization Error (↓) | Valid Loss (↓) | r-FID (↓) | r-IS (↑)
ViT-VQGAN [reported] | - | - | - | - | 22.8 | 72.9
ViT-VQGAN [ours] | 0.3% | 0.124 | 6.7e-3 | 0.127 | 29.2 | 43.0
ViT-VQGAN w/ Rotation Trick | 2.2% | 0.113 | 8.3e-3 | 0.113 | 11.2 | 93.1

Figure tab_4: 
Type: table
Caption: rather than CNN to parameterize the encoder and decoder. The ViT-VQGAN uses factorized codes and $L_2$ normalization on the output and input to the vector quantization layer to improve performance and training stability. Additionally, the authors change the training objective, adding a logit-laplace loss and restoring the $L_2$ reconstruction error to $L_{\text{VQGAN}}$. Experimental Settings. We follow the open source implementation of https://github.com/thuanz123/enhancing-transformers and use the default model and hyperparameter settings for the small ViT-VQGAN. A complete description of the training settings can be found in Table 10 of the Appendix.
Data: 

Figure tab_5: 5
Type: table
Caption: Results for TimeSformer-VQGAN trained on BAIR and UCF-101 with 1024 codebook vectors. †: model suffers from codebook collapse and diverges. r-FVD is computed on the validation set.
Data:
Approach | Dataset | Codebook Usage | Train Loss (↓) | Quantization Error (↓) | Valid Loss (↓) | r-FVD (↓)
TimeSformer † | BAIR | 0.4% | 0.221 | 0.03 | 0.281 | 661.1
TimeSformer w/ Rotation Trick | BAIR | 43% | 0.074 | 3.0e-3 | 0.074 | 21.4
TimeSformer † | UCF-101 | 0.1% | 0.190 | 0.006 | 0.169 | 2878.1
TimeSformer w/ Rotation Trick | UCF-101 | 30% | 0.111 | 0.020 | 0.109 | 229.1

Figure tab_6: 
Type: table
Caption: video model. Due to compute limitations, both encoder and decoder follow a relatively small TimeSformer model: 8 layers, 256 hidden dimensions, 4 attention heads, and 768 MLP hidden dimensions. A complete description of the architecture, training settings, and hyperparameters is provided in Appendix A.10.4.
Data: 

Figure tab_8: 
Type: table
Caption: Rotation Trick Function Latent Shape Codebook Size Codebook Usage Quantization Error (↓) Valid Loss (↓) r-FID (↓) r-IS (↑)
Data: ∥q∥ ∥e∥ Re64 × 64 × 3819245%4.0e-40.1610.46225.0Re -(q -Re)64 × 64 × 3819228%1.5e-30.1830.6220.0∥q∥ ∥e∥ Re32 × 32 × 41638418%3.3e-40.2921.5196.1Re -(q -Re)32 × 32 × 41638413%9.4e-40.2921.5191.5


Formulas:
Formula formula_0: $Q(q = i \mid e) = \begin{cases} 1 & \text{if } i = \arg\min_{1 \le j \le |C|} \lVert e - q_j \rVert_2 \\ 0 & \text{otherwise} \end{cases}$
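As an illustration of formula_0, the nearest-codebook lookup can be sketched as follows. The four-entry codebook and the query point are hypothetical values, and `quantize` is a name introduced here, not a function from the paper's code.

```python
import numpy as np

def quantize(e, codebook):
    """Return (index, vector) of the codebook row closest to e in Euclidean distance."""
    distances = np.linalg.norm(codebook - e, axis=1)
    index = int(np.argmin(distances))
    return index, codebook[index]

# Hypothetical 2D codebook with four entries.
codebook = np.array([[0.0, 0.0],
                     [1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
e = np.array([0.9, 0.2])  # hypothetical encoder output

index, q = quantize(e, codebook)
print(index, q)
```

Here the encoder output (0.9, 0.2) lies closest to the codebook row (1.0, 0.0), so the lookup selects index 1.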

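Formula formula_1: $L(x) = \lVert x - \tilde{x} \rVert_2^2 + \lVert \mathrm{sg}(e) - q \rVert_2^2 + \beta \lVert e - \mathrm{sg}(q) \rVert_2^2$

Formula formula_2: $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial q} \, I \, \frac{\partial e}{\partial x}$

Formula formula_3: $L_e \approx L_q + (\nabla_q L)^T (e - q) + \frac{1}{2} (e - q)^T (\nabla_q^2 L)(e - q)$

Formula formula_4: $\frac{\partial L}{\partial e} \approx \frac{\partial}{\partial e} \left[ L_q + (\nabla_q L)^T (e - q) + \frac{1}{2} (e - q)^T (\nabla_q^2 L)(e - q) \right] = \nabla_q L + (\nabla_q^2 L)(e - q)$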

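Formula formula_5: $\frac{\partial \tilde{q}}{\partial e} = \frac{\lVert q \rVert}{\lVert e \rVert} R$

Formula formula_6: $\tilde{q} = \lambda R e = \lambda (I - 2rr^T + 2\hat{q}\hat{e}^T) e = \lambda [e - 2rr^T e + 2\hat{q}\hat{e}^T e]$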
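The last form of formula_6 avoids building the d × d matrix R, since it needs only inner products and vector scalings. The NumPy sketch below illustrates the forward computation under the assumptions that r = (ê + q̂)/‖ê + q̂‖ and λ = ‖q‖/‖e‖, with hats denoting unit-normalized vectors. The function names and random test vectors are hypothetical, and because NumPy has no automatic differentiation, treating λ and R as constants during backpropagation is not represented here.

```python
import numpy as np

def rotation_trick_forward(e, q):
    """Compute lam * [e - 2 r r^T e + 2 q_hat e_hat^T e] without materializing R."""
    e_hat = e / np.linalg.norm(e)
    q_hat = q / np.linalg.norm(q)
    r = (e_hat + q_hat) / np.linalg.norm(e_hat + q_hat)
    lam = np.linalg.norm(q) / np.linalg.norm(e)
    return lam * (e - 2.0 * r * (r @ e) + 2.0 * q_hat * (e_hat @ e))

def explicit_rotation(e, q):
    """Materialize R = I - 2 r r^T + 2 q_hat e_hat^T (only used to check the shortcut)."""
    e_hat = e / np.linalg.norm(e)
    q_hat = q / np.linalg.norm(q)
    r = (e_hat + q_hat) / np.linalg.norm(e_hat + q_hat)
    return np.eye(e.size) - 2.0 * np.outer(r, r) + 2.0 * np.outer(q_hat, e_hat)

rng = np.random.default_rng(0)
e = rng.standard_normal(6)  # hypothetical encoder output
q = rng.standard_normal(6)  # hypothetical codebook vector

q_tilde = rotation_trick_forward(e, q)
R = explicit_rotation(e, q)
print(np.max(np.abs(q_tilde - q)))  # the forward pass reproduces q up to rounding
```

The shortcut costs O(d) per vector rather than the O(d^2) of an explicit matrix product, which is the practical reason for preferring the expanded form.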

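Formula formula_7: $L = \lVert x - \tilde{x} \rVert_2^2 + \lVert \mathrm{sg}(e) - q \rVert_2^2 + \beta \lVert e - \mathrm{sg}(q) \rVert_2^2$

Formula formula_8: $L_{\text{VQGAN}} = L_{\text{Per}} + \lVert \mathrm{sg}(e) - q \rVert_2^2 + \beta \lVert e - \mathrm{sg}(q) \rVert_2^2 + \lambda L_{\text{Adv}}$

Formula formula_9: $\{L_e\}_{\text{Hessian}} = L_q + (\nabla_q L)^T (e - q) + \frac{1}{2} (e - q)^T (\nabla_q^2 L)(e - q)$

Formula formula_10: $\frac{\partial L_e}{\partial e} - \left\{ \frac{\partial L_e}{\partial e} \right\}_{\text{Hessian}} = \frac{\partial}{\partial e} O(\lVert e - q \rVert^3)$

Formula formula_11: $\lVert q - e \rVert^2 = \lVert q \rVert^2 + \lVert e \rVert^2 - 2 \lVert q \rVert \lVert e \rVert \cos(\theta)$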

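Formula formula_12: $\lVert \hat{q} - \hat{e} \rVert^2 = \lVert q - e \rVert^2 = \lVert \hat{q} \rVert^2 + \lVert \hat{e} \rVert^2 - 2 \lVert \hat{q} \rVert \lVert \hat{e} \rVert \cos\hat{\theta}$

Formula formula_13: $\cos\hat{\theta} = \frac{\lVert q \rVert^2 + \lVert e \rVert^2 - 2 \lVert q \rVert \lVert e \rVert \cos(\theta) - \lVert q + d \rVert^2 - \lVert e + d \rVert^2}{-2 \lVert q + d \rVert \lVert e + d \rVert}$

Formula formula_14: $\cos\hat{\theta} \approx \frac{-2 \lVert d \rVert^2}{-2 \lVert d \rVert^2} = 1$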

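Formula formula_15:
$$\begin{aligned}
R &= (I - 2qq^T)(I - 2rr^T) \\
&= I - 2qq^T - 2rr^T + 4qq^T rr^T \\
&= I - 2qq^T - 2rr^T + 4q \left[ q^T r \right] r^T \\
&= I - 2qq^T - 2rr^T + 4q \left[ q^T \frac{e + q}{\lVert e + q \rVert} \right] r^T \\
&= I - 2qq^T - 2rr^T + 4q \left[ \frac{q^T e + q^T q}{\lVert e + q \rVert} \right] r^T \\
&= I - 2qq^T - 2rr^T + 4q \left[ \frac{\lVert q \rVert \lVert e \rVert \cos\theta + \lVert q \rVert \lVert q \rVert}{\lVert e + q \rVert} \right] r^T \\
&= I - 2qq^T - 2rr^T + 4q \left[ \frac{\cos\theta + 1}{\lVert e + q \rVert} \right] r^T \\
&= I - 2qq^T - 2rr^T + 4q \left[ \frac{\lVert e + q \rVert^2}{2 \lVert e + q \rVert} \right] r^T \\
&= I - 2qq^T - 2rr^T + \frac{4 \lVert e + q \rVert^2}{2 \lVert e + q \rVert} \, q r^T \\
&= I - 2qq^T - 2rr^T + \frac{4 \lVert e + q \rVert^2}{2 \lVert e + q \rVert^2} \, q (e + q)^T \\
&= I - 2qq^T - 2rr^T + 2qe^T + 2qq^T \\
&= I - 2rr^T + 2qe^T
\end{aligned}$$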

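Formula formula_16: $q = e R^T, \qquad \frac{\partial q}{\partial e} = R$

Formula formula_17: $\nabla_e L = \nabla_q L \, \frac{\partial q}{\partial e} = \nabla_q L \, [R]$

Formula formula_18: $\lVert \nabla_q L \rVert \cos\theta = q [\nabla_q L]^T = e R^T [\nabla_q L]^T = e [\nabla_q L \, R]^T = e [\nabla_e L]^T$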

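Formula formula_19: $\tilde{q} = \frac{\lVert q \rVert}{\lVert e \rVert} R e$

Formula formula_20: $f(e) = \frac{\lVert Q(e) \rVert}{\lVert e \rVert} \left[ I - 2 \left( \frac{e + Q(e)}{\lVert e + Q(e) \rVert} \right) \left( \frac{e + Q(e)}{\lVert e + Q(e) \rVert} \right)^T + 2 Q(e) e^T \right] = \frac{\lVert q \rVert}{\lVert e \rVert} R$

Formula formula_21: $\tilde{q} = f(e) e$

Formula formula_22: $\frac{\partial \tilde{q}}{\partial e} = f'(e) e + f(e)$

Formula formula_23: $\tilde{q} = \underbrace{R}_{\text{constant}} e + \underbrace{(q - Re)}_{\text{constant}}$

Formula formula_24: $\lVert e - q \rVert = \sqrt{\lVert e \rVert^2 + \lVert q \rVert^2 - 2 \lVert e \rVert \lVert q \rVert \cos\theta}$

Formula formula_25: $\tilde{q} = \gamma(e) R e + (q - \gamma(e) R e)$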

Formula formula_26: γ(e) = 1 8∥q -e∥ 2

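Formula formula_27:
$$\begin{aligned}
x_{d-1} &= r \sin(\theta_1) \cdots \sin(\theta_{d-2}) \cos(\theta_{d-1}) \\
x_d &= r \sin(\theta_1) \cdots \sin(\theta_{d-2}) \sin(\theta_{d-1})
\end{aligned}$$

Formula formula_28:
$$\begin{aligned}
r &= \sqrt{(x_1)^2 + (x_2)^2 + \ldots + (x_d)^2} \\
\theta_1 &= \operatorname{arctan2}\left( \sqrt{(x_d)^2 + \ldots + (x_2)^2}, \; x_1 \right) \\
\theta_2 &= \operatorname{arctan2}\left( \sqrt{(x_d)^2 + \ldots + (x_3)^2}, \; x_2 \right)
\end{aligned}$$

Formula formula_29:
$$\begin{aligned}
\theta_{d-2} &= \operatorname{arctan2}\left( \sqrt{(x_d)^2 + (x_{d-1})^2}, \; x_{d-2} \right) \\
\theta_{d-1} &= \operatorname{arctan2}\left( \sqrt{(x_d)^2}, \; x_{d-1} \right)
\end{aligned}$$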
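The coordinate maps in formula_27 through formula_29 can be exercised with a round trip. The sketch below is illustrative only and makes one deliberate departure, flagged in the code: the last angle is computed with the signed arctan2(x_d, x_{d-1}), a common convention that lets θ_{d-1} cover the full circle, rather than with the square root of (x_d)^2 as written above. The function names and test vector are hypothetical.

```python
import numpy as np

def to_hyperspherical(x):
    """Cartesian vector (d >= 2) -> radius r and d-1 angles."""
    x = np.asarray(x, dtype=float)
    d = x.size
    r = np.linalg.norm(x)
    thetas = np.empty(d - 1)
    for i in range(d - 2):
        # theta_{i+1} = arctan2(norm of the remaining coordinates, x_{i+1}), in [0, pi]
        thetas[i] = np.arctan2(np.linalg.norm(x[i + 1:]), x[i])
    # Assumption: signed arctan2 for the last angle so the sign of x_d is retained.
    thetas[d - 2] = np.arctan2(x[d - 1], x[d - 2])
    return r, thetas

def to_cartesian(r, thetas):
    """Radius and d-1 angles -> Cartesian vector."""
    d = thetas.size + 1
    x = np.empty(d)
    sin_prod = 1.0
    for i in range(d - 1):
        x[i] = r * sin_prod * np.cos(thetas[i])
        sin_prod *= np.sin(thetas[i])
    x[d - 1] = r * sin_prod  # product of all sines, no trailing cosine
    return x

x = np.array([1.0, -2.0, 0.5, 3.0, -1.5])  # hypothetical test vector
r, thetas = to_hyperspherical(x)
x_back = to_cartesian(r, thetas)
print(np.max(np.abs(x - x_back)))
```

For d = 3 this reduces to x = r cos θ, y = r sin θ cos ϕ, z = r sin θ sin ϕ, matching the spherical special case given in the figure captions.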

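Formula formula_30: $\frac{\partial}{\partial \theta_i} = \sum_{k=1}^{d} \frac{\partial x_k}{\partial \theta_i} \frac{\partial}{\partial x_k}$

Formula formula_31:
$$\begin{bmatrix} \frac{\partial}{\partial x_1} & \frac{\partial}{\partial x_2} & \cdots & \frac{\partial}{\partial x_d} \end{bmatrix}
\underbrace{\begin{bmatrix}
\frac{\partial x_1}{\partial r} & \frac{\partial x_1}{\partial \theta_1} & \cdots & \frac{\partial x_1}{\partial \theta_{d-1}} \\
\frac{\partial x_2}{\partial r} & \frac{\partial x_2}{\partial \theta_1} & \cdots & \frac{\partial x_2}{\partial \theta_{d-1}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial x_d}{\partial r} & \frac{\partial x_d}{\partial \theta_1} & \cdots & \frac{\partial x_d}{\partial \theta_{d-1}}
\end{bmatrix}}_{\text{The Jacobian } J}
= \begin{bmatrix} \frac{\partial}{\partial r} & \frac{\partial}{\partial \theta_1} & \cdots & \frac{\partial}{\partial \theta_{d-1}} \end{bmatrix}$$

Formula formula_32: $ds^2 = (dx_1)^2 + (dx_2)^2 + \ldots + (dx_d)^2$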

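Formula formula_33: $ds^2 = dr^2 + r^2 (d\theta_1)^2 + r^2 \sin^2\theta_1 (d\theta_2)^2 + \cdots + r^2 \prod_{i=1}^{d-2} \sin^2\theta_i \, (d\theta_{d-1})^2$

Formula formula_34:
$$\begin{aligned}
\left\lVert \frac{\partial}{\partial \theta_i} \right\rVert
&= \sqrt{\left\langle \frac{\partial}{\partial \theta_i}, \frac{\partial}{\partial \theta_i} \right\rangle}
= \sqrt{\left( \sum_{k=1}^{d} \frac{\partial x_k}{\partial \theta_i} \frac{\partial}{\partial x_k} \right) \cdot \left( \sum_{j=1}^{d} \frac{\partial x_j}{\partial \theta_i} \frac{\partial}{\partial x_j} \right)} \\
&= \sqrt{\sum_{k=1}^{d} \frac{\partial x_k}{\partial \theta_i} \frac{\partial x_k}{\partial \theta_i} \frac{\partial}{\partial x_k} \cdot \frac{\partial}{\partial x_k}}
= \sqrt{\sum_{k=1}^{d} \left( \frac{\partial x_k}{\partial \theta_i} \right)^2}
\end{aligned}$$

Formula formula_35:
$$I = \begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
0 & r^2 & 0 & \cdots & 0 \\
0 & 0 & r^2 \sin^2\theta_1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & r^2 \prod_{i=1}^{d-2} \sin^2\theta_i
\end{bmatrix}$$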

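Formula formula_36: $\left\lVert \frac{\partial}{\partial \theta_i} \right\rVert = \sqrt{\left\langle \frac{\partial}{\partial \theta_i}, \frac{\partial}{\partial \theta_i} \right\rangle}$. Therefore, our normalized hyperspherical basis vectors $\frac{\partial}{\partial \hat{r}}, \frac{\partial}{\partial \hat{\theta}_1}$

Formula formula_37: $\frac{\partial}{\partial \hat{r}} = \frac{\partial}{\partial r}, \qquad \frac{\partial}{\partial \hat{\theta}_i} = (I_{ii})^{-\frac{1}{2}} \frac{\partial}{\partial \theta_i}$

Formula formula_38: $\frac{\partial}{\partial \hat{\theta}_i} = (I_{ii})^{-\frac{1}{2}} \sum_{k=1}^{d} \frac{\partial x_k}{\partial \theta_i} \frac{\partial}{\partial x_k}$

Formula formula_39:
$$\begin{bmatrix} \frac{\partial}{\partial x_1} & \frac{\partial}{\partial x_2} & \cdots & \frac{\partial}{\partial x_d} \end{bmatrix}
\underbrace{\begin{bmatrix}
\frac{\partial x_1}{\partial \hat{r}} & \frac{\partial x_1}{\partial \hat{\theta}_1} & \cdots & \frac{\partial x_1}{\partial \hat{\theta}_{d-1}} \\
\frac{\partial x_2}{\partial \hat{r}} & \frac{\partial x_2}{\partial \hat{\theta}_1} & \cdots & \frac{\partial x_2}{\partial \hat{\theta}_{d-1}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial x_d}{\partial \hat{r}} & \frac{\partial x_d}{\partial \hat{\theta}_1} & \cdots & \frac{\partial x_d}{\partial \hat{\theta}_{d-1}}
\end{bmatrix}}_{\hat{J} \in SO(d)}
= \begin{bmatrix} \frac{\partial}{\partial \hat{r}} & \frac{\partial}{\partial \hat{\theta}_1} & \cdots & \frac{\partial}{\partial \hat{\theta}_{d-1}} \end{bmatrix} \tag{1}$$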

Formula formula_40: Ĵv = (R p0→ p)v = (R x d-1 -x d θ d • • • R x 2 -x 3 θ2 R x 1 -x 2 θ1 )v Ĵ-1 v = ĴT v = (R p0→ p) T v = (R x d-1 -x d θ d • • • R x 2 -x 3 θ2 R x 1 -x 2 θ1 ) T v = R p→ p0 v

Formula formula_41: v T p R x d-1 -x d θ d • • • R x 2 -x 3 θ2 R x 1 -x 2 θ1 = ṽT p R x d-1 -x d θ d • • • R x 2 -x 3 θ2 R x 1 -x 2 θ1 T v p = ṽp [R p→ p0 ] v p = ṽp

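Formula formula_42:
$$\begin{aligned}
\hat{I} &= \operatorname{diag}\left( \frac{\partial}{\partial \hat{r}} \cdot \frac{\partial}{\partial \hat{r}}, \; \frac{\partial}{\partial \hat{\theta}_1} \cdot \frac{\partial}{\partial \hat{\theta}_1}, \; \frac{\partial}{\partial \hat{\theta}_2} \cdot \frac{\partial}{\partial \hat{\theta}_2}, \; \ldots, \; \frac{\partial}{\partial \hat{\theta}_{d-1}} \cdot \frac{\partial}{\partial \hat{\theta}_{d-1}} \right) \\
&= \operatorname{diag}\left( (I_{11})^{-1} \frac{\partial}{\partial r} \cdot \frac{\partial}{\partial r}, \; (I_{22})^{-1} \frac{\partial}{\partial \theta_1} \cdot \frac{\partial}{\partial \theta_1}, \; (I_{33})^{-1} \frac{\partial}{\partial \theta_2} \cdot \frac{\partial}{\partial \theta_2}, \; \ldots, \; (I_{dd})^{-1} \frac{\partial}{\partial \theta_{d-1}} \cdot \frac{\partial}{\partial \theta_{d-1}} \right) \\
&= \operatorname{diag}\left( (I_{11})^{-1} (I_{11}), \; (I_{22})^{-1} (I_{22}), \; (I_{33})^{-1} (I_{33}), \; \ldots, \; (I_{dd})^{-1} (I_{dd}) \right) \\
&= \operatorname{diag}(1, 1, 1, \ldots, 1)
\end{aligned} \tag{2}$$

Formula formula_43: $\frac{\partial}{\partial \hat{\theta}_i} \cdot \frac{\partial}{\partial \hat{\theta}_i} = 1.$

Formula formula_44: $\nabla_{\dot{\gamma}(t)} v = \vec{0}$ (Parallel Transport Condition)

Formula formula_45: $g_{ij} = \delta_{ij} = \begin{cases} 0 & \text{if } i \neq j \\ 1 & \text{if } i = j \end{cases}$

Formula formula_46: $\Gamma^m_{ij} = \frac{1}{2} g^{mk} \left( \frac{\partial g_{jk}}{\partial x_i} + \frac{\partial g_{ik}}{\partial x_j} - \frac{\partial g_{ij}}{\partial x_k} \right) = 0$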

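Formula formula_47:
$$\begin{aligned}
0 = (\nabla_{\dot{\gamma}(t)} v)_i
&= \dot{\gamma}_i \frac{\partial}{\partial x_i} (v_1 e_1 + v_2 e_2 + \ldots + v_d e_d)
= \dot{\gamma}_i \frac{\partial}{\partial x_i} \left( v_k e_k \right) \\
&= \dot{\gamma}_i \left( \frac{\partial v_k}{\partial x_i} e_k + v_k \frac{\partial e_k}{\partial x_i} \right)
= \dot{\gamma}_i \left( \frac{\partial v_k}{\partial x_i} e_k + v_k \Gamma^m_{ik} e_m \right)
= \dot{\gamma}_i \frac{\partial v_k}{\partial x_i} e_k
\end{aligned}$$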

Formula formula_48: ∇ q L T = ∇ q LT Ĵ-1 q ∇ q L = Ĵq ∇ q L = [R p0→q ] ∇ q L = R θ d-1 R θ d-2 • • • R θ1 ∇ q L

Formula formula_49: ∇ e L T = ∇ q LT Ĵ-1 e ∇ e L = Ĵe ∇ q L = [R p0→ẽ ] ∇ q L = [R q→ẽ R p0→q ] ∇ q L = R q→ẽ R p0→q ∇ q L = [R q→ẽ ] ∇ q L

