['3c3', '< Abstract: Vector Quantized Variational AutoEncoders (VQ-VAEs) are designed to compress a continuous input to a discrete latent space and reconstruct it with minimal distortion. They operate by maintaining a set of vectors-often referred to as the codebook-and quantizing each encoder output to the nearest vector in the codebook. However, as vector quantization is non-differentiable, the gradient to the encoder flows around the vector quantization layer rather than through it in a straight-through approximation. This approximation may be undesirable as all information from the vector quantization operation is lost. In this work, we propose a way to propagate gradients through the vector quantization layer of VQ-VAEs. We smoothly transform each encoder output into its corresponding codebook vector via a rotation and rescaling linear transformation that is treated as a constant during backpropagation. As a result, the relative magnitude and angle between encoder output and codebook vector becomes encoded into the gradient as it propagates through the vector quantization layer and back to the encoder. Across 11 different VQ-VAE training paradigms, we find this restructuring improves reconstruction metrics, codebook utilization, and quantization error. Our code is available at https://github.com/cfifty/rotation_trick.', '---', '> Abstract: Vector Quantized Variational AutoEncoders (VQ-VAEs) are powerful generative models that compress continuous inputs into discrete latent representations using a codebook. A critical challenge in VQ-VAE training is the non-differentiable nature of the vector quantization step, typically addressed by a Straight-Through Estimator (STE). However, the STE completely bypasses the quantization operation, leading to a loss of crucial geometric information and often resulting in sub-optimal performance, training instabilities, and under-utilized codebooks. 
In this work, we introduce the **Rotation Trick**, a novel and geometrically-informed method for propagating gradients through the vector quantization layer. Our approach smoothly transforms each encoder output into its corresponding codebook vector via a rotation and rescaling linear transformation. Crucially, this transformation is treated as a constant during backpropagation, allowing the relative magnitude and angle between the encoder output and its chosen codebook vector to be explicitly encoded into the gradient. This rich geometric information then propagates back to the encoder, enabling more effective and stable updates. We rigorously evaluate the Rotation Trick across 11 diverse VQ-VAE training paradigms and consistently demonstrate significant improvements in reconstruction metrics, codebook utilization, and quantization error. Our code is available at https://github.com/cfifty/rotation_trick.', '6,10c6', '< Vector quantization (Gray, 1984) is an approach to discretize a continuous vector space. It defines a finite set of vectors-referred to as the codebook-and maps any vector in the continuous vector space to the closest vector in the codebook. However, deep learning paradigms that use vector quantization are often difficult to train because replacing a vector with its closest codebook counterpart is a nondifferentiable operation (Huh et al., 2023). This characteristic was not an issue at its creation during the Renaissance of Information Theory for applications like noisy channel communication (Cover, 1999); however in the era deep learning, it presents a challenge as gradients cannot directly flow through layers that use vector quantization during backpropagation.', "< In deep learning, vector quantization is largely used in the eponymous Vector Quantized-Variational AutoEncoder (VQ-VAE) (Van Den Oord et al., 2017). 
A VQ-VAE is an AutoEncoder with a vector quantization layer between the encoder's output and decoder's input, thereby quantizing the learned representation at the bottleneck. While VQ-VAEs are ubiquitous in state-of-the-art generative modeling (Rombach et al., 2022;Dhariwal et al., 2020;Brooks et al., 2024), their gradients cannot flow from the decoder to the encoder uninterrupted as they must pass through a non-differentiable vector quantization layer.", '< A solution to the non-differentiability problem is to approximate gradients via a "straight-through estimator" (STE) (Bengio et al., 2013). During backpropagation, the STE copies and pastes the gradients from the decoder\'s input to the encoder\'s output, thereby skipping the quantization operation altogether. However, this approximation can lead to poor-performing models and codebook collapse: a phenomena where a large percentage of the codebook converge to zero norm and are unused by the model (Mentzer et al., 2023). Even if codebook collapse does not occur, the codebook is often under-utilized, thereby limiting the information capacity of the VQ-VAEs\'s bottleneck (Dhariwal et al., 2020). instabilities caused by the vector quantization layer. We partition these efforts into two categories: (1) methods that sidestep the STE and (2) methods that improve codebook-model interactions.', '< Sidestepping the STE. Several prior works have sought to fix the problems caused by the STE by avoiding deterministic vector quantization. Baevski et al. (2019) employ the Gumbel-Softmax trick (Jang et al., 2016) to fit a categorical distribution over codebook vectors that converges to a one-hot distribution towards the end of training, Gautam et al. (2023) quantize using a convex combination of codebook vectors, and Takida et al. (2022) employ stochastic quantization. Unlike the above that cast vector quantization as a distribution over codebook vectors, Huh et al. 
(2023) propose an alternating optimization where the encoder is optimized to output representations close to the codebook vectors while the decoder minimizes reconstruction loss from a fixed set of codebook vector inputs. While these approaches sidestep the training instabilities caused by the STE, they can introduce their own set of problems and complexities such as low codebook utilization at inference and the tuning of a temperature schedule (Zhang et al., 2023). As a result, many applications and research papers continue to employ VQ-VAEs that are trained using the STE (Rombach et al., 2022;Chang et al., 2022;Huang et al., 2023;Zhu et al., 2023;Dong et al., 2023).', '< Codebook-Model Improvements. Another way to attack codebook collapse or under-utilization is to change the codebook lookup. Rather than use Euclidean distance, Yu et al. (2021) employ a cosine similarity measure, Goswami et al. (2024) a hyperbolic metric, and Lee et al. (2022) stochastically sample codes as a function of the distance between the encoder output and codebook vectors. Another perspective examines the learning of the codebook. Kolesnikov et al. (2022) split high-usage codebook vectors, Dhariwal et al. (2020); Łańcucki et al. (2020); Zheng & Vedaldi (2023) resurrect low-usage codebook vectors throughout training, Chen et al. (2024) dynamically selects one of m codebooks for each datapoint, and Mentzer et al. (2023); Zhao et al. (2024); Yu et al. (2023); Chiu et al. (2022) fix the codebook vectors to an a priori geometry and train the model without learning the codebook at all. Other works propose loss penalties to encourage codebook utilization. Zhang et al. (2023) add a KL-divergence penalty between codebook utilization and a uniform distribution while Yu et al. (2023) add an entropy loss term to penalize low codebook utilization. 
While effective at targeting specific training difficulties, as each of these methods continue to use the STE, the training instability caused by this estimator persist. Most of our experiments in Section 5 implement a subset of these approaches, and we find that replacing the STE with the rotation trick further improves performance.', '---', '> Vector quantization (VQ) (Gray, 1984) is a fundamental technique for discretizing continuous vector spaces, mapping vectors to the nearest element within a finite set known as a codebook. While historically crucial in information theory (Cover, 1999), its non-differentiable nature poses a significant obstacle for modern deep learning architectures, preventing direct gradient flow during backpropagation (Huh et al., 2023). This challenge is particularly acute in Vector Quantized-Variational AutoEncoders (VQ-VAEs) (Van Den Oord et al., 2017), where a VQ layer quantizes the learned representation at the bottleneck. Despite their ubiquity in state-of-the-art generative modeling (Rombach et al., 2022; Dhariwal et al., 2020; Brooks et al., 2024), VQ-VAEs struggle with uninterrupted gradient flow from the decoder to the encoder due to this non-differentiable layer.', '11a8,12', '> The prevailing solution is the "straight-through estimator" (STE) (Bengio et al., 2013), which approximates gradients by copying them directly from the decoder\'s input to the encoder\'s output, completely bypassing the quantization operation. This approximation, however, comes with substantial drawbacks: it frequently leads to sub-optimal model performance, significant training instabilities, and the undesirable phenomenon of "codebook collapse," where a large portion of the codebook vectors become under-utilized or entirely unused (Mentzer et al., 2023; Dhariwal et al., 2020). These issues severely limit the information capacity of the VQ-VAE\'s bottleneck. 
Current research efforts to mitigate these problems can be broadly categorized into two main areas: (1) methods that circumvent the STE and (2) methods that enhance codebook-model interactions.', "> Sidestepping the STE. Prior works have explored alternatives to deterministic vector quantization to address STE-related issues. Baevski et al. (2019) utilize the Gumbel-Softmax trick (Jang et al., 2016) to learn a categorical distribution over codebook vectors, which eventually hardens to a one-hot distribution. Gautam et al. (2023) propose quantization via a convex combination of codebook vectors, and Takida et al. (2022) introduce stochastic quantization. Diverging from probabilistic approaches, Huh et al. (2023) employ an alternating optimization scheme where the encoder learns representations close to codebook vectors, while the decoder reconstructs from fixed codebook inputs. Although these methods circumvent STE's training instabilities, they often introduce their own complexities, such as reduced codebook utilization during inference, the need for careful temperature schedule tuning (Zhang et al., 2023), or intricate multi-stage optimization. Consequently, a vast number of state-of-the-art generative modeling applications and research continue to rely on VQ-VAEs trained with the STE (Rombach et al., 2022; Chang et al., 2022; Huang et al., 2023; Zhu et al., 2023; Dong et al., 2023).", '> ', '> Codebook-Model Improvements. Another avenue for addressing codebook collapse or under-utilization involves modifying the codebook lookup mechanism or the codebook learning process itself. Examples include using cosine similarity (Yu et al., 2021) or hyperbolic metrics (Goswami et al., 2024) instead of Euclidean distance for codebook lookups, or stochastically sampling codes based on the distance between encoder output and codebook vectors (Lee et al., 2022). Other works focus on dynamic codebook management: Kolesnikov et al. 
(2022) split highly-used codebook vectors, while Dhariwal et al. (2020); Łańcucki et al. (2020); Zheng & Vedaldi (2023) re-initialize or "resurrect" under-utilized vectors. Chen et al. (2024) dynamically select from multiple codebooks, and Mentzer et al. (2023); Zhao et al. (2024); Yu et al. (2023); Chiu et al. (2022) fix codebook vectors to a predefined geometry, eliminating codebook learning altogether. Additionally, various loss penalties have been proposed to encourage codebook utilization, such as KL-divergence penalties (Zhang et al., 2023) or entropy loss terms (Yu et al., 2023). While these methods effectively target specific training difficulties, they fundamentally operate within the STE framework, meaning the inherent training instabilities and information loss caused by this estimator persist. The **Rotation Trick**, introduced in this paper, is orthogonal to these approaches and can significantly enhance their performance by providing a more informed gradient signal, as demonstrated across a wide range of VQ-VAE training paradigms in Section 5.', '> ', '18c19', '< In the subsequent analysis, we focus only on the ∥x -x∥ 2 2 term as the other two are not functions of the decoder. During backpropagation, the model must differentiate through the vector quantization where ∂L ∂q represents backpropagation through the decoder, ∂q ∂e represents backpropagation through the vector quantization layer, and ∂e ∂x represents backpropagation through the encoder. As vector quantization is not a smooth transformation, ∂q ∂e cannot be computed and gradients cannot flow through this term to update the encoder in backpropagation.', '---', "> For the subsequent analysis, we primarily focus on the ∥x -x∥ 2 2 term, as the other two do not directly depend on the decoder's output. During backpropagation, the model must differentiate through the vector quantization layer. 
The total gradient ∂L/∂x is composed as ∂L/∂x = (∂L/∂q)(∂q/∂e)(∂e/∂x), where ∂L/∂q represents backpropagation through the decoder, ∂q/∂e through the vector quantization layer, and ∂e/∂x through the encoder. Since vector quantization is a non-smooth, non-differentiable operation, ∂q/∂e cannot be directly computed, thereby blocking gradient flow to the encoder.", '31,32c32,33', "< As discussed in Section 3, updating the encoder's parameters by approximating, or exactly, computing the gradient at the encoder's output is undesirable. Similarly, the STE appears to lose information: the location of e within the quantized region-be it close to q or far away at the boundary-has no impact on the gradient update to the encoder. Capturing this information, i.e. using the location of e in relation to q to transform the gradients through ∂q ∂e , could be beneficial to the encoder's gradient updates and an improvement over the STE.", '< Viewed geometrically, we ask how to move the gradient ∇ q L from q to e, and what characteristics of ∇ q L and q should be preserved during this movement. The STE offers one possible answer: move the gradient from q to e so that its direction and magnitude are preserved. However, this paper supplies a different answer: move the gradient so that the angle between ∇ q L and q is preserved as ∇ q L moves to e. We term this approach "the rotation trick", and in Section 4.3 we show that preserving the angle between q and ∇ q L conveys desirable properties to how points move within the same quantized region.', '---', "> As established in Section 3, updating the encoder's parameters by either approximating or exactly computing the gradient at the encoder's output is often suboptimal for VQ-VAEs. Furthermore, the STE inherently discards critical information: the precise location of `e` within its quantized region, whether it is close to `q` or at its boundary, has no influence on the gradient update to the encoder. 
We posit that explicitly using this geometric relationship between `e` and `q` to transform gradients through ∂q/∂e could benefit the encoder's gradient updates and improve on the STE.", '> Geometrically, the core problem is how to effectively transport the gradient ∇ q L from `q` to `e`, and which characteristics of ∇ q L and `q` should be preserved during this process. The STE provides one answer: transport the gradient from `q` to `e` by preserving its direction and magnitude. This paper, however, proposes a different answer: transport the gradient such that the angle between ∇ q L and `q` is preserved as ∇ q L moves to `e`. We term this approach "the Rotation Trick." In Section 4.3, we show that preserving this angle imparts desirable properties to how points move within the same quantized region, which can increase codebook utilization and reduce quantization error.', '34,35c35,37', '< Section: THE ROTATION TRICK PRESERVES ANGLES', "< In this section, we formally define the rotation trick. For encoder output e, let q = Q(e) represent the corresponding codebook vector. Q(•) is non-differentiable so gradients cannot flow through this layer during the backward pass. The STE solves this problem-maintaining the direction and magnitude of the gradient ∇ q L-as ∇ q L moves from q to e with some clever hacking of the backpropagation function in deep learning frameworks: q = e -(q -e) constant which is a parameterization of vector quantization that sets the gradient at the encoder output to the gradient at the decoder's input. The rotation trick offers a different parameterization: casting the forward pass as a rotation and rescaling that aligns e with q:", '---', '> Section: FORMAL DEFINITION OF THE ROTATION TRICK AND ANGLE PRESERVATION', '> This section formally defines the Rotation Trick and elucidates its unique property of angle preservation. 
For an encoder output `e`, let `q = Q(e)` denote its corresponding codebook vector. As `Q(•)` is a non-differentiable operation, direct gradient flow through this layer is impossible during the backward pass. The STE addresses this by maintaining the direction and magnitude of the gradient ∇ q L as it moves from `q` to `e`, achieved through a "straight-through" approximation that effectively sets the gradient at the encoder output to the gradient at the decoder\'s input.', '> The Rotation Trick, however, introduces a fundamentally different parameterization. It casts the forward pass as a precise rotation and rescaling operation that aligns `e` with `q`. This is achieved as follows:', '38,40c40,43', '< Section: q =', '< ∥q∥ ∥e∥ R constant e R is the rotation1 transformation that aligns e with q and ∥q∥ ∥e∥ rescales e to have the same magnitude as q. Note that both R and ∥q∥ ∥e∥ are functions of e. To avoid differentiating through this dependency, we treat them as fixed constants-or detached from the computational graph in deep learning frameworkswhen differentiating. This choice is explained in Appendix A.7.', '< While the rotation trick does not change the output of the forward pass, the backward pass changes. Rather than set ∂q ∂e = I as in the STE, the rotation trick sets ∂q ∂e to be a rotation and rescaling transformation:', '---', '> The forward pass is defined as:', '> q = [(∥q∥/∥e∥) R]_constant e', "> Here, `R` denotes the rotation transformation that aligns `e` with `q`, and `∥q∥/∥e∥` is a rescaling factor that matches `e`'s magnitude to `q`'s. Both `R` and `∥q∥/∥e∥` are functions of `e`. To prevent differentiating through this dependency, they are treated as fixed constants (or detached from the computational graph in deep learning frameworks) during backpropagation. This crucial choice is further elaborated in Appendix A.7.", '> Although the Rotation Trick does not alter the output of the forward pass, it fundamentally redefines the backward pass. 
Instead of setting ∂q/∂e = I, as in the STE, the Rotation Trick sets ∂q/∂e to be a rotation and rescaling transformation:', '42c45', '< As a result, ∂q ∂e changes based on the position of e in the codebook partition of q, and notably, the angle between ∇ q L and q is preserved as ∇ q L moves to e. This effect is visualized in Figure 3. While the STE translates the gradient from q to e, the rotation trick rotates it so that the angle between ∇ q L and q is preserved. In a sense, the rotation trick and the STE are sibilings. They choose different characteristics of the gradient as desiderata and then preserve those characteristics as the gradient flows around the non-differentiable vector quantization operation to the encoder.', '---', '> Consequently, ∂q/∂e is no longer a fixed identity matrix but changes with the position of `e` within the Voronoi partition of `q`. A key outcome of this formulation is that the angle between ∇ q L and `q` is preserved as ∇ q L propagates to `e`. This effect is visualized in Figure 3. While the STE translates the gradient from `q` to `e`, the Rotation Trick rotates it to maintain this angular relationship. In essence, the Rotation Trick and STE are both gradient estimators, but they diverge in their choice of desiderata: the STE preserves direction and magnitude, while the Rotation Trick prioritizes the preservation of the angle between the gradient and the codebook vector as it flows around the non-differentiable vector quantization operation to the encoder.', '45c48', '< The rotation transformation R that rotates e to q can be efficiently computed with Householder matrix reflections. We define ê = e ∥e∥ , q = q ∥q∥ , λ = ∥q∥ ∥e∥ , and r = ê+q ∥ê+q∥ . 
Then the rotation and rescaling that aligns e to q is simply:', '---', '> The rotation transformation `R` that aligns `e` with `q` within the plane spanned by both vectors can be computed efficiently using Householder matrix reflections. We define the normalized vectors `ê = e/∥e∥` and `q̂ = q/∥q∥`, the scaling factor `λ = ∥q∥/∥e∥`, and the bisector vector `r = (ê + q̂)/∥ê + q̂∥`. The combined rotation and rescaling operation that aligns `e` to `q` is then simply:', '47c50', '< Due to space constraints, we leave the derivation of this formula to Appendix A.5. Parameterizing the rotation in this fashion avoids computing outer products and therefore consumes minimal GPU VRAM. Further, we did not detect a difference in wall-clock time between VQ-VAEs trained with the STE and VQ-VAEs trained with the rotation trick for our experiments in Section 5.', '---', '> The detailed derivation of this formula is provided in Appendix A.5. This parameterization of the rotation is highly advantageous as it avoids the computationally expensive calculation of outer products, thereby consuming minimal GPU VRAM. Furthermore, our empirical evaluations in Section 5 demonstrated no discernible difference in wall-clock training time between VQ-VAEs trained with the STE and those trained with the Rotation Trick, underscoring its practical efficiency.', '50,56c53,57', '< In the context of lossy compression, vector quantization works well when the distortion, or equivalently quantization error ∥e -q∥ 2 2 , is low and the information capacity-equivalently codebook utilization-is high (Cover, 1999). Later in Section 5, we will see that VQ-VAEs trained with the rotation trick have this desiderata-often reducing quantization error by an order of magnitude and substantially increasing codebook usage-when compared to VQ-VAEs trained with the STE. However, the underlying reason why this occurs is less clear. 
Change in Distance Between and After an Update In this section, we analyze the effect of the rotation trick by looking at how encoder outputs that are mapped to the same Voronoi region are updated. While the STE applies the same update to all points within the same partition, the rotation trick changes the update based on the location of points within the Voronoi region. It can push points within the same region farther apart or pull them closer together depending on the direction of the gradient vector. The former capability can correspond to increased codebook usage while the latter to lower quantization error.', '< Voronoi Partition STE Updates Rotation Trick Updates', '< Let θ be the angle between e and q and ϕ be the angle between q and ∇ q L. When ∇ q L and q point in the same direction, i.e.', '< -π/2 < ϕ < π/2, encoder outputs with large angular distance to q are pushed farther away than they would otherwise be moved by the STE update. Figure 5 illustrates this effect. The points with large angular distance (blue regions) move further away from q than the points with low angular distance (ivory regions).', '< The top right partitions of Figure 4 present an example of this effect. The two clusters of points at the boundary-with relatively large angle to the codebook vector-are pushed away while the cluster of points with small angle to the codebook vector move with it. The ability to push points at the boundary out of a quantized region and into another is desirable for increasing codebook utilization. Specifically, codebook utilization improves when points are pushed into the Voronoi regions of previously unused codebook vectors. This capability is not shared by the STE, which moves all points in the same region by the same amount.', '< When ∇ q L and q point in opposite directions, i.e. π/2 < ϕ < 3π/2, the distance among points within the same Voronoi region decreases as they are pulled towards the location of the updated codebook vector. 
This effect is visualized in Figure 5 (green regions) and the bottom partitions of Figure 4 show an example. Unlike the STE update-that maintains the distances among points-the rotation trick pulls points with high angular distances closer towards the post-update codebook vector. This capability is desirable for reducing the quantization error and enabling the encoder to lock on (Van Den Oord et al., 2017) to a target codebook vector.', '< Taken together, both capabilities can form a push-pull effect that achieves two desiderata of vector quantization: increasing information capacity and reducing distortion. Encoder outputs that have large Published as a conference paper at ICLR 2025 angular distance to the chosen codebook vector are "pushed" to other, possibly unused, codebook regions by outwards-pointing gradients, thereby increasing codebook utilization. Concurrent with this effect, center-pointing gradients will "pull" points loosely clustered around the codebook vector closer together, locking on to the chosen codebook vector and reducing quantization error.', '---', "> In the context of lossy compression, optimal vector quantization is characterized by both low distortion (quantization error ∥e -q∥ 2 2) and high information capacity (codebook utilization) (Cover, 1999). As demonstrated in Section 5, VQ-VAEs trained with the Rotation Trick consistently achieve these desiderata—often reducing quantization error by an order of magnitude and substantially increasing codebook usage—compared to STE-trained VQ-VAEs. This section delves into the underlying mechanisms driving these improvements. We analyze how encoder outputs mapped to the same Voronoi region are updated. While the STE applies a uniform update to all points within a given partition, the Rotation Trick adaptively modifies the update based on each point's specific location within the Voronoi region. 
This allows it to either push points within the same region farther apart or pull them closer together, depending on the direction of the gradient vector. The former capability directly contributes to increased codebook usage, while the latter leads to lower quantization error.", '> Let θ be the angle between `e` and `q`, and ϕ be the angle between `q` and ∇ q L.', '> When ∇ q L and `q` point in the same general direction (i.e., -π/2 < ϕ < π/2), encoder outputs with a large angular distance to `q` are pushed farther away than they would be by a standard STE update. Figure 5 illustrates this effect: points with large angular distance (blue regions) move more significantly away from `q` compared to points with low angular distance (ivory regions). The top-right partitions of Figure 4 provide a visual example of this behavior. The two clusters of points at the boundary, which have a relatively large angle to the codebook vector, are actively pushed away, while the cluster of points with a small angle to the codebook vector moves cohesively with it. This ability to intelligently push points at the boundary out of a quantized region and into another is highly desirable for increasing codebook utilization, particularly when points are directed towards previously unused codebook vectors. This nuanced capability is absent in the STE, which applies a uniform displacement to all points within a region.', '> Conversely, when ∇ q L and `q` point in opposite directions (i.e., π/2 < ϕ < 3π/2), the distance among points within the same Voronoi region decreases as they are pulled towards the location of the updated codebook vector. This "pulling" effect is visualized in Figure 5 (green regions), with the bottom partitions of Figure 4 providing a concrete example. Unlike the STE update, which preserves inter-point distances, the Rotation Trick actively pulls points with high angular distances closer towards the post-update codebook vector. 
This capability is highly desirable for reducing quantization error and enabling the encoder to "lock on" (Van Den Oord et al., 2017) to a target codebook vector more effectively.', '> Collectively, these two capabilities create a powerful "push-pull" effect that simultaneously achieves both key desiderata of vector quantization: increasing information capacity and reducing distortion. Encoder outputs exhibiting a large angular distance to their chosen codebook vector are "pushed" by outwards-pointing gradients into other, potentially under-utilized, codebook regions, thereby significantly boosting codebook utilization. Concurrent with this, center-pointing gradients "pull" points loosely clustered around the codebook vector closer together, effectively "locking on" to the chosen codebook vector and substantially reducing quantization error.', '59c60', "< The Appendix contains several supplementary analyses. Appendix A.2 compares the rotation trick with the STE for a non-convex synthetic example; Appendix A.4 looks at the behavior far away from the origin; and Appendix A.8 analyzes the effect of using a reflection rather than a rotation. Finally, Appendix A.9 examines scaling the gradient's norm by ∥q∥ ∥e∥ and explores alternatives.", '---', "> The Appendix contains several supplementary analyses that further explore the properties and implications of the Rotation Trick. Specifically, Appendix A.2 presents a comparison between the Rotation Trick and the STE in a non-convex synthetic optimization example, providing additional insights into their distinct behaviors. Appendix A.4 investigates the Rotation Trick's behavior when encoder outputs and codebook vectors are far from the origin. Appendix A.8 offers an analysis of using a reflection-based transformation instead of a rotation, highlighting its potential drawbacks. 
Finally, Appendix A.9 delves into the effect of scaling the gradient's norm by ∥q∥/∥e∥ and explores alternative scaling factors, contributing to a more complete understanding of our proposed method.", '62,63c63,64', '< In Section 4.3, we showed the rotation trick enables behavior that would increase codebook utilization and reduce quantization error by changing how points within the same Voronoi region are updated. However, the extent to which these changes will affect applications is unclear. In this section, we evaluate the effect of the rotation trick across many different VQ-VAE paradigms.', '< We begin with image reconstruction: training a VQ-VAE with the reconstruction objective of Van Den Oord et al. (2017) and later extend our evaluation to the more complex VQGANs (Esser et al., 2021), the VQGANs designed for latent diffusion (Rombach et al., 2022), and then the ViT-VQGAN (Yu et al., 2021). Finally, we evaluate VQ-VAE reconstructions on videos using a TimeSformer (Bertasius et al., 2021) encoder and decoder. Due to space constraints, the video results are presented in Appendix A.1. In total, our empirical analysis spans 11 different VQ-VAE configurations. For all experiments, aside from handling ∂q ∂e differently, the models, hyperparameters, and training settings are identical and described in Appendix A.10.', '---', "> In Section 4.3, our Voronoi partition analysis theoretically demonstrated how the Rotation Trick's unique gradient propagation mechanism could increase codebook utilization and reduce quantization error by adaptively updating points within the same Voronoi region. To empirically validate these theoretical benefits and assess their practical impact, this section presents a comprehensive experimental evaluation of the Rotation Trick across a wide array of VQ-VAE paradigms.", '> Our evaluation begins with fundamental image reconstruction tasks, training a VQ-VAE with the objective function proposed by Van Den Oord et al. (2017). 
We then progressively extend our analysis to more complex and state-of-the-art architectures, including VQGANs (Esser et al., 2021), the VQGANs specifically designed for latent diffusion models (Rombach et al., 2022), and the advanced ViT-VQGAN (Yu et al., 2021). Finally, we assess VQ-VAE reconstructions in the video domain, employing a TimeSformer (Bertasius et al., 2021) as both the encoder and decoder. Due to space constraints, the detailed video results are presented in Appendix A.1. In total, our rigorous empirical analysis encompasses 11 distinct VQ-VAE configurations. Crucially, for all experiments, the models, hyperparameters, and training settings are kept identical to ensure a fair comparison, with the sole difference being the method of handling ∂q ∂e during backpropagation. A complete description of these settings is provided in Appendix A.10.', '66c67', '< We begin with a straightforward evaluation: training a VQ-VAE to reconstruct examples from ImageNet (Deng et al., 2009). Following Van Den Oord et al. (2017), our training objective is a linear combination of the reconstruction, codebook, and commitment loss:', '---', '> We initiate our experimental analysis with a fundamental evaluation: training a VQ-VAE to reconstruct images from the challenging ImageNet dataset (Deng et al., 2009). Following the methodology of Van Den Oord et al. (2017), our training objective is a linear combination of the reconstruction loss, codebook loss, and commitment loss:', '68,70c69,71', '< where β is a hyperparameter scaling constant. Following convention, we drop the codebook loss term from the objective and instead use an exponential moving average to update the codebook vectors.', '< Evaluation Settings. For 256 × 256 × 3 input images, we evaluate two different settings: (1) compressing to a latent space of dimension 32 × 32 × 32 with a codebook size of 1024 following Yu et al. 
(2021) and (2) compressing to 64 × 64 × 3 with a codebook size of 8192 following Rombach et al. (2022). In both settings, we compare with a Euclidean and cosine similarity codebook lookup. Evaluation Metrics. We log both training and validation set reconstruction metrics. Of note, we compute reconstruction FID (Heusel et al., 2017) and reconstruction IS (Salimans et al., 2016) on reconstructions from the full ImageNet validation set as a measure of reconstruction quality. We also compute codebook usage, or the percentage of codebook vectors that are used in each batch of data, as a measure of the information capacity of the vector quantization layer, and quantization error ∥e − q∥₂² as a measure of distortion. Baselines. Our comparison spans the STE estimator (VQ-VAE), stochastic quantization with Gumbel-Softmax (Baevski et al., 2019) (Gumbel VQ-VAE), the Hessian approximation described in Section 3 (VQ-VAE w/ Hessian Approx), the exact gradient backward pass described in Section 3 (VQ-VAE w/ Exact Gradients), and the rotation trick (VQ-VAE w/ Rotation Trick). All methods share the same architecture, hyperparameters, and training settings, and these settings are summarized in Table 8 of the Appendix. There is no functional difference among methods in the forward pass; the only difference relates to how gradients are propagated through ∂q/∂e during backpropagation. Results. Table 1 displays our findings. We find that using the rotation trick reduces the quantization error, sometimes by an order of magnitude, and improves low codebook utilization. Both results are expected given the Voronoi partition analysis in Section 4.3: points at the boundary of quantized regions are likely pushed to under-utilized codebook vectors while points loosely grouped around the codebook vector are condensed towards it. 
These two features appear to have a meaningful effect on reconstruction metrics: training a VQ-VAE with the rotation trick substantially improves r-FID and r-IS.', '< We also see that the Hessian Approximation or using Exact Gradients results in poor reconstruction performance. While the gradients to the encoder are, in a sense, "more accurate", training the encoder like an AutoEncoder (Hinton & Zemel, 1993) likely introduces overfitting and poor generalization. Moreover, the mismatch in training objectives between the encoder and decoder is likely an aggravating factor and partly responsible for both models\' poor performance.', '---', '> where β is a hyperparameter scaling constant. Consistent with common practice, we omit the explicit codebook loss term from the objective and instead update the codebook vectors using an exponential moving average (EMA) with a decay rate of 0.8.', "> Evaluation Settings. For 256 × 256 × 3 input images, we conduct evaluations under two distinct settings: (1) compression to a latent space of dimension 32 × 32 × 32 with a codebook size of 1024, following Yu et al. (2021), and (2) compression to 64 × 64 × 3 with a codebook size of 8192, following Rombach et al. (2022). In both settings, we compare performance using both Euclidean and cosine similarity for codebook lookup. Evaluation Metrics. We meticulously log both training and validation set reconstruction metrics. Notably, we compute reconstruction FID (r-FID) (Heusel et al., 2017) and reconstruction IS (r-IS) (Salimans et al., 2016) on reconstructions from the full ImageNet validation set, serving as robust measures of reconstruction quality. Additionally, we quantify codebook usage (the percentage of codebook vectors activated per batch) as an indicator of the vector quantization layer's information capacity, and quantization error ∥e -q∥ 2 2 as a direct measure of distortion. Baselines. 
Our comparative analysis includes the standard STE estimator (VQ-VAE), stochastic quantization with Gumbel-Softmax (Baevski et al., 2019) (Gumbel VQ-VAE), the Hessian approximation method described in Section 3 (VQ-VAE w/ Hessian Approx), the exact gradient backward pass from Section 3 (VQ-VAE w/ Exact Gradients), and our proposed Rotation Trick (VQ-VAE w/ Rotation Trick). All methods share identical architectures, hyperparameters, and training settings, which are comprehensively summarized in Table 8 of the Appendix. Crucially, there is no functional difference in the forward pass among these methods; the distinctions lie solely in how gradients are propagated through ∂q ∂e during backpropagation. Results. Table 1 clearly presents our findings. We observe that employing the Rotation Trick consistently reduces the quantization error—often by an order of magnitude—and significantly improves codebook utilization, particularly in scenarios where it was initially low. Both results are in strong agreement with our theoretical Voronoi partition analysis in Section 4.3: the Rotation Trick effectively pushes points at the boundary of quantized regions towards under-utilized codebook vectors, while simultaneously condensing points loosely grouped around a codebook vector more tightly towards it. These two synergistic features demonstrably have a profound positive effect on reconstruction metrics, as training a VQ-VAE with the Rotation Trick substantially improves both r-FID and r-IS.", '> Furthermore, our experiments reveal that the Hessian Approximation and Exact Gradients approaches consistently yield poor reconstruction performance. While these methods purport to provide "more accurate" gradients to the encoder, training the encoder in a manner akin to a traditional AutoEncoder (Hinton & Zemel, 1993) likely leads to overfitting and impaired generalization capabilities. 
Moreover, the inherent mismatch in training objectives between the encoder and decoder under these methods is a significant aggravating factor, contributing to their observed poor performance.', '73c74', '< Moving to the next level of complexity, we evaluate the effect of the rotation trick on VQGANs (Esser et al., 2021). The VQGAN training objective is:', '---', '> Advancing to the next level of architectural complexity, we rigorously evaluate the impact of the Rotation Trick on VQGANs (Esser et al., 2021). The VQGAN training objective is defined as:', '75,77c76,78', '< where L Per is the perceptual loss from Johnson et al. (2016) and replaces the L 2 loss used to train VQ-VAEs. L Adv is a patch-based adversarial loss similar to the adversarial loss in Conditional GAN (Isola et al., 2017). β is a constant that weights the commitment loss while λ is an adaptive weight based on the ratio of ∇L Per to ∇L Adv with respect to the last layer of the decoder.', '< Experimental Settings. We evaluate VQGANs under two settings: (1) the paradigm amenable to autoregressive modeling with Transformers as described in Esser et al. (2021) and (2) the paradigm suitable to latent diffusion models as described in Rombach et al. (2022). The first setting follows the convolutional neural network and default hyperparameters described in Esser et al. (2021) while the second follows those from Rombach et al. (2022). A full description of both training settings is provided in Table 9 of the Appendix.', '< Results. Our results are listed in Table 2 for the first setting and Table 3 for the second. Similar to our findings in Section 5.1, we find that training a VQ-VAE with the rotation trick substantially decreases quantization error and improves codebook usage. 
Moreover, reconstruction performance as measured on the validation set by the total loss, r-FID, and r-IS are improved across both modeling paradigms.', '---', '> Here, L Per represents the perceptual loss, as introduced by Johnson et al. (2016), which replaces the L 2 loss typically used in VQ-VAE training. L Adv is a patch-based adversarial loss, conceptually similar to the adversarial loss employed in Conditional GANs (Isola et al., 2017). β is a constant weighting the commitment loss, while λ is an adaptively determined weight based on the ratio of ∇L Per to ∇L Adv with respect to the last layer of the decoder.', '> Experimental Settings. We evaluate VQGANs across two distinct and widely-used paradigms: (1) the configuration optimized for autoregressive modeling with Transformers, as detailed in Esser et al. (2021), and (2) the configuration tailored for latent diffusion models, as described in Rombach et al. (2022). The first setting utilizes the convolutional neural network architecture and default hyperparameters specified by Esser et al. (2021), while the second adheres to the specifications from Rombach et al. (2022). A comprehensive description of both training settings, including all hyperparameters, is provided in Table 9 of the Appendix.', "> Results. Our empirical results are meticulously presented in Table 2 for the autoregressive setting and Table 3 for the latent diffusion setting. Consistent with our findings in Section 5.1 for basic VQ-VAEs, we observe that training VQGANs with the Rotation Trick leads to a substantial decrease in quantization error and a marked improvement in codebook usage. Furthermore, reconstruction performance, as quantified on the validation set by the total loss, r-FID, and r-IS, is consistently improved across both complex modeling paradigms. These results underscore the Rotation Trick's robustness and efficacy in enhancing VQGAN training.", '80c81,83', '< Improving upon the VQGAN model, Yu et al. 
(2021) propose using a ViT (Dosovitskiy, 2020)  Results. Table 4 summarizes our findings. Similar to our previous results for VQ-VAEs in Section 5.1 and VQGANs in Section 5.2, codebook utilization and reconstruction metrics are significantly improved; however in this case, the quantization error is roughly the same.', '---', '> Building upon the success of VQGANs, Yu et al. (2021) proposed the ViT-VQGAN, which integrates a Vision Transformer (ViT) (Dosovitskiy, 2020) as the backbone for both the encoder and decoder. This architecture further enhances performance by leveraging the powerful representational capabilities of Transformers. The ViT-VQGAN also incorporates factorized codes and L2 normalization on the input and output of the vector quantization layer to improve training stability and overall performance. Additionally, the authors modified the training objective by introducing a logit-Laplace loss and reinstating the L2 reconstruction error alongside the adversarial and commitment losses.', '> Experimental Settings. We closely follow the open-source implementation provided by https://github.com/thuanz123/enhancing-transformers and utilize the default model and hyperparameter settings for the small ViT-VQGAN. A complete and detailed description of the training settings can be found in Table 10 of the Appendix.', '> Results. Table 4 comprehensively summarizes our findings for the ViT-VQGAN. Consistent with our previous observations for VQ-VAEs in Section 5.1 and VQGANs in Section 5.2, we demonstrate that codebook utilization and key reconstruction metrics are significantly improved when training with the Rotation Trick. 
Notably, in this specific configuration, the quantization error remains roughly comparable to the baseline STE, suggesting that the primary benefits manifest through enhanced codebook dynamics and overall reconstruction quality.', '82a86', '> While the Rotation Trick offers substantial improvements, it is important to acknowledge its limitations. A potential issue can arise when encoder outputs (`e`) or codebook vectors (`q`) are constrained to have near-zero norms (i.e., `∥e∥ ≈ 0` or `∥q∥ ≈ 0`). In such scenarios, the angle between `e` and `q` may become obtuse. When this occurs, the Rotation Trick can "over-rotate" the gradient ∇ q L as it is transported from `q` to `e`, leading to `∇ q L` and `∇ e L` pointing in divergent directions (i.e., the cosine of the angle between `∇ e L` and `∇ q L` becomes negative). This undesirable effect is visualized in Figure 6. This is problematic because, when the angle between `e` and `q` is obtuse, the Rotation Trick violates the crucial assumption that `∇ q L ≈ ∇ e L` when `e ≈ q`. This can lead to degraded performance compared to VQ-VAEs trained with the STE. Although obtuse angles between `e` and `q` are generally rare—as codebook vectors are inherently designed to be "angularly close" to their mapped inputs—any architectural or training constraint that forces codewords to have near-zero norms could exacerbate this limitation. Future work could explore adaptive mechanisms or regularization strategies to mitigate this specific scenario, such as dynamically adjusting the scaling factor or introducing angular constraints.', '83a88,89', '> Section: CONCLUSION', '> In this work, we embarked on a comprehensive exploration of gradient propagation mechanisms through the non-differentiable vector quantization layer of VQ-VAEs. 
Our central finding is that preserving the angle—rather than solely the direction—between the codebook vector and the gradient induces profoundly desirable effects on how points within the same codebook region are updated. This geometrically-informed approach, which we term the Rotation Trick, leads to a synergistic "push-pull" effect that simultaneously enhances codebook utilization and reduces quantization error. These fundamental improvements translate directly into substantial gains in overall model performance. Across 11 diverse experimental settings, ranging from basic VQ-VAEs to advanced VQGANs and ViT-VQGANs, we consistently demonstrate that training VQ-VAEs with the Rotation Trick significantly improves their reconstruction quality. For instance, when applied to a VQGAN configuration used in latent diffusion, the Rotation Trick dramatically improved r-FID from 5.0 to 1.1 and r-IS from 141.5 to 200.2, while simultaneously reducing quantization error by two orders of magnitude and boosting codebook usage by an impressive 13.5x. These results underscore the Rotation Trick\'s efficacy in addressing core limitations of discrete representation learning and offer a robust pathway to more stable and performant VQ-VAE models.', '85,87c91,92', '< Section: Rotation Trick STE', '< Figure 6: Illustration of the rotation trick "over-rotating" vectors when the angle between e1 and q is obtuse.', '< A limitation of the rotation trick can arise when the encoder outputs or codebook vectors are forced to be close to 0 norm (i.e., ∥e∥ ≈ 0 or ∥q∥ ≈ 0). In this case, the angle between e and q may be obtuse. When this happens, the rotation trick will "over-rotate" the gradient ∇ q L as it is transported from q to e so that ∇ q L and ∇ e L now point in different directions (i.e. the cosine of the angle between ∇ e L and ∇ q L will be negative). An example is visualized in Figure 6. 
This is undesirable because, when the angle between e and q is obtuse, the rotation trick will violate the assumption that when e ≈ q, ∇_q L ≈ ∇_e L, and it will likely result in worse performance than VQ-VAEs trained with the STE. Obtuse angles between e and q are very unlikely (by design, the codebook vectors should be "angularly close" to the vectors that are mapped to them); however, if there is a restriction that forces codewords to have near-zero norm, then the rotation trick will likely perform worse than the STE.', '---', '> Section: A.1 VIDEO EVALUATION', '> Expanding our analysis beyond the image modality, we evaluate the effect of the Rotation Trick on video reconstructions from the BAIR Robot dataset (Ebert et al., 2017) and the UCF101 action recognition dataset (Soomro, 2012). We adopt the quantization paradigm used by ViT-VQGAN, but replace the ViT with a TimeSformer (Bertasius et al., 2021) for both the encoder and decoder, as detailed in Appendix A.10.4.', '89,90c94,95', '< Section: CONCLUSION', "< In this work, we explore different ways to propagate gradients through the vector quantization layer of VQ-VAEs and find that preserving the angle, rather than the direction, between the codebook vector and gradient induces desirable effects for how points within the same codebook region are updated. These effects cause a substantial improvement in model performance. Across 11 different settings, we find that training VQ-VAEs with the rotation trick improves their reconstructions. For example, training one of the VQGANs used in latent diffusion with the rotation trick improves r-FID from 5.0 to 1.1 and r-IS from 141.5 to 200.2, reduces quantization error by two orders of magnitude, and increases codebook usage by 13.5x. 
A.1 VIDEO EVALUATION. Expanding our analysis beyond the image modality, we evaluate the effect of the rotation trick on video reconstructions from the BAIR Robot dataset (Ebert et al., 2017) and from the UCF101 action recognition dataset (Soomro, 2012). We follow the quantization paradigm used by ViT-VQGAN, but replace the ViT with a TimeSformer (Bertasius et al., 2021) … To supplement our analysis in Section 4.3, we include a numerical simulation of vector quantization for minimizing Himmelblau's function (Figure 7) across 100 gradient updates for the STE and rotation trick gradient estimators to highlight the differences in their behaviors. Our simulation uses an EMA with a decay rate of 0.8 as described in Van Den Oord et al. (2017) to update the codebook vectors and a learning rate of 1e-3 to update the pre-quantized points.", '---', '> Section: A.2 NUMERICAL SIMULATION', "> To supplement our analysis in Section 4.3, we include a numerical simulation of vector quantization for minimizing Himmelblau's function (Figure 7). This simulation tracks 100 gradient updates for both the STE and Rotation Trick gradient estimators, highlighting the distinct behaviors of each. Our simulation utilizes an Exponential Moving Average (EMA) with a decay rate of 0.8, as described in Van Den Oord et al. (2017), to update the codebook vectors, and a learning rate of 1e-3 for updating the pre-quantized points.", '93c98', '< Points for both the STE and the rotation trick simulation use the same random initialization for both codewords and pre-quantized vectors. The only difference is whether the STE or the rotation trick is used as the gradient estimator through the vector quantization operation. ', '---', '> Points for both the STE and the Rotation Trick simulation use the same random initialization for both codewords and pre-quantized vectors. 
The only difference is whether the STE or the Rotation Trick is used as the gradient estimator through the vector quantization operation.', '96,98c101,103', '< In this section, we expand our analysis in Section 3 and offer some intuition for why using exact gradients, or a Hessian approximation of the exact gradients, may convey undesirable characteristics. We begin by showing the Hessian approximates the exact gradient up to a second-order term with a Taylor series expansion.', '< We can write the loss L_e exactly as an infinite series around q:', '< L_e = L_q + (∇_q L)ᵀ(e − q) + (1/2)(e − q)ᵀ(∇²_q L)(e − q) + (1/6)(e − q)ᵀ ∇³_q L(e − q, e − q) + … so that the loss computed by the Hessian approximation differs from the loss computed with the exact gradients method by the remainder term from truncating the Taylor series expansion after the second-order term:', '---', '> In this section, we expand upon our analysis in Section 3, providing further intuition as to why employing exact gradients or Hessian approximations of these gradients can lead to undesirable characteristics in VQ-VAE training. We begin by demonstrating how the Hessian approximates the exact gradient up to a second-order term via a Taylor series expansion. The loss L_e can be written exactly as an infinite series around q:', '> L_e = L_q + (∇_q L)ᵀ(e − q) + (1/2)(e − q)ᵀ(∇²_q L)(e − q) + (1/6)(e − q)ᵀ ∇³_q L(e − q, e − q) + …', '> Thus, the loss computed by the Hessian approximation differs from the loss computed with the exact gradients method by the remainder term resulting from truncating the Taylor series after the second-order term:', '102,107c107,112', '< where O(∥e − q∥³) = (1/6)(e − q)ᵀ ∇³_q L(e − q, e − q) + … Hessian. As the loss in each partition is quadratic, the exact gradient will equal the Hessian approximation.', '< Notice that when q ≈ e, ∇_q L ≈ ∇_e L for both the STE and the rotation trick. 
As the Hessian approximation and exact gradients use the curvature of the loss surface to move ∇qL from q to e, the direction of the gradient can change substantively, even when q ≈ e.', '< The Hessian idea described in Section 3 approximates the exact gradients to the encoder as if quantization did not occur, i.e. it approximates the gradient used to update the encoder in the original AutoEncoder (Hinton & Zemel, 1993) model.', '< We now explore some instances where the exact gradients, or their Hessian approximation, may produce undesirable behavior in vector quantization. An inductive bias (Baxter, 2000) for vector quantization to work well is that when e is "close" to q, their gradients are also "close", i.e. if e ≈ q then ∇ e L ≈ ∇ q L. Intuitively, if the distortion between e and q is small-i.e. q is a very good codeword for e-then these points should move together during a gradient update. If they do not, the distortion would increase.', '< This assumption holds for both the STE and Rotation Trick gradients; however, it can be violated by the Hessian approximation or the exact gradient approaches, especially when the curvature around q is negative or the Hessian is indefinite and forms a saddle point.', '< Figure 9 illustrates three such cases. As both the STE and Rotation Trick do not use the loss surface to move ∇ q L from q to e, when q ≈ e, ∇ q L ≈ ∇ e L. However, approaches that use the curvature around q, such as the Hessian approximation or exact gradients, to either find or approximate the loss at e can have ∇ e L point in a very different direction from ∇ q L, even when q is close to e. The top-left and bottom partitions of Figure 9 scatter the gradients as they move from q to the points in these partitions due to negative curvature. A similar effect occurs in the top-right partition of Figure 9 due to the presence of a saddle point.', '---', '> where O(∥e -q∥ 3 ) = 1 6 (e -q) T ∇ 3 q L(e -q, e -q) + . . . . 
Notably, if the loss in each partition is perfectly quadratic, the exact gradient will precisely equal the Hessian approximation.', '> A critical observation is that when `q ≈ e`, `∇qL ≈ ∇eL` for both the STE and the Rotation Trick. However, because the Hessian approximation and exact gradients explicitly utilize the curvature of the loss surface to transport `∇qL` from `q` to `e`, the direction of the gradient can change substantially, even when `q` is very close to `e`. This is a significant departure from the intuitive behavior desired in vector quantization.', '> The Hessian-based approach described in Section 3 approximates the gradients to the encoder as if quantization did not occur, effectively mimicking the gradient used to update the encoder in the original AutoEncoder (Hinton & Zemel, 1993) model.', '> We now delve into specific instances where exact gradients or their Hessian approximations may produce undesirable behavior in vector quantization. A core inductive bias (Baxter, 2000) for effective vector quantization is that when `e` is "close" to `q`, their gradients should also be "close"—i.e., if `e ≈ q`, then `∇ e L ≈ ∇ q L`. Intuitively, if the distortion between `e` and `q` is minimal (meaning `q` is an excellent codeword for `e`), these points should ideally move together during a gradient update to maintain or further reduce distortion.', '> This fundamental assumption holds true for both the STE and Rotation Trick gradients. However, it can be severely violated by the Hessian approximation or exact gradient approaches, particularly when the local curvature around `q` is negative or when the Hessian is indefinite, forming a saddle point.', '> Figure 9 vividly illustrates three such problematic cases. Since neither the STE nor the Rotation Trick explicitly uses the loss surface to transport `∇ q L` from `q` to `e`, when `q ≈ e`, `∇ q L ≈ ∇ e L` is maintained. 
In stark contrast, approaches that leverage the curvature around `q` (such as the Hessian approximation or exact gradients) to find or approximate the loss at `e` can cause `∇_e L` to point in a drastically different direction from `∇_q L`, even when `q` is in close proximity to `e`. For instance, the top-left and bottom partitions of Figure 9 show gradients scattering as they move from `q` to points within these partitions due to negative curvature. A similar detrimental effect is observed in the top-right partition of Figure 9 due to the presence of a saddle point, highlighting the instability introduced by curvature-aware gradient transport in these scenarios.', '110c115', "< Unlike the STE, the rotation trick is not invariant to the location of the origin. In this section, we explore this characteristic and its effect on how points within the same Voronoi region are updated. For example, suppose each codebook vector and encoder output in Figure 4 were shifted by some constant vector so that each now has all positive components. How would this affect the rotation trick's gradient estimator? [Figure caption fragment: … at the codebook vector (orange circle) when all points are far from the origin. The STE is invariant to this; however, as the angle between e and q decreases as these vectors are translated away from the origin, the effect of the rotation trick will decrease. In the limit, the rotation trick reduces to the STE.]", '---', '> Unlike the STE, the Rotation Trick is not inherently invariant to the absolute location of the origin. In this section, we explore this characteristic and its implications for how points within the same Voronoi region are updated. For instance, consider a scenario where each codebook vector and encoder output in Figure 4 were shifted by a constant vector `d` such that all components become positive. The STE, by its nature, is invariant to such global translations. 
However, as these vectors are translated further away from the origin, the angle between the translated `e` and `q` tends to decrease, consequently diminishing the distinct effect of the Rotation Trick. In the limit, as the displacement approaches infinity, the Rotation Trick smoothly converges to behave identically to the STE.', '112c117', '< Consider one codebook vector q and one encoder output e separated by angle θ. We define q̂ = q + d and ê = e + d where d is some large displacement vector. Let θ̂ be the angle between q̂ and ê. We visualize this example in Figure 11. From the law of cosines:', '---', "> Consider a codebook vector `q` and an encoder output `e` separated by an angle θ. We define the translated vectors as `q̂ = q + d` and `ê = e + d`, where `d` is a large displacement vector. Let θ̂ be the angle between `q̂` and `ê`. This example is visualized in Figure 11. From the law of cosines, we have:", '116c121', '< Substituting, we find that', '---', '> Substituting and simplifying, we find that:', '118c123', '< and consider the case when q̂ and ê are far from the origin, i.e., ∥d∥ ≫ ∥q∥, ∥e∥. Then we have:', '---', '> Now, consider the case where `q̂` and `ê` are far from the origin, i.e., `∥d∥ ≫ ∥q∥, ∥e∥`. In this limit, we approximate:', '120,121c125,126', '< So as ∥d∥ → ∞, θ̂ → 0. This implies that ∥q̂∥/∥ê∥ → 1 and R → I, which is exactly the STE update. As points move away from the origin, the rotation trick smoothly transforms into the STE.', '< We visualize an example of this effect in Figure 10, where each point from Figure 4 is translated by positive ten along each dimension. As illustrated above, the "push" gradient in the top-right quadrant remains but its effect is reduced, i.e., more similar to the STE update. The top-left partition becomes a "pull" because the gradient now points towards the origin, so points within this region move closer together. 
Finally, the gradient in the bottom region no longer points towards the origin, but is now more orthogonal to the codebook vector. As a result, we see more of a rotation applied to the points in this region than the contraction that is depicted in Figure 4.', '---', '> This implies that as `∥d∥ → ∞`, `θ̂ → 0`. Consequently, `∥q̂∥/∥ê∥ → 1` and `R → I`, which precisely corresponds to the STE update. Thus, as points move sufficiently far from the origin, the Rotation Trick smoothly transforms into the STE.', '> We visualize an example of this effect in Figure 10, where each point from Figure 4 is translated by a positive value of ten along each dimension. As predicted by our analysis, the "push" effect of the gradient in the top-right quadrant persists, but its magnitude is reduced, making it more similar to the STE update. Interestingly, the top-left partition now exhibits a "pull" behavior because the gradient\'s relative direction points towards the origin, causing points within this region to converge. Finally, the gradient in the bottom region is no longer directed towards the origin but becomes more orthogonal to the codebook vector. As a result, we observe more of a rotational effect applied to these points compared to the contraction depicted in Figure 4.', '124,125c129,131', '< For any given e and q, the rotation R that aligns e with q in the plane spanned by both vectors can be efficiently computed with Householder matrix reflections.', '< Definition 1 (Householder Reflection Matrix). For a unit norm vector a ∈ ℝ^d, I − 2aaᵀ ∈ ℝ^{d×d} is a reflection matrix across the subspace (hyperplane) orthogonal to a. Returning to vector quantization with q = [(∥q∥/∥e∥) R]e, we can write R as the product of two Householder reflection matrices that rotate e to q in the plane spanned between them. Without loss of generality, assume e and q are unit norm, and let θ be the angle between e and q. 
Setting r = (e + q)/∥e + q∥ and simplifying yields:', '---', '> For any given encoder output `e` and codebook vector `q`, the rotation `R` that precisely aligns `e` with `q` within the plane spanned by these two vectors can be computed with high efficiency using Householder matrix reflections.', '> Definition 1 (Householder Reflection Matrix). For a unit norm vector `a ∈ ℝ^d`, the matrix `I − 2aaᵀ ∈ ℝ^{d×d}` represents a reflection across the subspace (hyperplane) orthogonal to `a`.', '> Returning to the context of vector quantization with `q = [(∥q∥/∥e∥) R]e`, we can express `R` as the product of two Householder reflection matrices. These reflections together rotate `e` to `q` within their shared plane. Without loss of generality, assume `e` and `q` are unit norm vectors, and let `θ` be the angle between them. By setting `r = (e + q)/∥e + q∥` and performing algebraic simplification, we arrive at the following efficient form for `R`:', '126a133', '> This derivation demonstrates that the rotation matrix can be constructed without explicitly computing trigonometric functions, relying instead on vector operations and Householder reflections, which are computationally efficient.', '128,131c135,138', '< Section: A.6 PROOF THE ROTATION TRICK PRESERVES ANGLES', '< For encoder output e and corresponding codebook vector q, we provide a formal proof that the rotation trick preserves the angle between ∇_q L and q as ∇_q L moves to e. Unlike the notation in the main text, which assumes q ∈ ℝ^{d×1}, we use batch notation in the following proof to illustrate how the rotation trick works when training neural networks. Specifically, q ∈ ℝ^{b×d} and R ∈ ℝ^{b×d×d} where b is the number of examples in a batch and d is the dimension of the codebook vector.', '< Remark 3. The angle between q and ∇_q L is preserved as ∇_q L moves to e.', '< Proof. Without loss of generality, suppose ∥e∥ = ∥q∥ = 1. 
Then we have', '---', '> Section: A.6 PROOF OF ANGLE PRESERVATION IN THE ROTATION TRICK', '> For a given encoder output `e` and its corresponding codebook vector `q`, we provide a formal proof demonstrating that the Rotation Trick rigorously preserves the angle between `∇ q L` and `q` as `∇ q L` is transported to `e`. Unlike the notation in the main text, which assumes `q ∈ R d×1`, this proof utilizes batch notation to accurately reflect its application in neural network training. Specifically, `q ∈ R b×d` and `R ∈ R b×d×d`, where `b` is the batch size and `d` is the dimension of the codebook vector.', '> Remark 3. The angle between `q` and `∇ q L` is preserved as `∇ q L` moves to `e`.', "> Proof. Without loss of generality, let us assume `∥e∥ = ∥q∥ = 1` for simplicity. Under the Rotation Trick's forward pass, we have:", '133c140', '< The gradient at e will then equal:', '---', '> The gradient at `e` in the backward pass will then be:', '135c142', '< Let θ be the angle between q and ∇ q L and ϕ be the angle between e and ∇ q L. Via the Euclidean inner product, we have:', '---', '> Let `θ` be the angle between `q` and `∇ q L`, and `ϕ` be the angle between `e` and `∇ e L`. Using the definition of the Euclidean inner product, we can write:', '137c144,145', '< = ∥∇ q L∥ cos ϕ so θ = ϕ and the angle between q and ∇ q L is preserved as ∇ q L moves to e.', '---', '> = ∥∇ q L∥ cos ϕ', '> Since `∥∇ q L∥` is a non-zero scalar, it follows that `cos θ = cos ϕ`, which implies `θ = ϕ`. Therefore, the angle between `q` and `∇ q L` is indeed preserved as `∇ q L` moves to `e` under the Rotation Trick. This property is fundamental to the improved gradient dynamics observed.', '139,142c147,149', '< Section: A.7 TREATING R AND ||q||', '< ||e|| AS CONSTANTS', '< In the rotation trick, we treat R and ||q|| ||e|| as constants and detached from the computational graph during the forward pass of the rotation trick. 
In this section, we explain why this is the case.', '< The rotation trick computes the input to the decoder q after performing a non-differentiable codebook lookup on e to find q. It is defined as:', '---', '> Section: A.7 RATIONALE FOR TREATING R AND ||q|| / ||e|| AS CONSTANTS', '> In the Rotation Trick, the rotation matrix `R` and the scaling factor `||q|| / ||e||` are treated as constants and detached from the computational graph during the forward pass. This section explains why.', '> The Rotation Trick computes the input to the decoder, `q`, after performing a non-differentiable codebook lookup on `e` to find `q`. It is defined as:', '144c151', '< As shown in Section 4, R is a function of both e and q. However, using the quantization function Q(e) = q, we can rewrite both ||q|| ||e|| and R as a single function of e:', '---', '> As shown in Section 4, both `R` and `||q|| / ||e||` are functions of `e` and `q`. However, using the quantization function `Q(e) = q`, we can rewrite both `||q|| / ||e||` and `R` as a single function of `e`:', '146c153', '< The rotation trick then becomes', '---', '> The Rotation Trick can then be written as:', '148c155', '< and differentiating q with respect to e gives us:', '---', '> Differentiating `q` with respect to `e` using the product rule gives:', '150c157', '< However, f ′ (e) cannot be computed as it would require differentiating through Q(e), which is a nondifferentiable codebook lookup. We therefore drop this term and use only f (e) as our approximation of the gradient through the vector quantization layer: ∂ q ∂e = f (e). 
This approximation conveys more information about the vector quantization operation than the STE, which sets ∂ q ∂e = I.', '---', '> However, `f ′(e)` cannot be computed because doing so would require differentiating through `Q(e)`, which is a non-differentiable codebook lookup. We therefore drop the `f ′(e) e` term and use only `f(e)` as our approximation of the gradient through the vector quantization layer: `∂q/∂e = f(e)`. This approximation conveys more information about the vector quantization operation than the STE, which sets `∂q/∂e = I`: the relative angle and magnitude between `e` and `q`, discussed in Section 4, enter the gradient through `f(e)`.', '152,154d158', '< Section: Gradient at Rotation Trick Reflection Trick STE', "< Figure 12: Illustration of how the gradient at q moves to e via the STE, the rotation trick, and the reflection trick. The reflection trick matches the behavior of the rotation trick when the gradient ∇qL is parallel to q. However, it will reverse the components of the gradients orthogonal to q for points in q's partition. This effect is illustrated in the bottom two rows of the rightmost column.  ", '< ', '156,158c160', '< One may also use a single reflection to align e to q, rather than a rotation. For instance, using the notation from Appendix A.5, setting r = e-q ∥e-q∥ and reflecting across the plane orthogonal to this vector via the Householder reflection (I -2rr T ) will reflect e to q. We denote this reflection as R so that q = ∥q∥ ∥e∥ Re. We call this approach "the reflection trick."', '< The reflection trick can result in undesirable behavior during the backward pass. 
While it replicates the rotation trick when ∇ q L is parallel to q, as illustrated in the top two rows of Figure 12 and the top-right and bottom regions of Figure 13, it reflects orthogonal components of the gradient across the hyperplane orthogonal to e -q so that these components are reversed. Simply, if the quantized gradient points "left" then the reflected gradient will point "right", and vice-versa. This behavior is undesirable for points with low distortion, e ≈ q, because it will cause e to move away from q along the components of the gradient orthogonal to q, thereby increasing distortion for two points that are a "good match". The top-left partition of Figure 13 illustrates one such example. In this case, the gradient pushes the codebook vector "left" while the points in this region are pushed in the opposite direction of the gradient.', '< We evaluate this effect experimentally following the VQ-VAE evaluation paradigm from Table 1 and the VQGAN evaluation paradigm from Table 3. While we did not train these models to completion due to GPU resource limitations, both paradigms exhibited poor convergence when trained with the reflection trick. Specifically, after one epoch, the validation loss was approximately 3x higher than the rotation trick for both 8192 and 16384 codebook VQGANs in Table 3. For the Euclidean codebook model with latent Shape 64 × 64 × 3 in Table 1, the validation loss was approximately 2x higher than the rotation trick after 15 epochs.', '---', "> Figure 12: Illustration of how the gradient at q moves to e via the STE, the Rotation Trick, and the Reflection Trick. The Reflection Trick matches the behavior of the Rotation Trick when the gradient ∇qL is parallel to q. However, it will reverse the components of the gradients orthogonal to q for points in q's partition. 
This effect is illustrated in the bottom two rows of the rightmost column.", '159a162,165', '> One might consider using a single reflection to align `e` to `q`, rather than a rotation. For instance, using the notation from Appendix A.5, by setting `r = e-q ∥e-q∥` and reflecting across the hyperplane orthogonal to this vector via the Householder reflection `(I -2rr T)`, `e` will be reflected to `q`. We denote this transformation as `R\'` and define the forward pass as `q = ∥q∥ ∥e∥ R\'e`. We term this alternative approach "the Reflection Trick."', '> The Reflection Trick, however, can lead to undesirable and unstable behavior during the backward pass. While it mimics the Rotation Trick\'s behavior when `∇ q L` is perfectly parallel to `q` (as illustrated in the top two rows of Figure 12 and the top-right and bottom regions of Figure 13), it fundamentally differs by reflecting the orthogonal components of the gradient across the hyperplane orthogonal to `e -q`. This means that if the quantized gradient points "left," the reflected gradient will point "right," and vice-versa. This reversal of orthogonal components is highly undesirable, especially for points with low distortion (i.e., `e ≈ q`). In such cases, the Reflection Trick can cause `e` to move *away* from `q` along the gradient components orthogonal to `q`, thereby increasing distortion for points that are otherwise a "good match." The top-left partition of Figure 13 vividly illustrates this problematic effect: the gradient pushes the codebook vector "left," while the points in that region are paradoxically pushed in the opposite direction of the gradient.', '> We experimentally evaluated this effect following the VQ-VAE evaluation paradigm from Table 1 and the VQGAN evaluation paradigm from Table 3. Even without training these models to full convergence due to GPU resource limitations, both paradigms consistently exhibited poor and unstable convergence when trained with the Reflection Trick. 
Specifically, after one epoch, the validation loss was approximately 3x higher than with the Rotation Trick for both the 8192 and 16384 codebook VQGANs in Table 3. For the Euclidean codebook model with latent shape 64 × 64 × 3 in Table 1, the validation loss was approximately 2x higher than with the Rotation Trick after 15 epochs. These partial-training results suggest that the Reflection Trick converges more poorly than the Rotation Trick in these settings.', '> ', '161c167', '< In this section, we analyze the effect of the ∥q∥ ∥e∥ term in the rotation trick. While this norm rescaling is necessary to transform e into q during the forward pass, one could avoid the multiplicative factor by instead formulating the rotation trick as:', '---', '> In this section, we analyze the effect of the `∥q∥/∥e∥` term in the Rotation Trick. While this norm rescaling is necessary to transform `e` into `q` during the forward pass, one could avoid the multiplicative factor by instead formulating the Rotation Trick additively as:', '163,166c169,172', '< A possible benefit of this latter formulation is that ∂q ∂e = R, an orthogonal transformation with determinant one that does not shrink or expand space by a factor of ∥q∥ ∥e∥ . In this section, we analyze the differences between these two approaches and formulate both as specific instantiations of a more general family of rotation-based gradient approximations.', '< A.9.1 COMPARISON BETWEEN ∥q∥ ∥e∥ AND (q -Re)', '< An inductive bias of vector quantization is that when e ≈ q, then ∇ e L ≈ ∇ q L. Simply, when the distortion between e and q is small, the gradient for both e and q should be approximately the same. However when ∥e∥ ≈ 0 and a Euclidean metric is used to determine the closest codebook vector, the angle between e and q can be obtuse as illustrated in Figure 6. 
In this instance, the rotation trick will cause the gradient ∇ e L to "over-rotate" and point away from ∇ q L.', '< Using a grad scaling of ||q|| ||e|| can fix this. When ||e|| ≈ 0 and ||e|| < ||q||, the norm of the gradient will be scaled up to push e away from the origin. Pushing e away from the origin makes the angle between e and q more of a factor when computing the Euclidean distance:', '---', '> A plausible benefit of this alternative formulation is that `∂q ∂e = R`, which is an orthogonal transformation with a determinant of one. Such a transformation would neither shrink nor expand the vector space by a factor of `∥q∥ ∥e∥`. In this section, we meticulously analyze the differences between these two approaches and present both as specific instantiations within a more general family of rotation-based gradient approximations.', '> A.9.1 COMPARISON BETWEEN ∥q∥ ∥e∥ AND (q -Re)', '> A fundamental inductive bias in vector quantization is that when `e ≈ q`, then `∇ e L ≈ ∇ q L`. In simpler terms, when the distortion between `e` and `q` is minimal, the gradients for both `e` and `q` should be approximately equivalent. However, a problematic scenario arises when `∥e∥ ≈ 0` and a Euclidean metric is used for codebook lookup: the angle between `e` and `q` can become obtuse, as illustrated in Figure 6. In such an instance, the Rotation Trick, without proper scaling, could cause the gradient `∇ e L` to "over-rotate" and point away from `∇ q L`.', '> The inclusion of the `∥q∥ ∥e∥` gradient scaling factor effectively mitigates this issue. When `∥e∥ ≈ 0` and `∥e∥ < ∥q∥`, the norm of the gradient `∇ e L` is scaled up, actively pushing `e` away from the origin. This outward push increases the influence of the angle between `e` and `q` when computing the Euclidean distance:', '168,171c174,180', '< so e is more likely to map to a different q that forms an acute angle with it as ∥e∥ increases.', "< Now consider if ∥q∥ ≈ 0 and ∥e∥ > ∥q∥. 
When this occurs, the update to e will vanish because ∥q∥ ∥e∥ ≈ 0. This behavior may also be desirable because when q is close to the origin, there's a higher likelihood the angle between e and q would be obtuse.", '< We also explore this factor in ablation experiments for VQ-VAEs and VQGANs. Table 6 mirrors Table 1 and summarizes our findings for VQ-VAEs while Table 7 mirrors Table 3 and summarizes our findings for the VQGANs used in latent diffusion. In Table 6, we do not observe a difference between using q = ||q|| ||e|| Re and q = Re + (q -Re). However, for the VQGAN results in Table 7, we find that using the grad scaling factor modestly improves performance.', '< Table 6: Comparison of the rotation trick using q = ∥q∥ ∥e∥ Re with using q = Re + (q -Re) for VQ-VAE models. The experimental setting follows Table 1. Table 7: Comparison of the rotation trick using q = ∥q∥ ∥e∥ Re with using q = Re + (q -Re) for VQGAN models. The models with codebook size were stopped after 2 epochs while the models with codebook size 16384 were stopped after 3 epochs. A.9.2 GENERAL FAMILY OF ROTATION-BASED GRADIENT ESTIMATORS Generalizing the additive and multiplicative formulations of the rotation trick, we formulate both as specific instantiations of a more general family:', '---', '> Consequently, as `∥e∥` increases, `e` becomes more likely to map to a different `q` that forms an acute angle with it, promoting more stable and desirable gradient behavior.', '> Conversely, consider the case where `∥q∥ ≈ 0` and `∥e∥ > ∥q∥`. In this scenario, the update to `e` would effectively vanish because `∥q∥ ∥e∥ ≈ 0`. This behavior can also be beneficial, as a `q` close to the origin has a higher likelihood of forming an obtuse angle with `e`, which, as discussed, is an undesirable configuration.', '> We further investigate this factor through ablation experiments on VQ-VAEs and VQGANs. 
Table 6, which mirrors Table 1, summarizes our findings for VQ-VAEs, while Table 7, which mirrors Table 3, summarizes our findings for the VQGANs used in latent diffusion. In Table 6, we do not observe a difference between using `q = (∥q∥/∥e∥) Re` and `q = Re + (q - Re)`. However, for the VQGAN results in Table 7, we find that using the `∥q∥/∥e∥` gradient scaling factor modestly improves performance.', '> Table 6: Comparison of the Rotation Trick using `q = (∥q∥/∥e∥) Re` with using `q = Re + (q - Re)` for VQ-VAE models. The experimental setting follows Table 1.', '> Table 7: Comparison of the Rotation Trick using `q = (∥q∥/∥e∥) Re` with using `q = Re + (q - Re)` for VQGAN models. The models with codebook size 8192 were stopped after 2 epochs while the models with codebook size 16384 were stopped after 3 epochs.', '> A.9.2 GENERAL FAMILY OF ROTATION-BASED GRADIENT ESTIMATORS', '> Generalizing the additive and multiplicative formulations of the Rotation Trick, we formulate both as specific instantiations of a more general family of rotation-based gradient estimators:', '173c182', '< where γ(e) determines the multiplicative scaling factor. For q = ∥q∥ ∥e∥ Re, γ(e) = ∥q∥ ∥e∥ and for q = Re + (q -Re), γ(e) = 1. However, one can explore other scaling factors such as', '---', '> where `γ(e)` determines the multiplicative scaling factor. For `q = (∥q∥/∥e∥) Re`, `γ(e) = ∥q∥/∥e∥`, and for `q = Re + (q - Re)`, `γ(e) = 1`. However, one can explore other scaling factors, such as:', '175,176c184,185', '< We visualize the gradient fields for different formulations of γ(e) in Figure 14.', '< It is almost certain that other formulations of γ(e) from the ones we explore in this work would improve the training dynamics or performance of VQ-VAEs. 
In particular, a priori fixing γ(e) to satisfy an inductive bias or developing an adaptive scaling factor that dynamically sets γ(e) similar to the functions that adapt task weights in multi-task learning throughout training (Kendall et al., 2018;Chen et al., 2018) are exciting directions for future work.', '---', '> We visualize the gradient fields for different formulations of `γ(e)` in Figure 14, showcasing the diversity of gradient behaviors achievable within this family.', '> It is highly probable that other, as yet unexplored, formulations of `γ(e)` could further enhance the training dynamics or performance of VQ-VAEs. Particularly exciting directions for future work include a priori fixing `γ(e)` to satisfy specific inductive biases or developing adaptive scaling factors that dynamically adjust `γ(e)` throughout training, akin to adaptive task weighting functions in multi-task learning (Kendall et al., 2018; Chen et al., 2018).', '260c269', '< We thank Henry Bosch, Benjamin Spector, Dan Biderman, Jordan Juravsky, Mayee Chen, Owen Dugan, Sabri Eyuboglu, and the Hazy Group as a whole for their invaluable feedback and help during revisions of this work. We gratefully acknowledge the support of NIH under No. U54EB020405 (Mobilize), NSF under Nos. CCF2247015 (Hardware-Aware), CCF1763315 (Beyond Sparsity), CCF1563078 (Volume to Velocity), and 1937301 (RTML); US DEVCOM ARL under Nos. W911NF-23-2-0184 (Long-context) and W911NF-21-2-0251 (Interactive Human-AI Teaming); ONR under Nos. N000142312633 (Deep Signal Processing); Stanford HAI under No. 247183; NXP, Xilinx, LETI-CEA, Intel, IBM, Microsoft, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, Google Cloud, Salesforce, Total, the HAI-GCP Cloud Credits for Research program, the Stanford Data Science Initiative (SDSI), and members of the Stanford DAWN project: Meta, Google, and VMWare. The U.S. 
Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of NIH, ONR, or the U.S. Government.', '---', '> We extend our sincere gratitude to Henry Bosch, Benjamin Spector, Dan Biderman, Jordan Juravsky, Mayee Chen, Owen Dugan, Sabri Eyuboglu, and the entire Hazy Group for their invaluable feedback, insightful discussions, and dedicated assistance throughout the revision process of this work. We gratefully acknowledge the generous support from the National Institutes of Health (NIH) under Grant No. U54EB020405 (Mobilize); the National Science Foundation (NSF) under Grant Nos. CCF2247015 (Hardware-Aware), CCF1763315 (Beyond Sparsity), CCF1563078 (Volume to Velocity), and 1937301 (RTML); the U.S. DEVCOM Army Research Laboratory (ARL) under Grant Nos. W911NF-23-2-0184 (Long-context) and W911NF-21-2-0251 (Interactive Human-AI Teaming); the Office of Naval Research (ONR) under Grant Nos. N000142312633 (Deep Signal Processing); Stanford Human-Centered Artificial Intelligence (HAI) under Grant No. 247183; and corporate partners including NXP, Xilinx, LETI-CEA, Intel, IBM, Microsoft, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, Google Cloud, Salesforce, Total. We also thank the HAI-GCP Cloud Credits for Research program, the Stanford Data Science Initiative (SDSI), and the valued members of the Stanford DAWN project: Meta, Google, and VMWare. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. 
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of NIH, ONR, or the U.S. Government.', '262,267c271,272', '< Section:  ', '< 2', '<  and', '< Table 3', '< . ROT is an abbreviation for the rotation trick.', '< Table 9: Hyperparameters for the experiments in Table 2 andTable 3. We implement the rotation trick in the open source https://github.com/CompVis/taming-transformers for the experiments in Table 2 and implement the rotation trick in https://github.com/CompVis/latent-diffusion for Table 3. In both settings, we use the default hyperparameters. †: 18 epochs for ImageNet and 50 epochs for FFHQ & CelebA-HQ. We also visualize the reconstructions for the TimeSformer-VQGAN trained with the rotation trick and the STE. Figure 17 shows the reconstructions for BAIR Robot Pushing, and Figure 18 shows the   In contrast, VQ-VAEs trained with the rotation trick do not manifest this training instability. Instead, codebook usage is relatively high-at 43% for BAIR Robot Pushing and 30% for UCF101-and the reconstructions accurately match the input, even though both encoder and decoder are very small video models.', '---', '> Section: A.10.2 VQGAN EVALUATION.', '> Table 9: Hyperparameters for the experiments in Table 2 and Table 3. ROT is an abbreviation for the Rotation Trick. We implement the Rotation Trick in the open-source https://github.com/CompVis/taming-transformers for the experiments in Table 2 and in https://github.com/CompVis/latent-diffusion for Table 3. In both settings, we use the default hyperparameters. 
†: 18 epochs for ImageNet and 50 epochs for FFHQ & CelebA-HQ.', '269c274,277', '< Section: Original Video Rotation Trick Reconstructions STE Reconstructions', '---', '> Section: A.10.3 VIT-VQGAN EVALUATION.', '> Table 10: Hyperparameters for the experiments in Table 4.', '> Section: A.10.4 TIMESFORMER-VQGAN EVALUATION.', '> Table 11: Hyperparameters for the experiments in Table 5. We also visualize the reconstructions for the TimeSformer-VQGAN trained with the Rotation Trick and the STE. Figure 17 shows the reconstructions for BAIR Robot Pushing, and Figure 18 shows the reconstructions for UCF101. In contrast to STE-trained models, which often exhibit training instability and codebook collapse, VQ-VAEs trained with the Rotation Trick do not manifest these issues. Instead, codebook usage is relatively high—at 43% for BAIR Robot Pushing and 30% for UCF101—and the reconstructions accurately match the input, even with relatively small video models for both encoder and decoder.', '270a279', '> Section: Original Video Rotation Trick Reconstructions STE Reconstructions', '273c282', '< In this section, we describe the creation of Figure 4 as well as the other figures that use this format. For the top-right and bottom partitions, we fix the codebook to a set of preset values and sample pre-quantized points from four different Gaussian distributions. For the pre-quantized points in the top-left partition, we manually set them to form a crescent shape around the codeword.  We similarly fix constant gradient vectors for each partition, and apply them to the pre-quantized points after transformation by the STE, i.e. simply moving the gradient to each pre-quantized point in the quantized region, or by the rotation trick, i.e. rotating the gradient based on the angle between the pre-quantized point and closest codebook vector and rescaling appropriately. We multiply the gradient by a small constant-the learning rate-and then apply the gradient to each pre-quantized point. 
We repeat the above 25 times, at each point re-computing the angle and magnitude between the pre-quantized point and the codebook vector for the rotation trick update. For simplicity, we do not update the codebook vectors themselves or recompute codebook regions throughout the numerical simulation.', '---', '> In this section, we describe the creation of Figure 4 as well as the other figures that use this format. For the top-right and bottom partitions, we fix the codebook to a set of preset values and sample pre-quantized points from four different Gaussian distributions. For the pre-quantized points in the top-left partition, we manually set them to form a crescent shape around the codeword. We similarly fix constant gradient vectors for each partition and apply them to the pre-quantized points after transformation by the STE (i.e., simply moving the gradient to each pre-quantized point in the quantized region) or by the Rotation Trick (i.e., rotating the gradient based on the angle between the pre-quantized point and closest codebook vector and rescaling appropriately). We multiply the gradient by a small constant (the learning rate) and then apply the gradient to each pre-quantized point. We repeat the above 25 times, at each point re-computing the angle and magnitude between the pre-quantized point and the codebook vector for the Rotation Trick update. For simplicity, we do not update the codebook vectors themselves or recompute codebook regions throughout the numerical simulation.', '276c285', '< A.11 COMPARISON WITHIN GENERATIVE MODELING APPLICATIONS Absent from our work is an analysis on the effect of VQ-VAEs trained with the rotation trick on down-stream generative modeling applications. We see this comparison as outside the scope of this', '---', '> A.11 COMPARISON WITHIN GENERATIVE MODELING APPLICATIONS Absent from our work is an analysis on the effect of VQ-VAEs trained with the Rotation Trick on downstream generative modeling applications. 
We see this comparison as outside the scope of this', '552d560', '< ']
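
The Householder construction in Appendix A.5 and the constant treatment of `R` and `∥q∥/∥e∥` in Appendix A.7 can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names (`rotate_forward`, `rotate_backward`, `cos_angle`) are hypothetical, and it assumes the closed form `R = I - 2rr^T + 2 q̂ ê^T` with `r = (ê + q̂)/∥ê + q̂∥`, which is what the product of the two reflections `(I - 2 q̂ q̂^T)(I - 2rr^T)` simplifies to for unit vectors `ê` and `q̂`. The sketch never materializes `R`; it only applies `R` (forward) and `R^T` (backward) to a vector.

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _unit(u):
    """Return (u / ||u||, ||u||)."""
    n = math.sqrt(_dot(u, u))
    return [x / n for x in u], n

def _rotation_parts(e, q):
    e_hat, e_norm = _unit(e)
    q_hat, q_norm = _unit(q)
    # r bisects e_hat and q_hat; it is undefined if they are exactly antiparallel.
    r, _ = _unit([a + b for a, b in zip(e_hat, q_hat)])
    return e_hat, q_hat, r, q_norm / e_norm

def rotate_forward(e, q):
    """Forward pass (hypothetical helper): lam * R e, which reproduces q."""
    e_hat, q_hat, r, lam = _rotation_parts(e, q)
    r_e, ehat_e = _dot(r, e), _dot(e_hat, e)
    return [lam * (x - 2 * ri * r_e + 2 * qi * ehat_e)
            for x, ri, qi in zip(e, r, q_hat)]

def rotate_backward(e, q, grad_q):
    """Backward pass with lam and R held constant: lam * R^T grad_q."""
    e_hat, q_hat, r, lam = _rotation_parts(e, q)
    r_g, qhat_g = _dot(r, grad_q), _dot(q_hat, grad_q)
    return [lam * (g - 2 * ri * r_g + 2 * ei * qhat_g)
            for g, ri, ei in zip(grad_q, r, e_hat)]

def cos_angle(u, v):
    return _dot(u, v) / math.sqrt(_dot(u, u) * _dot(v, v))

# Toy 2-D example: encoder output e, codebook vector q, gradient at q.
e, q, grad_q = [3.0, 0.0], [0.0, 2.0], [1.0, 1.0]
q_out = rotate_forward(e, q)             # matches q up to rounding
grad_e = rotate_backward(e, q, grad_q)   # gradient transported to e
```

In this toy case the angle between `grad_e` and `e` equals the angle between `grad_q` and `q`, in line with Remark 3, while the gradient norm is rescaled by `∥q∥/∥e∥`. In an autodiff framework the same effect would come from detaching `R` and the scale factor from the graph rather than hand-writing the backward pass.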
