DuaRot: Dual Rotation for Advanced Outlier Mitigation in Rotated LLMs

ICLR 2025 Conference Submission 319 Authors

13 Sept 2024 (modified: 23 Nov 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: large language model, Walsh–Hadamard transform, reparameterization, hardware-aware, rotational invariance
TL;DR: We propose a dual rotation method based on reparameterization, together with a hardware-aware matrix configuration strategy, to improve the performance of quantized LLMs.
Abstract: By employing rotation, outliers in activations can be effectively mitigated without altering the output, thereby facilitating the quantization of large language models (LLMs). However, existing rotation-based methods consider only global activation distributions, leaving finer-grained distributions underexplored. Additionally, these methods rely predominantly on the Walsh–Hadamard transform (WHT) to accelerate online rotation operations, without fully accounting for the actual runtime trade-off between matrix multiplication (Matmul) and the WHT. These limitations restrict the rotation's ability to reduce quantization errors and slow down inference, leaving room for improvement in both accuracy and speed. In this paper, we propose a dual rotation method for rotation matrices, dubbed DuaRot, based on reparameterization. During training, DuaRot sequentially refines global and local features to achieve effective outlier mitigation. During inference, the global and local rotations can be merged, which maintains rotational invariance without introducing additional computational overhead. Meanwhile, we propose a hardware-aware matrix configuration strategy that determines whether the online Hadamard matrix should be expanded into a trainable parameter space by taking the runtimes of the WHT and Matmul into account. This further reduces the quantization error of online rotation operations without compromising inference speed. Extensive experiments demonstrate that DuaRot outperforms existing methods across various models and quantization configurations. For instance, when applied to LLaMA3-8B, DuaRot achieves WikiText-2 perplexities of 7.49 and 7.41 under the W4A4KV4 and W4A4KV16 configurations with Round-to-Nearest (RTN) quantization, improving on the state of the art by 0.51 and 0.41, respectively. The code will be made publicly available.
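As a rough illustration of the invariance argument behind rotation-based quantization (a minimal sketch, not the authors' implementation), the snippet below folds an orthonormal Hadamard rotation and a placeholder "local" orthogonal refinement into a single offline matrix, checks that the output is unchanged, and compares Round-to-Nearest activation-quantization error with and without the merged rotation. The `hadamard`, `rtn_quantize`, and `Q` names are illustrative assumptions and not part of DuaRot's code.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix via Sylvester construction (n must be a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def rtn_quantize(x: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric per-tensor Round-to-Nearest (RTN) fake quantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

d = 64
rng = np.random.default_rng(0)
X = rng.normal(size=(8, d))
X[:, 3] *= 50.0                                 # inject an outlier channel into the activations
W = rng.normal(size=(d, d)) / np.sqrt(d)

# A "global" Hadamard rotation and a stand-in for a learned "local" refinement;
# both are orthogonal, so they can be merged offline into a single rotation R.
R_global = hadamard(d)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))    # hypothetical local rotation (illustrative only)
R = R_global @ Q                                # merged once, offline: still orthogonal

# Rotational invariance: rotating activations and counter-rotating weights
# leaves the output unchanged (up to floating-point error).
assert np.allclose((X @ R) @ (R.T @ W), X @ W, atol=1e-8)

# The rotation spreads the outlier's energy across channels, so per-tensor
# RTN quantization of the activations loses far less information.
err_plain = np.linalg.norm(rtn_quantize(X) @ W - X @ W)
err_rot = np.linalg.norm(rtn_quantize(X @ R) @ (R.T @ W) - X @ W)
print(f"4-bit RTN activation error without rotation: {err_plain:.3f}")
print(f"4-bit RTN activation error with rotation:    {err_rot:.3f}")
```

In the paper's setting the local rotations are trained rather than random, and, per the abstract, the hardware-aware configuration strategy additionally uses measured WHT versus Matmul runtimes to decide whether the online Hadamard matrix should be expanded into a trainable parameter space.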
Supplementary Material: pdf
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 319