AMXFP4: Taming Activation Outliers with Asymmetric Microscaling Floating-Point for 4-bit LLM Inference
Abstract: As large language models (LLMs) grow in parameter size and context length, computation precision has been reduced from 16-bit to 4-bit to improve inference efficiency. However, this reduction causes accuracy degradation due to activation outliers. Recent rotation-based INT4 quantization methods attempt to address this through rotation-matrix calibration, but they require hours of overhead per model deployment and leave significant computation unquantized in long-context scenarios. Microscaling (MX) floating-point (FP) formats offer fine-grained representation with a shared scale, enabling fully quantized matrix multiplications through direct casting without calibration. However, existing research reports unsatisfactory empirical results for MXFP4 inference, and the robustness of MX formats remains largely unexplored.
In this work, we uncover a fundamental tradeoff of the MX format: while it effectively suppresses activation outliers, it does so at the cost of increased group-wise asymmetry. To address this, we propose AMXFP4, an asymmetric microscaling 4-bit floating-point format that employs asymmetric shared scales to handle both outliers and group-wise asymmetry without requiring calibration. Our custom compute-engine implementation shows that the AMXFP4-based Multiply-Accumulate (MAC) design adds marginal resource overhead while delivering substantial accuracy improvements. Extensive experiments across benchmarks demonstrate that AMXFP4 outperforms MXFP4 on visual question answering (VQA) by 3% and surpasses rotation-based techniques on CSQA by 1.6%. Additionally, AMXFP4 outperforms the recently deployed commercial MXFP4 format.
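To make the contrast in the abstract concrete, below is a minimal NumPy sketch of group-wise 4-bit quantization: a symmetric MX-style direct cast with one shared power-of-two scale per group, next to an asymmetric variant that keeps separate shared scales for the positive and negative values of each group. The function names, the 32-element group, the real-valued (non power-of-two) asymmetric scales, and the positive/negative split itself are illustrative assumptions, not the paper's exact AMXFP4 scheme or its hardware shared-scale format.

```python
import numpy as np

# Representable magnitudes of an FP4 (E2M1) element: 0, 0.5, 1, 1.5, 2, 3, 4, 6
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def cast_to_fp4(x):
    """Round each value to the nearest signed FP4 (E2M1) value."""
    mag = np.abs(x)
    idx = np.argmin(np.abs(mag[:, None] - FP4_GRID[None, :]), axis=1)
    return np.sign(x) * FP4_GRID[idx]


def mxfp4_quantize(group):
    """Symmetric MX-style direct cast: one shared power-of-two scale per group,
    chosen so the group maximum lands near the largest FP4 magnitude (6.0)."""
    max_abs = np.max(np.abs(group))
    if max_abs == 0.0:
        return np.zeros_like(group), 1.0
    shared_exp = np.floor(np.log2(max_abs)) - np.floor(np.log2(FP4_GRID[-1]))
    scale = 2.0 ** shared_exp
    return cast_to_fp4(group / scale) * scale, scale


def amxfp4_quantize(group):
    """Asymmetric variant (illustrative): separate shared scales for the
    positive and the negative values of the group, so a skewed group
    distribution uses the full FP4 grid on both sides."""
    s_pos = max(np.maximum(group, 0.0).max(), 1e-12) / FP4_GRID[-1]
    s_neg = max(-np.minimum(group, 0.0).min(), 1e-12) / FP4_GRID[-1]
    q = np.where(group >= 0.0,
                 cast_to_fp4(group / s_pos) * s_pos,
                 cast_to_fp4(group / s_neg) * s_neg)
    return q, (s_pos, s_neg)


# Compare reconstruction error on a skewed group of 32 activations
rng = np.random.default_rng(0)
group = rng.standard_normal(32) + 1.5          # shifted, hence asymmetric
mx_q, _ = mxfp4_quantize(group)
amx_q, _ = amxfp4_quantize(group)
print("MXFP4  mean abs error:", np.mean(np.abs(group - mx_q)))
print("AMXFP4 mean abs error:", np.mean(np.abs(group - amx_q)))
```

On a skewed group, the single symmetric scale is dictated by whichever side has the larger magnitude, wasting grid points on the other side; the two-scale variant recovers that resolution without any calibration data, which is the tradeoff between outlier suppression and group-wise asymmetry described in the abstract.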
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Quantization, NLP in resource-constrained settings
Contribution Types: Approaches for low compute settings-efficiency
Languages Studied: English
Submission Number: 5674