AMXFP4: Taming Activation Outliers with Asymmetric Microscaling Floating-Point for 4-bit LLM Inference
Abstract: As large language models (LLMs) grow in parameter size and context length, computation precision has been reduced from 16-bit to 4-bit to improve inference efficiency. However, this reduction causes accuracy degradation due to activation outliers. Recent rotation-based INT4 quantization methods attempt to address this through rotation-matrix calibration, but they require hours of overhead per model deployment and leave significant computation unquantized in long-context scenarios. Microscaling (MX) floating-point (FP) formats offer fine-grained representation with a shared scale, enabling fully quantized matrix multiplications through direct casting without calibration. However, existing research reports unsatisfactory empirical results for MXFP4 inference, and the robustness of MX formats remains largely unexplored.
In this work, we uncover a fundamental tradeoff of the MX format: while it effectively suppresses activation outliers, it does so at the cost of increased group-wise asymmetry. To address this, we propose AMXFP4, an asymmetric microscaling 4-bit floating-point format that employs asymmetric shared scales to handle both outliers and group-wise asymmetry without requiring calibration. Our custom compute-engine implementation shows that the AMXFP4-based Multiply-Accumulate (MAC) design adds marginal resource overhead while delivering substantial accuracy improvements. Extensive experiments across benchmarks demonstrate that AMXFP4 outperforms MXFP4 on visual question answering (VQA) by 3% and surpasses rotation-based techniques on CSQA by 1.6%. Additionally, AMXFP4 outperforms the recently deployed commercial MXFP4 format.
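To make the contrast in the abstract concrete, below is a minimal NumPy sketch of group-wise 4-bit quantization: a symmetric MX-style direct cast with one shared power-of-two scale per group, next to an asymmetric variant that keeps separate shared scales for the positive and negative values of each group. The function names, the 32-element group, the real-valued (non power-of-two) asymmetric scales, and the positive/negative split itself are illustrative assumptions, not the paper's exact AMXFP4 scheme or its hardware shared-scale format.

```python
import numpy as np

# Representable magnitudes of an FP4 (E2M1) element: 0, 0.5, 1, 1.5, 2, 3, 4, 6
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def cast_to_fp4(x):
    """Round each value to the nearest signed FP4 (E2M1) value."""
    mag = np.abs(x)
    idx = np.argmin(np.abs(mag[:, None] - FP4_GRID[None, :]), axis=1)
    return np.sign(x) * FP4_GRID[idx]


def mxfp4_quantize(group):
    """Symmetric MX-style direct cast: one shared power-of-two scale per group,
    chosen so the group maximum lands near the largest FP4 magnitude (6.0)."""
    max_abs = np.max(np.abs(group))
    if max_abs == 0.0:
        return np.zeros_like(group), 1.0
    shared_exp = np.floor(np.log2(max_abs)) - np.floor(np.log2(FP4_GRID[-1]))
    scale = 2.0 ** shared_exp
    return cast_to_fp4(group / scale) * scale, scale


def amxfp4_quantize(group):
    """Asymmetric variant (illustrative): separate shared scales for the
    positive and the negative values of the group, so a skewed group
    distribution uses the full FP4 grid on both sides."""
    s_pos = max(np.maximum(group, 0.0).max(), 1e-12) / FP4_GRID[-1]
    s_neg = max(-np.minimum(group, 0.0).min(), 1e-12) / FP4_GRID[-1]
    q = np.where(group >= 0.0,
                 cast_to_fp4(group / s_pos) * s_pos,
                 cast_to_fp4(group / s_neg) * s_neg)
    return q, (s_pos, s_neg)


# Compare reconstruction error on a skewed group of 32 activations
rng = np.random.default_rng(0)
group = rng.standard_normal(32) + 1.5          # shifted, hence asymmetric
mx_q, _ = mxfp4_quantize(group)
amx_q, _ = amxfp4_quantize(group)
print("MXFP4  mean abs error:", np.mean(np.abs(group - mx_q)))
print("AMXFP4 mean abs error:", np.mean(np.abs(group - amx_q)))
```

On a skewed group, the single symmetric scale is dictated by whichever side has the larger magnitude, wasting grid points on the other side; the two-scale variant recovers that resolution without any calibration data, which is the tradeoff between outlier suppression and group-wise asymmetry described in the abstract.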
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Quantization, NLP in resource-constrained settings
Contribution Types: Approaches for low compute settings-efficiency
Languages Studied: English
Submission Number: 5674