FACET: On-the-Fly Activation Compression for Efficient Transformer Training

Published: 01 Jan 2025 · Last Modified: 07 Nov 2025 · IEEE Trans. Circuits Syst. I Regul. Pap., 2025 · CC BY-SA 4.0
Abstract: Training Transformer models, known for their outstanding performance in various tasks, can be challenging due to extensive training times and substantial memory requirements. One promising approach to reduce the memory footprint and accelerate training is compressing activations, which account for more than half of total memory usage at large batch sizes. However, conventional compression schemes such as FP8 quantization may not adequately represent the dynamic range of Transformer activations, potentially leading to unsatisfactory accuracy. To address this issue, we propose FACET, a lightweight yet effective Transformer activation compressor and its corresponding hardware design. The compressor combines base-delta compression (BDC) and bit-plane compression (BPC), targeting the exponent and the sign/mantissa of activation data, respectively. The bitstreams generated by BDC and BPC are concatenated and then truncated to a target size, e.g., 8 bits per value. Experimental results with popular Transformer models (BERT, GPT-2, and T5) indicate that FACET reduces activation memory by 2–4$\times$ with negligible accuracy degradation. We implemented our compressor in hardware and synthesized it using a 45 nm TSMC process library. The encoder and decoder require gate counts of 16K and 12K, respectively, a 2.2–3.8$\times$ smaller overhead than other compressors. We also propose a system that integrates our compressor within memory with minimal system modifications, leveraging the compressor's small overhead.
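The abstract describes a two-stage encoding: base-delta compression over exponents, bit-plane handling of sign/mantissa bits, and truncation of the concatenated bitstream to a fixed per-value budget. The Python sketch below illustrates that general idea under stated assumptions (FP16 activations, a per-block bit budget, exponent deltas stored at a fixed width, and low-order bit planes dropped to meet the budget); the function names, bit layout, and truncation heuristic are illustrative assumptions, not the paper's exact FACET encoding or its hardware bitstream format.

```python
import numpy as np


def facet_like_compress(block: np.ndarray, bits_per_value: int = 8):
    """Lossy compression sketch for an FP16 activation block.

    Simplified pipeline: base-delta on exponents + bit-plane view of
    sign/mantissa, then truncation of low-order planes so the block fits
    the target budget. This is an illustration, not the paper's encoder.
    """
    assert block.dtype == np.float16
    raw = block.view(np.uint16)
    sign = (raw >> 15) & 0x1          # 1 sign bit
    exponent = (raw >> 10) & 0x1F     # 5 exponent bits
    mantissa = raw & 0x3FF            # 10 mantissa bits

    # Base-delta compression of exponents: store one base plus small deltas.
    base = exponent.min()
    delta = exponent - base
    delta_bits = max(int(delta.max()).bit_length(), 1)

    # Spend the remaining budget on the highest-order sign/mantissa planes.
    budget = bits_per_value * block.size           # total bit budget
    spent = 8 + block.size * delta_bits            # base byte + exponent deltas
    planes_kept = max((budget - spent) // block.size, 0)
    planes_kept = min(planes_kept, 11)             # 1 sign + 10 mantissa planes

    signman = (sign.astype(np.uint16) << 10) | mantissa
    keep_mask = np.uint16(((1 << planes_kept) - 1) << (11 - planes_kept))
    signman_kept = signman & keep_mask

    return base, delta_bits, delta.astype(np.uint8), planes_kept, signman_kept


def facet_like_decompress(base, delta_bits, delta, planes_kept, signman_kept):
    """Reassemble FP16 values; dropped low-order planes read back as zeros."""
    exponent = (base + delta.astype(np.uint16)) & 0x1F
    sign = (signman_kept >> 10) & 0x1
    mantissa = signman_kept & 0x3FF
    raw = (sign << 15) | (exponent << 10) | mantissa
    return raw.astype(np.uint16).view(np.float16)


if __name__ == "__main__":
    x = (np.random.randn(64) * 0.01).astype(np.float16)
    x_hat = facet_like_decompress(*facet_like_compress(x, bits_per_value=8))
    print("max abs error:", float(np.max(np.abs(x - x_hat))))
```

Because exponent deltas within an activation block tend to be small, most of the 8-bit budget in this sketch goes to the sign and high-order mantissa planes, which mirrors the abstract's motivation for treating exponents and mantissas with different compression schemes.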