Post Training Quantization of Large Language Models with Microscaling Formats

ACL ARR 2024 June Submission2891 Authors

15 Jun 2024 (modified: 22 Jul 2024)ACL ARR 2024 June SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Abstract: Large Language Models (LLMs) have distinguished themselves with outstanding performance in complex language modeling tasks, yet they come with significant computational and storage challenges. This paper explores the potential of quantization to mitigate these challenges. We systematically study the combined application of three well-known post-training techniques, SmoothQuant, AWQ, and GPTQ, and provide a comprehensive analysis of their interactions and implications for advancing LLM quantization. We enhance the versatility of these techniques by enabling quantization to microscaling (MX) formats, expanding their applicability beyond their initial fixed-point format targets. We show that combining different PTQ methods enables us to quantize models to 4-bit weights and 8-bit activations using the MXINT format with negligible accuracy loss compared to the uncompressed baseline.
Paper Type: Short
Research Area: Special Theme (conference specific)
Research Area Keywords: Quantization, NLP in resource-constrained settings, efficient inference
Contribution Types: Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 2891
Loading