Abstract: Quantization plays a crucial role in enabling the efficient deployment of large language models (LLMs) on memory-constrained hardware, significantly reducing memory usage and computational costs. However, extreme low-bit quantization methods often impair essential capabilities such as complex reasoning, memory retention, and adherence to instructions. In this work, we systematically evaluate state-of-the-art quantization techniques on tasks involving chain-of-thought reasoning, instruction following, and multi-agent simulations. Additionally, we investigate partial and stochastic 1-bit quantization (binarization) strategies that aim to preserve key reasoning capabilities, striking a balance between model compression and performance retention. To evaluate the effectiveness of low-bit LLMs in advanced scenarios such as multi-agent simulations, we curate two novel datasets, USMLE and NHS, to overcome the challenge of data scarcity in the domain of medical simulation and reasoning. Our experiments on LLaMA, LLaMA-3.1, and LLaMA-3.3, as well as a reasoning-centric benchmark, demonstrate the potential of quantized models to maintain functional integrity under extreme compression. The code for our work will be made publicly available.
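To make the partial and stochastic binarization idea referenced in the abstract concrete, the following is a minimal illustrative sketch, not the paper's implementation: the function name, the `keep_ratio` parameter, and the per-tensor scaling scheme are assumptions introduced here for illustration only.

```python
import torch

def stochastic_partial_binarize(w: torch.Tensor, keep_ratio: float = 0.1) -> torch.Tensor:
    """Illustrative partial + stochastic 1-bit quantization (hypothetical sketch).

    - The top `keep_ratio` fraction of weights by magnitude is kept in full precision.
    - The remaining weights are binarized stochastically to +/- alpha, where alpha is
      the mean absolute value of the binarized group and the probability of rounding
      up grows with the weight's position within [-alpha, +alpha].
    """
    flat = w.abs().flatten()
    k = max(1, int(keep_ratio * flat.numel()))
    threshold = torch.topk(flat, k).values.min()
    salient = w.abs() >= threshold                      # weights left unquantized

    alpha = w[~salient].abs().mean()                    # per-tensor scale for the binarized part
    p_up = ((w / alpha).clamp(-1, 1) + 1) / 2           # probability of mapping to +alpha
    signs = torch.where(torch.rand_like(w) < p_up,
                        torch.ones_like(w), -torch.ones_like(w))

    return torch.where(salient, w, alpha * signs)
```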
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Quantization, Fine-tuning, Partially Binarized LLMs, Multi-agent LLMs
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low compute settings-efficiency, Data resources
Languages Studied: English
Submission Number: 4593