Abstract: Quantization plays a crucial role in enabling the efficient deployment of large language models (LLMs) on memory-constrained hardware, significantly reducing memory usage and computational costs. However, extreme low-bit quantization methods often impair essential capabilities such as complex reasoning, memory retention, and adherence to instructions. In this work, we systematically evaluate state-of-the-art quantization techniques on tasks involving chain-of-thought reasoning, instruction following, and multi-agent simulations. Additionally, we investigate partial and stochastic 1-bit quantization (binarization) strategies that aim to preserve key reasoning capabilities, striking a balance between model compression and performance retention. To evaluate the effectiveness of low-bit LLMs in advanced scenarios such as multi-agent simulations, we curate two novel datasets, USMLE and NHS, to overcome the challenge of data scarcity in the domain of medical simulation and reasoning. Our experiments on LLaMA, LLaMA-3.1, and LLaMA-3.3, as well as a reasoning-centric benchmark, demonstrate the potential of quantized models to maintain functional integrity under extreme compression. The code for our work will be made publicly available.
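To make the partial and stochastic binarization idea referenced in the abstract concrete, the following is a minimal illustrative sketch, not the paper's implementation: the function name, the `keep_ratio` parameter, and the per-tensor scaling scheme are assumptions introduced here for illustration only.

```python
import torch

def stochastic_partial_binarize(w: torch.Tensor, keep_ratio: float = 0.1) -> torch.Tensor:
    """Illustrative partial + stochastic 1-bit quantization (hypothetical sketch).

    - The top `keep_ratio` fraction of weights by magnitude is kept in full precision.
    - The remaining weights are binarized stochastically to +/- alpha, where alpha is
      the mean absolute value of the binarized group and the probability of rounding
      up grows with the weight's position within [-alpha, +alpha].
    """
    flat = w.abs().flatten()
    k = max(1, int(keep_ratio * flat.numel()))
    threshold = torch.topk(flat, k).values.min()
    salient = w.abs() >= threshold                      # weights left unquantized

    alpha = w[~salient].abs().mean()                    # per-tensor scale for the binarized part
    p_up = ((w / alpha).clamp(-1, 1) + 1) / 2           # probability of mapping to +alpha
    signs = torch.where(torch.rand_like(w) < p_up,
                        torch.ones_like(w), -torch.ones_like(w))

    return torch.where(salient, w, alpha * signs)
```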
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Quantization, Fine-tuning, Partially Binarized LLMs, Multi-agent LLMs
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low compute settings-efficiency, Data resources
Languages Studied: English
Submission Number: 4593