Scaling Laws and Efficient Inference for Ternary Language Models

ACL ARR 2025 February Submission4298 Authors

15 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0

Abstract: Large language models (LLMs) are increasingly used across research and industry applications, yet their inference efficiency remains a significant challenge. Although the computational power of modern GPU architectures continues to improve, memory bandwidth and capacity have not scaled proportionally, creating a critical bottleneck during inference. To address this, we investigate ternary language models (TriLMs) that employ quantization-aware training to significantly reduce memory requirements. We first analyze the scalability of TriLMs through a scaling law analysis, which reveals that TriLMs benefit more from increasing training data than from scaling model parameters. Based on this observation, we introduce TriTera, an open suite of TriLMs trained on up to 1.2 trillion tokens, demonstrating sustained performance gains at scale. Furthermore, to improve inference efficiency, we propose novel 2-bit and 1.6-bit packing schemes for ternary weights, which accelerate inference across various CPU architectures. Building on the 2-bit packing, we develop a GPU kernel called TriRun that accelerates end-to-end model inference by up to 5$\times$ compared to floating-point baselines. To encourage further exploration and development of TriLMs, we will release the TriTera suite and TriRun inference kernels. Overall, our work lays the foundation for building and deploying efficient LLMs, providing a valuable resource for the research community.
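
To make the two bit widths concrete, here is a minimal illustrative sketch of how ternary weights in {-1, 0, +1} could be packed. It is not the packing layout proposed in the paper, which the abstract does not specify. It assumes the simplest possible encodings: four weights per byte at 2 bits each, and five weights per byte as a base-3 number (3^5 = 243 fits in one byte, giving 8/5 = 1.6 bits per weight). All function names are hypothetical.

```python
from typing import List


def pack_2bit(weights: List[int]) -> bytes:
    """Pack ternary weights four per byte, 2 bits each (assumed layout)."""
    out = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            byte |= (w + 1) << (2 * j)  # map {-1, 0, +1} to codes {0, 1, 2}
        out.append(byte)
    return bytes(out)


def unpack_2bit(data: bytes, n: int) -> List[int]:
    """Recover the first n ternary weights from 2-bit packed bytes."""
    vals = []
    for byte in data:
        for j in range(4):
            vals.append(((byte >> (2 * j)) & 0b11) - 1)
    return vals[:n]  # drop padding slots in the final byte


def pack_base3(weights: List[int]) -> bytes:
    """Pack ternary weights five per byte as base-3 digits (1.6 bits each)."""
    out = bytearray()
    for i in range(0, len(weights), 5):
        byte = 0
        for j, w in enumerate(weights[i:i + 5]):
            byte += (w + 1) * 3 ** j  # largest possible value is 242
        out.append(byte)
    return bytes(out)


def unpack_base3(data: bytes, n: int) -> List[int]:
    """Recover the first n ternary weights from base-3 packed bytes."""
    vals = []
    for byte in data:
        for _ in range(5):
            byte, digit = divmod(byte, 3)
            vals.append(digit - 1)
    return vals[:n]  # drop padding slots in the final byte


weights = [-1, 0, 1, 1, 0, -1, -1, 1, 0, 0]
packed2 = pack_2bit(weights)
packed3 = pack_base3(weights)
assert unpack_2bit(packed2, len(weights)) == weights
assert unpack_base3(packed3, len(weights)) == weights
```

In this sketch the ten example weights occupy 3 bytes with 2-bit packing and 2 bytes with base-3 packing. The 2-bit form decodes with shifts and masks only, while the denser base-3 form needs division by 3 (or a lookup table), which is one plausible reason to prefer 2-bit packing as the basis for a GPU kernel.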

Paper Type: Long

Research Area: Language Modeling

Research Area Keywords: Large Language Models, low-bit language models, quantization-aware training, pretraining of large language models, inference kernels and scaling laws

Contribution Types: Approaches to low-resource settings, Publicly available software and/or pre-trained models

Languages Studied: English

Submission Number: 4298
