Track: long paper (up to 4 pages)
Keywords: Large Language Models, low-bit language models, quantization-aware training, pretraining of large language models, inference kernels, scaling laws
TL;DR: Scaling study of ternary large language models, introduction of the Spectra-1.1 family trained on 1.2 trillion tokens, and development of ternary inference kernels for speedups in memory-bound environments.
Abstract: Large language models (LLMs) are increasingly deployed across research and industry applications, yet their high inference cost poses a major challenge. As a potential solution, we investigate ternary language models (TriLMs), which employ quantization-aware training to significantly reduce memory requirements. We present three key contributions: (1) a comprehensive scaling law analysis showing that these models benefit more from scaling training data than their floating-point counterparts; (2) the introduction of Spectra-1.1, an open-source family of state-of-the-art TriLMs trained on up to 1.2 trillion tokens and demonstrating performance competitive with Llama-1 7B; and (3) ternary kernels for efficient inference, built on novel 1.6-bit and 2-bit packing schemes. Notably, our GPU kernel using 2-bit packing, called TriRun, achieves up to an 8$\times$ speedup over float16 baselines, enabling efficient inference in memory-constrained environments. We will release the Spectra-1.1 models along with the optimized inference kernels to encourage further research on TriLMs.
Anonymization: This submission has been anonymized for double-blind review by removing identifying information such as names, affiliations, and URLs.
Submission Number: 64
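To make the 1.6-bit and 2-bit packing schemes mentioned in the abstract concrete, below is a minimal NumPy sketch of how ternary weights in {-1, 0, +1} can be packed at 2 bits per weight (4 weights per byte) and at 1.6 bits per weight (5 weights per byte via base-3 encoding, since 3^5 = 243 ≤ 256). This is an illustration of the general packing idea under assumed layouts, not the Spectra-1.1 / TriRun kernel code; all function names and the byte layout are hypothetical.

```python
# Illustrative sketch only (hypothetical layout, not the paper's kernels):
# packing ternary weights {-1, 0, +1} into bytes.
import numpy as np

def pack_2bit(ternary: np.ndarray) -> np.ndarray:
    """Pack ternary values (-1, 0, +1) into 2 bits each, 4 weights per byte."""
    codes = (ternary + 1).astype(np.uint8)          # map {-1, 0, 1} -> {0, 1, 2}
    codes = codes.reshape(-1, 4)                    # length must be a multiple of 4
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8) # each code occupies a 2-bit slot
    return np.bitwise_or.reduce(codes << shifts, axis=1).astype(np.uint8)

def pack_1p6bit(ternary: np.ndarray) -> np.ndarray:
    """Pack 5 ternary values per byte via base-3 encoding (8/5 = 1.6 bits each)."""
    codes = (ternary + 1).astype(np.uint8).reshape(-1, 5)   # length multiple of 5
    weights = np.array([1, 3, 9, 27, 81], dtype=np.uint8)   # base-3 place values
    return (codes * weights).sum(axis=1).astype(np.uint8)   # max value 242 < 256

def unpack_1p6bit(packed: np.ndarray) -> np.ndarray:
    """Invert pack_1p6bit: recover ternary values in {-1, 0, +1}."""
    digits = packed.astype(np.int16)[:, None] // np.array([1, 3, 9, 27, 81]) % 3
    return (digits - 1).reshape(-1).astype(np.int8)

# Example: 20 random ternary weights round-trip through the 1.6-bit scheme.
w = np.random.randint(-1, 2, size=20).astype(np.int8)
assert np.array_equal(unpack_1p6bit(pack_1p6bit(w)), w)
```

The 1.6-bit layout sits closer to the information-theoretic footprint of a ternary weight (log2 3 ≈ 1.585 bits) at the cost of a base-3 decode, whereas the 2-bit layout trades a little extra memory for simple shift-and-mask unpacking, which is typically friendlier to fast GPU dequantization.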