Track: long paper (up to 4 pages)
Keywords: Large Language Models, low-bit language models, quantization-aware training, pretraining of large language models, inference kernels, scaling laws
TL;DR: Scaling study of ternary large language models, introduction of the Spectra-1.1 family trained on 1.2 trillion tokens, and development of ternary inference kernels for speedups in memory-bound environments.
Abstract: Large language models (LLMs) are increasingly deployed across research and industry applications, yet their high inference cost poses a major challenge. As a potential solution, we investigate ternary language models (TriLMs), which employ quantization-aware training to significantly reduce memory requirements. We present three key contributions: (1) a comprehensive scaling law analysis showing that these models benefit more from scaling training data than their floating-point counterparts; (2) the introduction of Spectra-1.1, an open-source family of state-of-the-art TriLMs trained on up to 1.2 trillion tokens and demonstrating performance competitive with Llama-1 7B; and (3) ternary kernels for efficient inference, built on novel 1.6-bit and 2-bit packing schemes. Notably, our GPU kernel using 2-bit packing, called TriRun, achieves up to an 8$\times$ speedup over float16 baselines, enabling efficient inference in memory-constrained environments. We will release the Spectra-1.1 models along with the optimized inference kernels to encourage further research on TriLMs.
Anonymization: This submission has been anonymized for double-blind review by removing identifying information such as names, affiliations, and URLs.
Submission Number: 64
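To make the 1.6-bit and 2-bit packing schemes mentioned in the abstract concrete, below is a minimal NumPy sketch of how ternary weights in {-1, 0, +1} can be packed at 2 bits per weight (4 weights per byte) and at 1.6 bits per weight (5 weights per byte via base-3 encoding, since 3^5 = 243 ≤ 256). This is an illustration of the general packing idea under assumed layouts, not the Spectra-1.1 / TriRun kernel code; all function names and the byte layout are hypothetical.

```python
# Illustrative sketch only (hypothetical layout, not the paper's kernels):
# packing ternary weights {-1, 0, +1} into bytes.
import numpy as np

def pack_2bit(ternary: np.ndarray) -> np.ndarray:
    """Pack ternary values (-1, 0, +1) into 2 bits each, 4 weights per byte."""
    codes = (ternary + 1).astype(np.uint8)          # map {-1, 0, 1} -> {0, 1, 2}
    codes = codes.reshape(-1, 4)                    # length must be a multiple of 4
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8) # each code occupies a 2-bit slot
    return np.bitwise_or.reduce(codes << shifts, axis=1).astype(np.uint8)

def pack_1p6bit(ternary: np.ndarray) -> np.ndarray:
    """Pack 5 ternary values per byte via base-3 encoding (8/5 = 1.6 bits each)."""
    codes = (ternary + 1).astype(np.uint8).reshape(-1, 5)   # length multiple of 5
    weights = np.array([1, 3, 9, 27, 81], dtype=np.uint8)   # base-3 place values
    return (codes * weights).sum(axis=1).astype(np.uint8)   # max value 242 < 256

def unpack_1p6bit(packed: np.ndarray) -> np.ndarray:
    """Invert pack_1p6bit: recover ternary values in {-1, 0, +1}."""
    digits = packed.astype(np.int16)[:, None] // np.array([1, 3, 9, 27, 81]) % 3
    return (digits - 1).reshape(-1).astype(np.int8)

# Example: 20 random ternary weights round-trip through the 1.6-bit scheme.
w = np.random.randint(-1, 2, size=20).astype(np.int8)
assert np.array_equal(unpack_1p6bit(pack_1p6bit(w)), w)
```

The 1.6-bit layout sits closer to the information-theoretic footprint of a ternary weight (log2 3 ≈ 1.585 bits) at the cost of a base-3 decode, whereas the 2-bit layout trades a little extra memory for simple shift-and-mask unpacking, which is typically friendlier to fast GPU dequantization.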