TriLM vs FloatLM: Ternary LLMs are more Performant than Quantized FP16 LLMs

Ayush Kaushal; Tejas Vaidhya; Tejas Pandey; Aaryan Bhagat; Irina Rish

TriLM vs FloatLM: Ternary LLMs are more Performant than Quantized FP16 LLMs

Ayush Kaushal, Tejas Vaidhya, Tejas Pandey, Aaryan Bhagat, Irina Rish

Published: 03 Jul 2024, Last Modified: 19 Jul 2024ICML 2024 FM-Wild Workshop PosterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Quantisation, Large Language models, Ternary Large Language Models, Post Training Quantization

TL;DR: This paper explores the potential benefits of ternary Large Language Models (TriLMs) by comparing their performance with their floating-point (FloatLMs) and quantized (QuantLMs) counterparts across various benchmarks and scales.

Abstract: Ternary LLMs offer significantly better performance for their size (measured in bits) than the models trained and deployed in FP16/BF16. Given the widespread usage of quantization before deployment and advancements in Post Training Quantization of LLMs, a pivotal question arises: do ternary LLMs indeed provide any discernible benefits? To address this, we first build an open family of pre-trained ternary Large Language Models (TriLM). Additionally, we include their counterparts pre-trained in FP16 (FloatLM) and quantized versions of FloatLM (QuantLM) with parameters across almost two orders of magnitude - from 99M to 3.9B parameters. We demonstrate that TriLMs with 3B+ parameters start to offer competitive performance compared to FloatLMs with the same parameter count, while providing significantly better performance for their size. TriLMs also outperform quantized models, with TriLM 3.9B surpassing the larger QuantLM-3bit 3.9B. Furthermore, across knowledge-based benchmarks, TriLM maintains a superiority for its size. To advance research on Ternary LMs, we open source over 500+ checkpoints across the model families at https://github.com/NolanoOrg/SpectraSuite.

Submission Number: 98

Loading