Track: Main paper track (up to 5 pages excluding references and appendix)
Keywords: LLM; Mamba; distillation; efficiency
TL;DR: Distilled recurrent language models offering fast inference, high-quality performance with minimal data, and optimized deployment for resource-constrained devices.
Abstract: We present the Llamba model series, a family of highly efficient recurrent language models distilled from the Llama-3.x family into the Mamba architecture. The series includes Llamba-1B, Llamba-4B, and Llamba-8B, delivering fast inference throughput while maintaining competitive benchmark performance. Beyond its computational advantages, Llamba showcases the effectiveness of the MOHAWK distillation framework, achieving high-quality performance while being distilled with less than 0.1\% of the data typically used to train models of similar size. We also provide an optimized implementation of the Llamba models for deployment on resource-constrained devices, such as smartphones and edge platforms, offering a practical and memory-efficient alternative to traditional Transformer architectures. Overall, these models set new standards for speed, memory efficiency, and accessibility of language models.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 5