Track: Main paper track (up to 5 pages excluding references and appendix)
Keywords: LLM; Mamba; distillation; efficiency
TL;DR: Distilled recurrent language models offering fast inference, high-quality performance with minimal data, and optimized deployment for resource-constrained devices.
Abstract: We present the Llamba model series, a family of highly efficient recurrent language models distilled from the Llama-3.x family into the Mamba architecture. The series includes Llamba-1B, Llamba-4B, and Llamba-8B, delivering fast inference throughput while maintaining competitive benchmark performance. Beyond its computational advantages, Llamba showcases the effectiveness of the MOHAWK distillation framework, achieving high-quality performance while being distilled with less than 0.1\% of the data typically used to train models of similar size. We also provide an optimized implementation of the Llamba models for deployment on resource-constrained devices, such as smartphones and edge platforms, offering a practical and memory-efficient alternative to traditional Transformer architectures. Overall, these models set new standards for speed, memory efficiency, and accessibility of language models.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 5