The Mamba in the Llama: Distilling and Accelerating Hybrid Models

Published: 21 Jun 2024, Last Modified: 24 Jul 2024 · ES-FoMo-II 2024 Poster · CC BY 4.0
Keywords: Mamba, Distillation, Speculative Decoding
TL;DR: We distill pretrained Transformers into hybrid Mamba models by reusing attention projection weights, and accelerate their inference with hardware-aware speculative decoding.
Abstract: Recent research suggests that state-space models (SSMs) such as Mamba can be competitive with Transformers for language modeling while offering advantageous deployment characteristics. Given the community's focus and expertise on training large-scale Transformer models, we consider the challenge of converting these pretrained models into SSMs for deployment. We demonstrate that it is feasible, with academic GPU resources, to distill large Transformers into SSMs by reusing the linear projection weights from their attention layers. The resulting hybrid model, which retains a quarter of the attention layers, achieves performance comparable to the original Transformer. Moreover, we introduce a hardware-aware speculative decoding algorithm that accelerates inference for state-space models. Overall, we show how, with limited computational resources, a large Transformer can be distilled into a hybrid SSM and decoded efficiently.
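To illustrate the weight-reuse idea described in the abstract, below is a minimal sketch, not the authors' code, of initializing a toy linear-recurrence (SSM-style) layer from a pretrained attention block's projections, with the query/key/value/output weights seeding the analogous C/B/input/output projections. The class and attribute names (`AttentionBlock`, `SimpleSSMLayer`, `q_proj`, etc.) are illustrative assumptions, and the recurrence is a simplified stand-in for Mamba's actual parameterization.

```python
# Sketch: seed an SSM-style layer from pretrained attention projections,
# then fine-tune the result via distillation (distillation loop not shown).
import torch
import torch.nn as nn


class AttentionBlock(nn.Module):
    """Stand-in for a pretrained Transformer attention block (assumed names)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)


class SimpleSSMLayer(nn.Module):
    """Toy linear-recurrence layer whose projections mirror Q/K/V/O roles."""
    def __init__(self, d_model: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, d_model, bias=False)   # role of V
        self.B_proj = nn.Linear(d_model, d_model, bias=False)    # role of K
        self.C_proj = nn.Linear(d_model, d_model, bias=False)    # role of Q
        self.out_proj = nn.Linear(d_model, d_model, bias=False)  # role of O
        # Learned per-channel decay; trained further during distillation.
        self.log_decay = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); sequential scan for clarity, not speed.
        B, C, v = self.B_proj(x), self.C_proj(x), self.in_proj(x)
        decay = torch.sigmoid(self.log_decay)
        state = torch.zeros_like(v[:, 0])
        outs = []
        for t in range(x.shape[1]):
            state = decay * state + B[:, t] * v[:, t]
            outs.append(C[:, t] * state)
        return self.out_proj(torch.stack(outs, dim=1))


@torch.no_grad()
def init_ssm_from_attention(attn: AttentionBlock, ssm: SimpleSSMLayer) -> None:
    """Copy attention projection weights into the SSM layer before distillation."""
    ssm.in_proj.weight.copy_(attn.v_proj.weight)
    ssm.B_proj.weight.copy_(attn.k_proj.weight)
    ssm.C_proj.weight.copy_(attn.q_proj.weight)
    ssm.out_proj.weight.copy_(attn.o_proj.weight)


if __name__ == "__main__":
    d_model = 64
    attn, ssm = AttentionBlock(d_model), SimpleSSMLayer(d_model)
    init_ssm_from_attention(attn, ssm)
    y = ssm(torch.randn(2, 16, d_model))
    print(y.shape)  # torch.Size([2, 16, 64])
```

In a hybrid conversion along these lines, only a subset of attention blocks (a quarter, per the abstract) would be kept as-is, with the remainder replaced by SSM layers initialized this way and then distilled against the original model's outputs.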
Submission Number: 20