Zebra-Llama: Towards Extremely Efficient Hybrid Models

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY-NC-SA 4.0
Keywords: Model Efficiency, Large Language Models
Abstract: With the growing demand for deploying large language models (LLMs) across diverse applications, improving their inference efficiency is crucial for sustainable and democratized access. However, retraining LLMs to meet new user-specific requirements is prohibitively expensive and environmentally unsustainable. In this work, we propose a practical and scalable alternative: composing efficient hybrid language models from existing pre-trained models. Our approach, Zebra-Llama, introduces a family of 1B, 3B, and 8B hybrid models that combine State Space Model (SSM) and Multi-head Latent Attention (MLA) layers, using a refined initialization and post-training pipeline to efficiently transfer knowledge from pre-trained Transformers. Zebra-Llama achieves Transformer-level accuracy with near-SSM efficiency using only 7–11 billion training tokens (compared to the trillions required for pre-training) and an 8B teacher. Moreover, it dramatically reduces KV cache size to 3.9%, 2%, and 2.73% of the original for the 1B, 3B, and 8B variants, respectively, while preserving 100%, 100%, and over 97% of average zero-shot performance on LM Harness tasks. Compared to MambaInLlama, X-EcoMLA, Minitron, and Llamba, our approach consistently delivers competitive or superior accuracy while using significantly fewer training tokens, smaller teachers, and a vastly smaller KV cache. Notably, Zebra-Llama-8B surpasses Minitron-8B in few-shot accuracy by 7% while using 8× fewer training tokens, an over 12× smaller KV cache, and a smaller teacher (8B vs. 15B). It also achieves 1.4×–3.3× higher throughput (tokens/s) than MambaInLlama. The source code is released at https://github.com/AMD-AGI/AMD-Hybrid-Models.
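As a rough illustration of the hybrid-layer idea described in the abstract (not the authors' implementation; the released code at the GitHub link above is authoritative), the PyTorch sketch below interleaves a simplified SSM-style recurrent block with a simplified MLA-style block whose keys and values are reconstructed from a small shared latent. All layer counts, dimensions, class names, and the toy recurrence are assumptions made purely for illustration and are not taken from the paper.

# Illustrative sketch only: a toy "hybrid" decoder stack that interleaves
# SSM-style blocks with Multi-head Latent Attention (MLA) blocks.
# All dimensions and the layer ratio are assumptions for demonstration.
import torch
import torch.nn as nn

class ToySSMBlock(nn.Module):
    """Stand-in for an SSM layer: a gated linear recurrence with a fixed-size state (no KV cache)."""
    def __init__(self, d_model):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.decay = nn.Parameter(torch.full((d_model,), 0.9))

    def forward(self, x):                      # x: (batch, seq, d_model)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        state = torch.zeros_like(u[:, 0])      # O(1) state per sequence instead of a KV cache
        outs = []
        for t in range(u.size(1)):
            state = self.decay * state + u[:, t]
            outs.append(state)
        h = torch.stack(outs, dim=1) * torch.sigmoid(gate)
        return x + self.out_proj(h)

class ToyMLABlock(nn.Module):
    """Stand-in for MLA: keys/values are expanded from a small latent,
    so only the latent (d_latent << d_model) would need to be cached."""
    def __init__(self, d_model, n_heads=8, d_latent=64):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)    # compress tokens into the latent
        self.kv_up = nn.Linear(d_latent, 2 * d_model)  # expand latent into K and V
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        latent = self.kv_down(x)               # this is all a decoding cache would store
        k, v = self.kv_up(latent).chunk(2, dim=-1)
        h, _ = self.attn(self.q_proj(x), k, v, need_weights=False)
        return x + self.out(h)

class ToyHybrid(nn.Module):
    """Mostly-SSM stack with occasional MLA layers (the interleaving ratio is an assumption)."""
    def __init__(self, d_model=256, n_layers=8, mla_every=4):
        super().__init__()
        self.layers = nn.ModuleList(
            ToyMLABlock(d_model) if (i + 1) % mla_every == 0 else ToySSMBlock(d_model)
            for i in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

if __name__ == "__main__":
    model = ToyHybrid()
    y = model(torch.randn(2, 16, 256))
    print(y.shape)  # torch.Size([2, 16, 256])

In this toy setup, the SSM-style blocks carry only a fixed-size recurrent state and the MLA-style blocks would only need to cache a low-dimensional latent per token, which conveys the intuition behind the large KV-cache reductions reported in the abstract; the paper's actual blocks, initialization, and distillation pipeline differ and are described in the full text and repository.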
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 19621