RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale

Published: 08 Jul 2025, Last Modified: 26 Aug 2025 · COLM 2025 · CC BY 4.0
Keywords: RADLAD, RADLADS, RWKV, Linear Attention, Conversion, LLM, SUPRA, MOHAWK, LolCATs, Hedgehog, DiJiang, Mamba in the Llama
TL;DR: RADLADS is a process for rapidly converting transformers into linear attention decoder models, and we release a set of SoTA models converted to custom RWKV variants via this process. It costs less than $2,000 USD to convert a 72B model.
Abstract: We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures, and models converted from popular Qwen2.5 open-source models in 7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens, less than 0.005% of the token count used to train the original teacher models. Converting to our 72B linear attention model costs less than $2,000 USD at today's prices, yet inference quality remains close to that of the original transformer. These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size. We release all our code on GitHub and models on HuggingFace under the Apache 2.0 license, with the exception of our 72B models, which are also governed by the Qwen License Agreement.
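The abstract describes the conversion only at a high level; the sketch below illustrates the general shape of attention distillation for such a transformer-to-linear conversion. It is a minimal, hypothetical PyTorch example, not the released RADLADS code: `LinearAttention` here is a generic kernelized linear-attention block standing in for the paper's RWKV-variant layers, and the layerwise MSE matching of a student block to a frozen teacher's softmax-attention outputs is an assumed training objective, not the paper's exact recipe.

```python
# Hypothetical sketch of attention distillation for transformer-to-linear
# conversion. NOT the released RADLADS code: the block below is a generic
# causal linear attention (Katharopoulos-style), standing in for the paper's
# RWKV-variant layers, and the loss/optimizer choices are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttention(nn.Module):
    """Causal linear attention with an elu(x)+1 positive feature map."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_model, d_model, bias=False)
        self.v = nn.Linear(d_model, d_model, bias=False)
        self.o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = (p(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for p in (self.q, self.k, self.v))
        q, k = F.elu(q) + 1, F.elu(k) + 1  # positive feature map
        # Running state S_t = sum_{s<=t} k_s v_s^T replaces the softmax matrix,
        # giving recurrent (linear-time) decoding.
        kv = torch.einsum('bhtd,bhte->bhtde', k, v).cumsum(dim=2)
        z = k.cumsum(dim=2)  # running normalizer
        num = torch.einsum('bhtd,bhtde->bhte', q, kv)
        den = torch.einsum('bhtd,bhtd->bht', q, z).clamp(min=1e-6)
        y = (num / den.unsqueeze(-1)).transpose(1, 2).reshape(B, T, D)
        return self.o(y)

@torch.no_grad()
def teacher_targets(teacher_attn: nn.Module, hidden: torch.Tensor) -> torch.Tensor:
    """Frozen teacher softmax-attention outputs for one layer."""
    return teacher_attn(hidden)

def distill_layer_step(student_attn: LinearAttention,
                       hidden: torch.Tensor,
                       target: torch.Tensor,
                       opt: torch.optim.Optimizer) -> float:
    """One layerwise matching step: fit the linear-attention block so its
    output tracks the teacher's attention output on the same hidden states."""
    loss = F.mse_loss(student_attn(hidden), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Conversion recipes in this family typically follow such layerwise matching with end-to-end distillation against the teacher's logits and a short fine-tune, which is consistent with the small 350-700M-token budget the abstract reports; consult the authors' GitHub release for the actual RADLADS procedure and architectures.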
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 871