Keywords: Hybrid Neural Architectures; Long Sequence Models; State Space Models; Linear Recurrent Neural Networks
Abstract: Recent advances in language modeling have demonstrated the effectiveness of State Space Models (SSMs) for efficient sequence modeling. While hybrid SSM architectures such as Samba and the decoder-decoder architecture YOCO have shown promising performance gains over Transformers, prior work has not investigated the efficiency potential of representation sharing between SSM layers. In this paper, we introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers, and use it to create a decoder-hybrid-decoder architecture, SambaY, by integrating GMUs into the cross-decoder of YOCO. SambaY significantly enhances decoding efficiency, preserves linear pre-filling time complexity, and boosts long-context performance, all while eliminating the need for explicit positional encoding. Through extensive scaling experiments, we demonstrate that our architecture exhibits a significantly lower irreducible loss compared to a strong YOCO baseline, indicating superior performance scalability under large-scale compute regimes. Our largest model enhanced with Differential Attention, Phi4-mini-Flash-Reasoning, achieves comparable performance to Phi4-mini-Reasoning on reasoning tasks such as Math500, AIME24, and GPQA Diamond, while delivering up to 10× higher decoding throughput on 2K-length prompts with a 32K generation length under the vLLM inference framework.
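As a rough illustration of the kind of layer-wise memory sharing the abstract describes, the PyTorch sketch below gates a memory tensor `m` (e.g., a representation produced by an earlier SSM layer) with the current hidden states `x`. The class name, projection shapes, and SiLU gate are illustrative assumptions, not the paper's exact GMU formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedMemoryUnit(nn.Module):
    """Hypothetical sketch of a gated memory-sharing layer.

    Given current hidden states `x` and a memory representation `m`
    shared from an earlier layer, the memory is modulated element-wise
    by an input-dependent gate and projected back to the model width.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        # x, m: (batch, seq_len, d_model); the gate is computed from x only.
        gate = F.silu(self.gate_proj(x))
        return self.out_proj(gate * m)


# Example usage with toy shapes.
gmu = GatedMemoryUnit(d_model=64)
x = torch.randn(2, 16, 64)  # current hidden states
m = torch.randn(2, 16, 64)  # memory shared from an earlier layer
y = gmu(x, m)               # (2, 16, 64)
```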
Submission Number: 156