Pretrained Hybrids with MAD Skills

23 Sept 2024 (modified: 05 Feb 2025) | Submitted to ICLR 2025 | Readers: Everyone | CC BY 4.0
Keywords: hybrid architectures, large language models, transformers, state space models, model merging, neural architecture search, mechanistic search
TL;DR: We develop a framework for creating pretrained hybrid models from existing pretrained models.
Abstract: While Transformers underpin modern large language models (LMs), a growing list of alternative architectures with new capabilities, promises, and tradeoffs is emerging. This makes choosing the right LM architecture challenging. Recently proposed *hybrid architectures* seek a best-of-all-worlds approach that reaps the benefits of all architectures. Hybrid design is difficult for two reasons: it requires manual expert-driven search, and new hybrids must be trained from scratch. We propose **Manticore**, a framework that addresses these challenges by *automating the design of hybrid architectures* while reusing pretrained models to create *pretrained* hybrids. Our approach augments ideas from differentiable Neural Architecture Search (NAS) by incorporating simple projectors that translate features between pretrained blocks from different architectures. We then fine-tune hybrids that combine pretrained models from different architecture families---such as the GPT series and Mamba---end-to-end. With Manticore, we enable LM selection without training multiple models, the construction of pretrained hybrids from existing pretrained models, and the ability to *program* pretrained hybrids to have certain capabilities. Manticore hybrids match existing manually-designed hybrids, achieve strong performance on the Long Range Arena benchmark, and improve on pretrained transformers and state space models on various natural language tasks.
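To make the abstract's mixture mechanism concrete, the following is a minimal sketch, not the authors' implementation, of how pretrained blocks from different architecture families could be wrapped with simple linear projectors and combined through softmax-weighted, differentiable mixture weights in the spirit of differentiable NAS. The `MixtureBlock` class, the dimensions, and the stand-in blocks (a Transformer encoder layer and a gated MLP standing in for a state space block such as Mamba) are hypothetical illustrations, not the paper's code.

```python
import torch
import torch.nn as nn


class MixtureBlock(nn.Module):
    """Hypothetical sketch: mix pretrained blocks from different architectures."""

    def __init__(self, blocks, block_dims, hybrid_dim):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        # Projectors translate features between the hybrid's hidden size
        # and each pretrained block's native hidden size.
        self.proj_in = nn.ModuleList(nn.Linear(hybrid_dim, d) for d in block_dims)
        self.proj_out = nn.ModuleList(nn.Linear(d, hybrid_dim) for d in block_dims)
        # Learnable mixture weights (architecture parameters), one per block.
        self.alpha = nn.Parameter(torch.zeros(len(blocks)))

    def forward(self, x):  # x: (batch, seq_len, hybrid_dim)
        w = torch.softmax(self.alpha, dim=0)
        out = 0.0
        for wi, blk, pin, pout in zip(w, self.blocks, self.proj_in, self.proj_out):
            out = out + wi * pout(blk(pin(x)))
        return out


# Usage with stand-in blocks (assumed dimensions for illustration only):
transformer_block = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
ssm_standin = nn.Sequential(nn.Linear(96, 192), nn.SiLU(), nn.Linear(192, 96))
hybrid = MixtureBlock([transformer_block, ssm_standin], block_dims=[64, 96], hybrid_dim=128)
y = hybrid(torch.randn(2, 16, 128))  # -> (2, 16, 128)
```

In this sketch, both the projectors and the mixture weights are trained end-to-end during fine-tuning, so the hybrid can learn how much to rely on each pretrained block.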
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3254