Track: long paper (up to 8 pages)
Keywords: State Space Models, Mamba, Graph Neural Networks
TL;DR: Chimera directly integrates the data's graph topology, without position embeddings, by generalizing State Space Models (SSMs)
Abstract: Powerful deep learning methods based on Transformers are used to model diverse data modalities such as sequences, images, and graphs. These methods rely on self-attention, which treats data as an unordered set of elements. This ignores the neighborhood structure or graph topology of the data and requires the use of inductive biases, such as position embeddings, to incorporate topology. However, developing bespoke inductive biases for each task requires significant effort and can introduce side effects that hinder generalization. In this work, we introduce Chimera, a unified model that directly incorporates the data topology in a principled way, bypassing the need for domain-specific biases. Central to Chimera is the observation that state-space models---which naturally do not require position embeddings---can be generalized to capture any general graph topology. Our model achieves state-of-the-art performance across language, vision, and graphs, outperforming BERT on GLUE by 0.7 points, ViT on ImageNet-1k by 2.6%, and all baselines on the Long Range Graph Benchmark, demonstrating that it is capable of modeling both short- and long-range interactions between nodes. Our results validate Chimera's principled methodological contributions and affirm the long-held belief that data topology is a powerful inductive bias across modalities. We further propose algorithmic optimizations to improve Chimera's efficiency while maintaining performance: 1) For the subclass of Directed Acyclic Graphs, we show that Chimera can be implemented as a linear-time recurrence. 2) For general graphs, we relax the method with a simple mathematical approximation, achieving the Transformer's quadratic complexity without relying on domain-specific biases.
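To make the linear-time claim for Directed Acyclic Graphs concrete, here is a minimal sketch of how an SSM recurrence might be generalized from a chain to a DAG: each node's hidden state aggregates its parents' states in a single topological sweep, which is linear in the number of edges. The function name `dag_ssm_scan`, the parent-sum aggregation, and the parameter matrices A, B, C are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np
from graphlib import TopologicalSorter

def dag_ssm_scan(x, adjacency, A, B, C):
    """Hypothetical sketch: a linear SSM-style recurrence over a DAG.

    x          : (num_nodes, d_in) node features
    adjacency  : dict mapping node index -> iterable of parent node indices (a DAG)
    A, B, C    : SSM parameter matrices (state transition, input, output)

    Nodes are visited in topological order, so every parent state is
    available when its children are processed; the whole pass touches
    each edge once, i.e. it runs in linear time.
    """
    d_state = A.shape[0]
    h = {}
    y = np.zeros((x.shape[0], C.shape[0]))
    for v in TopologicalSorter(adjacency).static_order():
        # Aggregate parent hidden states (zero vector for source nodes).
        parent_state = sum((h[u] for u in adjacency.get(v, ())),
                           start=np.zeros(d_state))
        h[v] = A @ parent_state + B @ x[v]   # recurrence generalized from a chain
        y[v] = C @ h[v]                      # per-node readout
    return y
```

When the DAG is a simple chain (each node's only parent is its predecessor), this reduces to the standard sequential SSM scan, which is one way to see why no position embeddings are needed: ordering information is carried entirely by the graph topology.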
Submission Number: 57