ORCHID: FLEXIBLE AND DATA-DEPENDENT CONVOLUTION FOR SEQUENCE MODELING

ICLR 2024 Workshop ME-FoMo Submission 104 Authors

Published: 04 Mar 2024, Last Modified: 05 May 2024 · ME-FoMo 2024 Poster · CC BY 4.0
Keywords: Transformer, Attention, Convolution, BERT, LLM, structured state-space models, Vision Transformers, Subquadratic LLM
TL;DR: The paper presents Orchid, a novel architecture that reimagines sequence modeling by integrating a new data-adaptive convolution mechanism.
Abstract: In the rapidly evolving landscape of deep learning, the quest for models that balance expressivity with computational efficiency has never been more critical. Orchid is designed to address the quadratic computational complexity of attention models without sacrificing the model's ability to capture long-range dependencies. At the core of Orchid lie data-adaptive convolution layers, which conditionally adjust their kernels based on the input using a conditioning neural network. This approach enables the model to remain scalable and efficient for long sequence lengths. The adaptive nature of the convolution kernels, combined with gating operations, makes the resulting network highly expressive. We rigorously evaluate Orchid across multiple domains, including language modeling and image classification, to showcase its generality and performance. Our experiments demonstrate that Orchid not only consistently outperforms traditional attention-based architectures in most scenarios but also extends the feasible sequence length beyond the constraints of dense attention layers. This achievement marks a significant milestone in the pursuit of more efficient and scalable deep learning models for sequence modeling.
Submission Number: 104
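The abstract describes the core mechanism: a conditioning network produces a convolution kernel from the input itself, the (global) convolution is applied efficiently in the frequency domain, and a multiplicative gate modulates the output. Below is a minimal, hypothetical NumPy sketch of that idea, assuming a mean-pooling conditioning network and a sigmoid gate; the class and weight names are illustrative and do not reflect the authors' actual implementation.

```python
import numpy as np

class DataDependentConv:
    """Toy data-dependent global convolution (illustrative sketch only).

    A small conditioning network maps the input sequence to a convolution
    kernel of the same length; the convolution is applied via FFT, giving
    O(L log L) cost instead of the O(L^2) of dense attention, and a sigmoid
    gate conditioned on the input modulates the output.
    """

    def __init__(self, seq_len, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Conditioning network: mean-pooled input -> length-L kernel (assumed form).
        self.W_cond = rng.normal(0.0, 0.02, size=(dim, seq_len))
        # Gating projection (assumed form).
        self.W_gate = rng.normal(0.0, 0.02, size=(dim, dim))

    def __call__(self, x):
        # x: (seq_len, dim)
        pooled = x.mean(axis=0)            # (dim,) summary of the input
        kernel = pooled @ self.W_cond      # (seq_len,) data-dependent kernel
        # Circular convolution in the frequency domain along the sequence axis.
        Xf = np.fft.rfft(x, axis=0)
        Kf = np.fft.rfft(kernel)[:, None]  # broadcast over feature dims
        y = np.fft.irfft(Xf * Kf, n=x.shape[0], axis=0)
        # Multiplicative sigmoid gate conditioned on the input.
        gate = 1.0 / (1.0 + np.exp(-(x @ self.W_gate)))
        return gate * y
```

Because the expensive step is a pair of FFTs rather than a pairwise attention matrix, the cost grows near-linearly with sequence length, which is what allows such layers to extend beyond the sequence lengths feasible for dense attention.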