BiXT: Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers

23 Sept 2023 (modified: 11 Feb 2024), Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Transformers, representation learning, efficiency, efficient attention, neural architectures
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: Linearly scaling Transformer architecture using bi-directional cross-attention as an efficient means of information refinement, exploiting the observation of naturally emerging symmetric cross-attention patterns.
Abstract: We present a novel bi-directional Transformer architecture (BiXT) for which computational cost and memory consumption scale linearly with input size, but without suffering the drop in performance or limitation to only one input modality seen with other efficient Transformer-based approaches. BiXT is inspired by the Perceiver architectures but replaces iterative attention with an efficient bi-directional cross-attention module in which input tokens and latent variables attend to each other simultaneously, leveraging a naturally emerging attention-symmetry between the two. This approach removes a key bottleneck experienced by Perceiver-like architectures and enables the processing and interpretation of both semantics (‘what’) and location (‘where’) to develop alongside each other over multiple layers – allowing its direct application to dense and instance-based tasks alike. By combining efficiency with the generality and performance of a full Transformer architecture, BiXT can process longer sequences like point clouds or images at higher feature resolutions. Our model achieves accuracies of up to 82.0% for classification on ImageNet1K with tiny models and no modality-specific internal components, and performs competitively on semantic image segmentation (ADE20K) and point cloud part segmentation (ShapeNetPart), even against modality-specific methods.
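To illustrate the mechanism the abstract describes, here is a minimal PyTorch sketch of bi-directional cross-attention between a small set of latents and the input tokens. This is a sketch under stated assumptions, not the authors' implementation: the module name, projection layout, and head count are hypothetical, and deriving the shared similarities from latent queries and token keys is one plausible reading. What it does reflect from the abstract is the core idea: a single pairwise similarity matrix is computed once and softmax-normalised along each axis, so both attention directions share it and the cost stays O(N_latents × N_tokens), i.e. linear in sequence length for a fixed latent count.

```python
import torch
import torch.nn as nn


class BiXTCrossAttention(nn.Module):
    """Sketch of bi-directional cross-attention (hypothetical names):
    latents and tokens attend to each other via ONE shared similarity
    matrix, giving O(N_lat * N_tok) cost -- linear in sequence length."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.h = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q_lat = nn.Linear(dim, dim, bias=False)  # latent queries
        self.k_tok = nn.Linear(dim, dim, bias=False)  # token keys
        self.v_lat = nn.Linear(dim, dim, bias=False)  # latent values
        self.v_tok = nn.Linear(dim, dim, bias=False)  # token values

    def _heads(self, x: torch.Tensor) -> torch.Tensor:
        # (B, N, D) -> (B, heads, N, D/heads)
        B, N, D = x.shape
        return x.view(B, N, self.h, D // self.h).transpose(1, 2)

    def forward(self, latents: torch.Tensor, tokens: torch.Tensor):
        q = self._heads(self.q_lat(latents))
        k = self._heads(self.k_tok(tokens))

        # Pairwise similarities, computed once: (B, heads, N_lat, N_tok).
        sim = q @ k.transpose(-2, -1) * self.scale

        # Latents gather from tokens: softmax over the token axis.
        lat = sim.softmax(dim=-1) @ self._heads(self.v_tok(tokens))
        # Tokens gather from latents via the transposed similarities,
        # exploiting the naturally emerging attention symmetry.
        tok = sim.transpose(-2, -1).softmax(dim=-1) @ self._heads(self.v_lat(latents))

        def merge(x: torch.Tensor) -> torch.Tensor:
            # (B, heads, N, D/heads) -> (B, N, D)
            return x.transpose(1, 2).flatten(2)

        return merge(lat), merge(tok)


# Usage: 64 latents refine, and are refined by, 1024 input tokens.
latents, tokens = torch.randn(2, 64, 128), torch.randn(2, 1024, 128)
new_latents, new_tokens = BiXTCrossAttention(dim=128)(latents, tokens)
```

Because the similarity matrix is shared rather than recomputed per direction, both the ‘what’ (latent) and ‘where’ (token) representations can be refined in every layer at roughly the cost of a single cross-attention pass; residual connections, normalisation, and MLP blocks are omitted here and would differ in the actual architecture.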
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7336