SFT: Sampling-based Foundational Transformer

TMLR Paper2541 Authors

17 Apr 2024 (modified: 17 Sept 2024), Rejected by TMLR, CC BY 4.0
Abstract: The extraordinary success of transformers as sequence processing models is hindered by two factors: the quadratic complexity of self-attention and the difficulty of training transformers. In this paper, we introduce two mechanisms that address these problems: a novel neural-guided down-sampling procedure for self-attention and a new attention non-linearity that is linear-scaling and convex. Together, they not only speed up the self-attention computation but also greatly reduce the need for meticulous hyper-parameter tuning. Moreover, our relative positional encoding procedure applies to many types of data structures and can accommodate special constraints such as rotational invariance (e.g., for 3D point clouds). It is important to emphasize that our model is a foundation model that can work with multiple types of data structures, including point clouds, graphs, and long-range sequences. As a foundation model, it achieves competitive results against specialized models on standard benchmarks across these data structures, while being faster and more efficient at inference than other state-of-the-art baselines. We release our source code in the supplementary materials.
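
To make the complexity claim concrete, below is a minimal PyTorch sketch of the general down-sampled attention idea: queries attend to a down-sampled set of keys and values, so the attention map is n x m with m << n, reducing the O(n^2) cost of full self-attention to O(n * m). The class name, the strided-convolution down-sampler, and the downsample_ratio parameter are illustrative assumptions; the paper's neural-guided sampler and its new attention non-linearity are not modeled here.

    import torch
    import torch.nn as nn

    class DownsampledSelfAttention(nn.Module):
        """Hypothetical illustration: queries attend to m << n down-sampled tokens,
        so the attention matrix is (n x m) instead of (n x n)."""

        def __init__(self, dim: int, num_heads: int = 4, downsample_ratio: int = 4):
            super().__init__()
            assert dim % num_heads == 0
            self.num_heads, self.head_dim = num_heads, dim // num_heads
            self.q_proj = nn.Linear(dim, dim)
            self.kv_proj = nn.Linear(dim, 2 * dim)
            self.out_proj = nn.Linear(dim, dim)
            # Placeholder down-sampler (strided 1D conv); the paper's sampler is neural-guided.
            self.downsample = nn.Conv1d(dim, dim, kernel_size=downsample_ratio,
                                        stride=downsample_ratio)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, n, d = x.shape                                          # x: (batch, n, dim)
            q = self.q_proj(x)
            x_ds = self.downsample(x.transpose(1, 2)).transpose(1, 2)  # (batch, m, dim), m ~ n / ratio
            k, v = self.kv_proj(x_ds).chunk(2, dim=-1)
            # Split heads: (batch, heads, tokens, head_dim).
            q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
            k = k.view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
            v = v.view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
            # (n x m) attention: cost O(n * m) rather than O(n^2).
            attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
            out = (attn @ v).transpose(1, 2).reshape(b, n, d)
            return self.out_proj(out)

    # Usage: a (2, 64, 128) input yields a (2, 64, 128) output with 64 x 16 attention maps per head.
    # x = torch.randn(2, 64, 128); y = DownsampledSelfAttention(128)(x)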
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Surbhi_Goel1
Submission Number: 2541