Designing Injective and Low-Entropic Transformer for Short-Long Range Encoding

TMLR Paper939 Authors

11 Mar 2023 (modified: 09 May 2023) · Rejected by TMLR
Abstract: Multi-headed self-attention-based Transformers have shown promise in different learning tasks. Although these models exhibit significant improvement in understanding short-term and long-term contexts from sequences, the encoders of Transformers and their variants fail to preserve layer-wise contextual information. Transformers usually project tokens onto a sparse manifold and fail to preserve injectivity among the token representations. In this work, we propose TransJect, an encoder model that guarantees a theoretical bound on layer-wise distance preservation between any pair of tokens. We propose a simple alternative to dot-product attention that ensures Lipschitz continuity. This allows TransJect to learn injective mappings that transform token representations onto manifolds with similar topology and preserve the Euclidean distance between every pair of tokens in subsequent layers. Evaluations across multiple benchmark short- and long-sequence classification tasks show maximum improvements of $6.8\%$ and $5.9\%$, respectively, over variants of Transformers. TransJect achieves the best average accuracy on the long-range arena benchmark, showcasing its superiority in capturing temporal and spatial hierarchical relationships from long sequences. We further highlight the shortcomings of multi-headed self-attention from a statistical physics viewpoint. Although multi-headed self-attention was conceived to learn different levels of abstraction within the network, our empirical analyses suggest that different attention heads learn in a random, unordered fashion. In contrast, TransJect adopts a mixture of experts for regularization; these experts are found to be more orderly and balanced, and they learn different sparse representations from the input sequences. TransJect exhibits very low entropy and can therefore be efficiently scaled to greater depths.
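The two quantities the abstract appeals to, layer-wise preservation of pairwise Euclidean distances and the entropy of attention-style weight distributions, can be probed numerically. The sketch below is not the authors' TransJect implementation; it is a minimal NumPy diagnostic under assumed inputs (synthetic token representations and a hypothetical stochastic attention matrix), with illustrative helper names (distance_distortion, attention_entropy) chosen here rather than taken from the paper.

# Illustrative diagnostic only -- not the authors' TransJect code.
# Measures (i) how well pairwise Euclidean distances between token
# representations are preserved from one layer to the next, and
# (ii) the mean Shannon entropy of rows of an attention-like matrix.
import numpy as np

def pairwise_distances(x: np.ndarray) -> np.ndarray:
    """All pairwise Euclidean distances between rows of x (n_tokens x d)."""
    diff = x[:, None, :] - x[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def distance_distortion(x_in: np.ndarray, x_out: np.ndarray) -> float:
    """Maximum relative change in pairwise distance between layer input and output."""
    d_in = pairwise_distances(x_in)
    d_out = pairwise_distances(x_out)
    mask = d_in > 1e-8  # ignore identical token pairs
    return float(np.max(np.abs(d_out[mask] - d_in[mask]) / d_in[mask]))

def attention_entropy(weights: np.ndarray) -> float:
    """Mean Shannon entropy (in nats) of the rows of a row-stochastic matrix."""
    p = np.clip(weights, 1e-12, 1.0)
    return float(np.mean(-np.sum(p * np.log(p), axis=-1)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens_in = rng.normal(size=(16, 64))                 # hypothetical layer input
    q, _ = np.linalg.qr(rng.normal(size=(64, 64)))        # random orthogonal map
    tokens_out = tokens_in @ q                            # an isometry as a stand-in layer
    print("distortion:", distance_distortion(tokens_in, tokens_out))  # ~0 for isometries
    attn = rng.dirichlet(np.ones(16), size=16)            # hypothetical attention rows
    print("entropy:", attention_entropy(attn))

A distance-preserving (isometric) layer, such as the orthogonal map used above, yields distortion near zero, whereas an unconstrained projection generally does not; lower row entropy indicates more concentrated, ordered attention or expert assignments in the sense the abstract describes.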
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: All text changes are highlighted in blue. To clarify the doubts raised by reviewer D7jj, we added "non-normalized linear" in Theorem 1 and added a clarification to the proof of the theorem. To address the motivation of the proposed mixture of experts (MoE), we added Corollary 2 to the paper and its proof to Appendix A.5. To address the doubts raised by reviewer 3bzP, we added the derivation of the Lipschitz bound of $F$ to the proof of Lemma 3 in Appendix A.4. We added the implementation details in Appendix B.2.
Assigned Action Editor: ~Vincent_Dumoulin1
Submission Number: 939