Learning the Transformer Kernel

Sankalan Pal Chowdhury; Adamos Solomou; Kumar Avinava Dubey; Mrinmaya Sachan

Published: 21 Jul 2022, Last Modified: 17 Sept 2024. Accepted by TMLR.

Abstract: In this work we introduce KL-TRANSFORMER, a generic, scalable, data-driven framework for learning the kernel function in Transformers. Our framework approximates the Transformer kernel as a dot product between spectral feature maps and learns the kernel by learning the spectral distribution. This not only helps in learning a generic kernel end-to-end, but also reduces the time and space complexity of Transformers from quadratic to linear. We show that KL-TRANSFORMERs achieve performance comparable to existing efficient Transformer architectures, in terms of both accuracy and computational efficiency. Our study also demonstrates that the choice of kernel has a substantial impact on performance, and that kernel-learning variants are competitive alternatives to fixed-kernel Transformers on both long- and short-sequence tasks.
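To make the abstract's central idea concrete, the following is a minimal NumPy sketch of attention computed through a spectral (random Fourier) feature map. It is an illustration under stated assumptions, not the authors' implementation: the function names, the cosine/sine feature map, and the Gaussian spectral distribution with `mean` and `log_scale` parameters are all hypothetical choices made here. In a kernel-learning setting, `mean` and `log_scale` would be trained end-to-end; here they are fixed so that the feature map approximates a Gaussian kernel.

```python
import numpy as np

def spectral_features(x, omega):
    """Map x (n, d) to features (n, 2M) so that phi(x) . phi(y) ~ kernel(x, y).

    omega (d, M) holds M frequency samples from the spectral distribution.
    """
    proj = x @ omega
    m = omega.shape[1]
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1) / np.sqrt(m)

def linear_kernel_attention(q, k, v, omega):
    """Kernel attention in time linear in sequence length n."""
    phi_q = spectral_features(q, omega)      # (n, 2M)
    phi_k = spectral_features(k, omega)      # (n, 2M)
    kv = phi_k.T @ v                         # (2M, d_v), summed over keys once
    z = phi_k.sum(axis=0)                    # (2M,), normalizer statistics
    return (phi_q @ kv) / (phi_q @ z)[:, None]

def quadratic_kernel_attention(q, k, v, omega):
    """Same computation, but materializing the full (n, n) kernel matrix."""
    scores = spectral_features(q, omega) @ spectral_features(k, omega).T
    return (scores / scores.sum(axis=1, keepdims=True)) @ v

rng = np.random.default_rng(0)
n, d, m = 8, 4, 2048
q = 0.3 * rng.standard_normal((n, d))
k = 0.3 * rng.standard_normal((n, d))
v = rng.standard_normal((n, d))

# Hypothetical spectral parameters: these are what a learned variant would train.
# With mean = 0 and scale = 1 the samples follow N(0, I), i.e. a Gaussian kernel.
mean = np.zeros((d, 1))
log_scale = np.zeros((d, 1))
omega = mean + np.exp(log_scale) * rng.standard_normal((d, m))

out_linear = linear_kernel_attention(q, k, v, omega)
out_quadratic = quadratic_kernel_attention(q, k, v, omega)
```

The two attention functions agree up to floating-point error, but the linear version never forms the n-by-n score matrix: its cost grows with n times the number of features rather than with n squared. One caveat of this particular feature map is that the approximated kernel values can be negative, so the normalizer is not guaranteed to be positive for arbitrary inputs; the small-norm inputs above keep it well behaved.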

Submission Length: Regular submission (no more than 12 pages of main content)

Changes Since Last Submission: Paper edited in accordance with the action editor's suggestions. The last sentence of Section 4.1 has been removed, and any causal-sounding claims have been edited to clarify that they describe correlations only. The paper has been deanonymized, and a link to GitHub has been added for reproducibility.

Code: https://github.com/cs1160701/OnLearningTheKernel

Assigned Action Editor: ~Matthew_Blaschko1

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Submission Number: 63
