Learning the Transformer Kernel
Abstract: In this work we introduce KL-TRANSFORMER, a generic, scalable, data-driven framework for learning the kernel function in Transformers. Our framework approximates the Transformer kernel as a dot product between spectral feature maps and learns the kernel by learning the spectral distribution. This not only allows a generic kernel to be learned end-to-end, but also reduces the time and space complexity of Transformers from quadratic to linear in the sequence length. We show that KL-TRANSFORMERs achieve performance comparable to existing efficient Transformer architectures, in terms of both accuracy and computational efficiency. Our study also demonstrates that the choice of kernel has a substantial impact on performance, and that kernel learning variants are competitive alternatives to fixed-kernel Transformers on both long- and short-sequence tasks.
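The mechanism summarized in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the cosine/sine random Fourier feature map, and the fixed standard normal spectral samples `omega` are all assumptions made for illustration. In a kernel learning setting, `omega` would instead be produced by a trainable spectral distribution. The sketch shows only why a dot product between feature maps lets attention be computed in time linear in the sequence length.

```python
import numpy as np

def spectral_features(x, omega):
    """Map inputs x (n, d) to random Fourier features (n, 2m).

    omega (d, m) holds samples from a spectral distribution. Here they are
    fixed samples (an assumption); a learned kernel would train them.
    """
    proj = x @ omega
    m = omega.shape[1]
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1) / np.sqrt(m)

def linear_kernel_attention(q, k, v, omega):
    """Kernel attention in O(n) time: keys and values are summarized once."""
    phi_q = spectral_features(q, omega)   # (n, 2m)
    phi_k = spectral_features(k, omega)   # (n, 2m)
    kv = phi_k.T @ v                      # (2m, d_v), independent of n
    z = phi_k.sum(axis=0)                 # (2m,), normalizer summary
    return (phi_q @ kv) / (phi_q @ z)[:, None]

def quadratic_kernel_attention(q, k, v, omega):
    """Same computation with the explicit (n, n) kernel matrix, for reference."""
    scores = spectral_features(q, omega) @ spectral_features(k, omega).T
    return (scores @ v) / scores.sum(axis=1, keepdims=True)

# Hypothetical toy inputs: n = 8 tokens, d = 4 dimensions, m = 256 samples.
rng = np.random.default_rng(0)
q = 0.1 * rng.standard_normal((8, 4))
k = 0.1 * rng.standard_normal((8, 4))
v = rng.standard_normal((8, 4))
omega = rng.standard_normal((4, 256))     # standard normal spectrum: Gaussian kernel

out_linear = linear_kernel_attention(q, k, v, omega)
out_quadratic = quadratic_kernel_attention(q, k, v, omega)
```

Both functions return the same result up to floating-point error, but the linear version never forms the n-by-n score matrix. With standard normal `omega`, the feature dot product approximates the Gaussian kernel exp(-||q - k||^2 / 2). The inputs are kept small so that the approximate kernel values, and therefore the normalizer, stay positive.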
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: The paper has been edited in accordance with the action editor's suggestions. The last sentence of Section 4.1 has been removed, and any causal-sounding claims have been reworded to make clear that they describe correlations only. The paper has been deanonymized, and a link to GitHub has been added for reproducibility.
Assigned Action Editor: ~Matthew_Blaschko1
Submission Number: 63