Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers?

04 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Multi-Head Self-Attention, Vision Transformers, Kolmogorov-Arnold Network, Kolmogorov-Arnold Transformers, Kolmogorov-Arnold Attention
Abstract: Kolmogorov-Arnold networks (KANs) are a remarkable innovation built on learnable activation functions, with the potential to capture more complex relationships from data. Presently, KANs are deployed by replacing multilayer perceptrons (MLPs) in deep networks, including advanced architectures such as vision Transformers (ViTs). Given the success of replacing MLPs with KANs, this work asks whether a KAN could also learn token interactions. In this paper, we design the first learnable attention, **K**olmogorov-**Ar**nold **At**tention (KArAt), for ViTs; it can operate on any basis, from Fourier and wavelets to splines and rational functions. However, learnable activations in the attention cause a memory explosion. To remedy this, we propose a modular version of KArAt that uses a low-rank approximation. Adopting the Fourier basis, Fourier-KArAt and its variants in some cases outperform their traditional softmax counterparts, and otherwise show comparable performance, on the CIFAR-10, CIFAR-100, and ImageNet-1K datasets. We also deploy Fourier-KArAt in ConViT and Swin Transformer, and use it for detection and segmentation with ViT-Det. We dissect the performance of these architectures on the classification task by analyzing their loss landscapes, weight distributions, optimizer paths, attention visualizations, and transferability to other datasets, and contrast them with vanilla ViTs. KArAt's learnable activation yields better attention scores across all ViTs, indicating improved token-to-token interactions and contributing to enhanced inference. However, many factors, including the present computing interface, affect the relative performance of parameter- and memory-heavy KArAts. The goal of this paper is not to produce efficient attention or to challenge traditional activations; by designing KArAt, we show for the first time that attention can be learned, and we encourage researchers to explore KArAt in conjunction with more advanced architectures that require a careful understanding of learnable activations.
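To make the abstract's central idea concrete, below is a minimal PyTorch sketch of a Fourier-basis learnable activation applied to attention logits, with a low-rank down/up projection standing in for the "modular" memory-saving approximation the abstract describes. This is an illustration under stated assumptions, not the authors' implementation: the class name, the elementwise truncated Fourier series, the coefficient initialization, and the omission of any row normalization are all assumptions of this sketch.

```python
import torch
import torch.nn as nn


class FourierLearnableActivation(nn.Module):
    """Hypothetical sketch of a Fourier-KArAt-style learnable activation.

    Replaces softmax over raw attention logits with a learnable map
    phi(x) = sum_k a_k cos(kx) + b_k sin(kx), applied elementwise, and
    sandwiches it between low-rank projections (n -> r -> n) so the
    learnable operator acts on an r-dimensional subspace rather than
    the full sequence length n.
    """

    def __init__(self, seq_len: int, rank: int = 8, num_freqs: int = 3):
        super().__init__()
        # Learnable Fourier coefficients (assumed small-random init).
        self.a = nn.Parameter(torch.randn(num_freqs) * 0.01)  # cosine terms
        self.b = nn.Parameter(torch.randn(num_freqs) * 0.01)  # sine terms
        # Low-rank factors approximating a full n x n learnable operator.
        self.down = nn.Linear(seq_len, rank, bias=False)
        self.up = nn.Linear(rank, seq_len, bias=False)

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (batch, heads, n, n) raw logits q k^T / sqrt(d).
        z = self.down(scores)  # (B, H, n, r)
        k = torch.arange(1, self.a.numel() + 1,
                         device=scores.device, dtype=scores.dtype)
        # Truncated Fourier series, elementwise over the projected logits.
        zk = k * z.unsqueeze(-1)  # (B, H, n, r, num_freqs)
        z = (self.a * torch.cos(zk) + self.b * torch.sin(zk)).sum(-1)
        return self.up(z)  # (B, H, n, n), a drop-in for softmax weights


# Minimal usage: swap torch.softmax(scores, dim=-1) for the learned map.
B, H, n = 2, 4, 16
attn = FourierLearnableActivation(seq_len=n)
weights = attn(torch.randn(B, H, n, n))
```

The low-rank sandwich is what keeps this tractable: a full learnable unit per attention row would scale with the squared sequence length, which matches the memory explosion the abstract reports for the unfactored version.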
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 1818