Keywords: Cortical processing, recurrent transformers, vision transformers, biological inductive bias, self-attention, neuromimetic vision, non-adversarial robustness, sparse models, model compression
TL;DR: We relate details of cortical connectivity at multiple scales to transformer processing, and find in a vision task that learnable attention parameters can be reduced by over 10x while robustness against the most challenging examples improves.
Abstract: Transformer architectures in deep learning are increasingly relied on across domains with impressive results, but the observed growth in model parameters may be unsustainable, and failures in robustness limit their application. In biology, the tasks that transformers target across domains are carried out by the mammalian neocortex, yet there is no clear understanding of the relationship between processing in the neocortex and the transformer architecture. While the relationship between convolutional neural networks (CNNs) and the cortex has been studied, transformers have more complex computations and multi-scale organization, offering a richer foundation for analysis and co-inspiration. We introduce a framework that relates details of cortical connectivity at multiple organizational scales (micro-, meso-, and macro-) to transformer processing, and we investigate how cortical connectivity principles affect performance on the CIFAR-10-C computer vision robustness benchmark. Overall, we demonstrate the efficacy of our framework and find that incorporating components of cortical connectivity at multiple scales can reduce learnable attention parameters by over an order of magnitude while improving robustness against the most challenging examples in computer vision tasks. The cortical transformer framework and the design changes we investigate are generalizable across domains, may inform the development of more efficient and robust attention-based systems, and may further the understanding of the relationship between cortical and transformer processing.