On the Ricci Curvature of Attention Maps and Transformer Training and Robustness

Published: 23 Oct 2024, Last Modified: 24 Feb 2025. NeurReps 2024 Poster. License: CC BY 4.0
Keywords: Transformers, Attention, Geometry, Robustness
TL;DR: This study connects the Ricci curvature of attention maps, a measure of graph stability, to transformer training, performance, and robustness, and introduces a curvature-adjustment technique to control the behavior of transformers.
Abstract: Transformer models have revolutionized machine learning, yet the underpinnings of their success are only beginning to be understood. In this work, we analyze transformers through the geometry of their attention maps, treating them as weighted graphs and focusing on Ricci curvature, a quantity linked to spectral properties and system robustness. We prove that lower Ricci curvature, which indicates lower system robustness, leads to faster convergence of gradient descent during training. We also show that a higher frequency of positive curvature values enhances robustness, revealing a trade-off between performance and robustness. Building on this, we propose a regularization method that adjusts the curvature distribution, and we provide experimental results that support our theoretical predictions while offering insights into ways to improve transformer training and robustness. The geometric perspective developed in this paper offers a versatile framework for both understanding and improving the behavior of transformers.
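To make the central object concrete: the page does not specify which discrete notion of Ricci curvature the paper uses, so the sketch below computes the Forman-Ricci curvature (one standard graph analogue, here with unit node weights) of the weighted graph obtained by symmetrizing an attention matrix. The symmetrization, the sparsification threshold tau, and the function name forman_ricci_curvatures are illustrative assumptions, not the paper's method.

```python
import numpy as np

def forman_ricci_curvatures(attn: np.ndarray, tau: float = 0.02) -> dict:
    """Forman-Ricci curvature (unit node weights) of each edge of the
    weighted graph induced by an attention map.

    attn : (n, n) attention matrix; symmetrized into undirected edge weights.
    tau  : weights below this threshold are dropped to sparsify the graph
           (an illustrative choice, not taken from the paper).
    Returns {(i, j): curvature} for each retained edge with i < j.
    """
    n = attn.shape[0]
    W = 0.5 * (attn + attn.T)   # symmetrize: undirected edge weights
    np.fill_diagonal(W, 0.0)    # ignore self-attention loops
    W[W < tau] = 0.0            # sparsify weak connections
    curv = {}
    for i in range(n):
        for j in range(i + 1, n):
            w_e = W[i, j]
            if w_e == 0.0:
                continue
            # Sums over edges incident to each endpoint, excluding e itself
            s_i = sum(1.0 / np.sqrt(w_e * W[i, k])
                      for k in range(n) if k != j and W[i, k] > 0.0)
            s_j = sum(1.0 / np.sqrt(w_e * W[j, k])
                      for k in range(n) if k != i and W[j, k] > 0.0)
            # Forman curvature with node weights 1: F(e) = w_e * (2/w_e - s_i - s_j)
            curv[(i, j)] = 2.0 - w_e * (s_i + s_j)
    return curv

# Toy usage: row-softmax attention over random scores
rng = np.random.default_rng(0)
scores = rng.normal(size=(6, 6))
attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
curv = forman_ricci_curvatures(attn)
frac_positive = sum(c > 0 for c in curv.values()) / max(len(curv), 1)
print(f"fraction of positively curved edges: {frac_positive:.2f}")
```

Edges whose endpoints carry many strong competing connections come out negatively curved, so the distribution of these per-edge values over an attention head is the kind of curvature profile the proposed regularization method would adjust.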
Submission Number: 16