Adaptive Adam-based optimizers using second-order weight decoupling and gradient-aware weight decay for vision transformers

Boyapati Hemanth Sai, Snehasis Mukherjee, Shiv Ram Dubey

Published: 2025 · Last Modified: 27 Feb 2026 · Mach. Vis. Appl. 2025 · CC BY-SA 4.0
Abstract: Optimizers play an important role in the performance of deep networks. Studying different optimizers is necessary to understand their effect on a deep network for a given target task, such as image classification. Several attempts have been made to investigate the effect of optimizers on the performance of CNNs. However, such experiments have not been carried out on vision transformers (ViTs), despite the recent success of ViTs in various image processing tasks. In this paper, we conduct exhaustive experiments with ViTs using different optimizers. In these experiments, we find that weight decoupling and weight decay play important roles in training ViTs. We focus on the concept of weight decoupling and explore several variations of it to investigate to what extent weight decoupling benefits a ViT. We propose two techniques that outperform weight-decoupled optimizers: (i) the weight decoupling step in existing optimizers performs a linear update of each parameter, with the weight decay as the scaling factor; we propose a quadratic update that combines a linear and a squared parameter term, again scaled by the weight decay. (ii) We propose assigning a different weight decay value to each parameter, depending on the gradient of the loss function with respect to that parameter: a smaller weight decay is used for parameters with larger gradient magnitudes, and vice versa. Image classification experiments are conducted on the CIFAR-100 and TinyImageNet datasets to compare the proposed methods against state-of-the-art optimizers such as Adam, RAdam, and AdaBelief. The code is available at https://github.com/Hemanth-Boyapati/Adaptive-weight-decay-optimizers.
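The abstract describes the two modifications only at a high level, so the following PyTorch sketch is one plausible reading rather than the authors' exact method: the quadratic decoupled decay is rendered here as a sign-preserving linear-plus-squared term, and the gradient-aware decay as weight_decay / (1 + |g|). Both functional forms, the function name adamw_quadratic_gradaware_step, and all hyperparameter defaults are assumptions, not taken from the paper; the authors' repository should be consulted for the actual update rules.

```python
import torch

def adamw_quadratic_gradaware_step(p, grad, state, lr=1e-3, betas=(0.9, 0.999),
                                   eps=1e-8, weight_decay=1e-2):
    """One AdamW-style step with (i) an assumed quadratic decoupled-decay term
    and (ii) an assumed gradient-aware per-parameter weight decay."""
    if "step" not in state:
        state["step"] = 0
        state["exp_avg"] = torch.zeros_like(p)
        state["exp_avg_sq"] = torch.zeros_like(p)

    beta1, beta2 = betas
    state["step"] += 1
    exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]

    # Standard Adam first and second moment estimates with bias correction.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    bias_c1 = 1 - beta1 ** state["step"]
    bias_c2 = 1 - beta2 ** state["step"]
    denom = (exp_avg_sq / bias_c2).sqrt_().add_(eps)

    # (ii) Gradient-aware weight decay (assumed form): parameters with larger
    # gradient magnitude receive a smaller decay, and vice versa.
    wd = weight_decay / (1.0 + grad.abs())

    # (i) Quadratic decoupled decay (assumed form): linear term plus a
    # sign-preserving squared term, both scaled by the weight decay.
    p.add_(-lr * (wd * p + wd * p * p.abs()))

    # Decoupled Adam update, applied after the decay step as in AdamW.
    p.addcdiv_(exp_avg / bias_c1, denom, value=-lr)

if __name__ == "__main__":
    # Toy usage: fit w to minimize mean squared error with the sketched step.
    torch.manual_seed(0)
    X, y = torch.randn(64, 8), torch.randn(64)
    w = torch.zeros(8, requires_grad=True)
    state = {}
    for _ in range(200):
        loss = ((X @ w - y) ** 2).mean()
        loss.backward()
        with torch.no_grad():
            adamw_quadratic_gradaware_step(w, w.grad, state, lr=1e-2)
        w.grad = None
```

As in AdamW, the decay is decoupled from the adaptive gradient step: it is applied directly to the parameter rather than being folded into the gradient, so the moment estimates are unaffected by the (here quadratic, per-parameter) shrinkage.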