Abstract: In this paper, we train models on various Natural Language Processing (NLP) tasks using the Input Normalized Stochastic Gradient Descent (INSGD) optimizer. Specifically, we fine-tune the Bidirectional Encoder Representations from Transformers (BERT) model on the General Language Understanding Evaluation (GLUE) benchmark with the INSGD optimizer, using the Adaptive Moment Estimation (Adam) and Stochastic Gradient Descent (SGD) optimizers as performance baselines. The INSGD optimizer builds on SGD, adding adaptive \(L_{1}\)- and \(L_{2}\)-based learning-rate normalizations computed from layer inputs and drawing inspiration from the Normalized Least Mean Square (NLMS) algorithm to improve weight updates. We assess performance using the GLUE score on the validation data. Our experiments demonstrate that INSGD achieves higher GLUE scores than SGD and Adam across multiple datasets in the GLUE benchmark: it surpasses SGD on the RTE, MRPC, and CoLA datasets, Adam on the MNLI-mismatched dataset, and both SGD and Adam on the MNLI-matched, QQP, QNLI, and SST-2 datasets.
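For intuition only, the following is a minimal PyTorch sketch of an NLMS-style, input-normalized SGD step of the kind the abstract describes: the learning rate is scaled by the inverse \(L_{1}\) or \(L_{2}\) norm of the layer's input. The function name, signature, and hyperparameter values are our own illustrative assumptions; the exact INSGD update rule is the one defined in the paper.

```python
import torch

def input_normalized_sgd_step(weight: torch.Tensor,
                              grad: torch.Tensor,
                              layer_input: torch.Tensor,
                              lr: float = 1e-2,
                              eps: float = 1e-8,
                              norm: str = "l2") -> torch.Tensor:
    """Illustrative NLMS-style update (not the paper's exact rule):
    scale the step size by the inverse norm of the layer's input so that
    large-magnitude inputs do not cause oversized weight changes."""
    if norm == "l2":
        denom = eps + layer_input.pow(2).sum()
    else:  # L1-based normalization
        denom = eps + layer_input.abs().sum()
    with torch.no_grad():
        weight -= (lr / denom) * grad
    return weight

# Toy usage: one linear layer, one normalized update.
x = torch.randn(8, 16)                      # batch of layer inputs
w = torch.randn(16, 1, requires_grad=True)  # layer weights
loss = (x @ w).pow(2).mean()
loss.backward()
input_normalized_sgd_step(w, w.grad, x)
```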