Abstract: In this empirical article, we introduce INNAprop, an optimization algorithm that combines the INNA method with RMSprop adaptive gradient scaling. It leverages second-order information and rescaling while keeping the memory and compute requirements of standard deep learning methods such as AdamW or SGD. We evaluate INNAprop on CIFAR-10, Food101, and ImageNet with ResNets, VGG, DenseNet, and ViT. We also train GPT-2 on OpenWebText from scratch and with LoRA fine-tuning on E2E. INNAprop consistently performs on par with AdamW, and significantly better in our LLM training experiments, achieving faster convergence and higher accuracy with minimal hyperparameter tuning, even at large scale. Our code is publicly available.
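To give an intuition of the design described in the abstract, here is a minimal, hypothetical sketch of how an INNA-style two-variable update could be combined with RMSprop gradient rescaling in PyTorch. This is not the authors' INNAprop update rule (that is defined in the paper and implemented in the linked repository); the class name, hyperparameter names, and default values below are illustrative assumptions only.

```python
# Illustrative sketch only: a hypothetical combination of an INNA-style
# two-variable scheme with RMSprop-style rescaling, NOT the official INNAprop.
import torch
from torch.optim import Optimizer


class InnaRmspropSketch(Optimizer):
    """Hypothetical INNA + RMSprop hybrid; names and defaults are placeholders."""

    def __init__(self, params, lr=1e-3, alpha=0.5, beta=0.8, rho=0.99, eps=1e-8):
        defaults = dict(lr=lr, alpha=alpha, beta=beta, rho=rho, eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            lr, a, b = group["lr"], group["alpha"], group["beta"]
            rho, eps = group["rho"], group["eps"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if len(state) == 0:
                    state["psi"] = p.detach().clone()       # INNA auxiliary variable
                    state["sq_avg"] = torch.zeros_like(p)   # RMSprop second-moment EMA
                psi, sq_avg = state["psi"], state["sq_avg"]

                # RMSprop-style rescaling of the stochastic gradient.
                sq_avg.mul_(rho).addcmul_(p.grad, p.grad, value=1 - rho)
                scaled_grad = p.grad / (sq_avg.sqrt() + eps)

                # INNA-style coupled update of (theta, psi): the geometric damping
                # comes from mixing the parameter with the auxiliary variable.
                drift = (1.0 / b - a) * p - (1.0 / b) * psi
                psi.add_(drift, alpha=lr)
                p.add_(drift - b * scaled_grad, alpha=lr)
        return loss
```

The two-variable structure means the only extra state beyond RMSprop is one buffer per parameter (the auxiliary variable), which is consistent with the abstract's claim of keeping the memory footprint of standard methods; for the exact update and recommended hyperparameters, refer to the linked code and the paper.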
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: - Simplified the derivation of the algorithm.
- Tempered some claims about the performance of INNAprop.
- Added new experiments in the manuscript and in the appendix.
- Added authors' affiliations and an acknowledgements section.
Code: https://github.com/innaprop/innaprop
Supplementary Material: pdf
Assigned Action Editor: ~Mathurin_Massias1
Submission Number: 5110