Beyond Cosine Decay: On the effectiveness of Infinite Learning Rate Schedule for Continual Pre-training

Published: 11 Jun 2025, Last Modified: 10 Jul 2025 · ES-FoMo III · CC BY 4.0
Keywords: Infinite Learning Rate schedule, Learning Rate Schedule, Continual pre-training, Self Supervised, Continual Learning, Forgetting
TL;DR: We establish the performance of infinite learning rate schedules in the context of self-supervised continual pre-training, showing that the until-now overlooked area of learning rate scheduling is important for continual learning.
Abstract: The growing availability of unlabeled data offers both opportunities and challenges for training AI systems. Self-supervised learning (SSL) has emerged as a powerful method for extracting representations from such data, but existing techniques struggle to adapt to non-stationary, non-IID real-world data without forgetting prior knowledge. While recent works use a cosine annealing schedule for continual pre-training, this approach causes forgetting during re-warming and has not been compared to other SSL methods. In this work, we compare the cosine schedule with the recently proposed infinite learning rate schedule and find the latter to be more effective. Our extensive evaluation across image and language datasets shows that the infinite learning rate schedule is a flexible and robust alternative, performing well without needing a fixed iteration budget. It demonstrates stable and effective performance in both small and large-scale pre-training setups, retaining knowledge and adapting across tasks.
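To illustrate the schedule family the abstract refers to, below is a minimal sketch of a generic infinite learning rate schedule (linear warmup, a cooldown to a constant plateau, and an optional annealing phase applied only when a checkpoint is finalized). All function names, phase lengths, and rate values here are illustrative assumptions, not values or code from the paper; the key point is that the constant plateau can run indefinitely, so no fixed iteration budget is required.

```python
# Hypothetical sketch of an "infinite" learning rate schedule:
# warmup -> cooldown -> constant plateau, plus an optional annealing phase
# used only before evaluating/deploying a checkpoint. All constants are
# illustrative assumptions, not the paper's hyperparameters.
import math


def infinite_lr(step, warmup_steps=1000, cooldown_steps=4000,
                max_lr=3e-4, const_lr=1e-4):
    """Learning rate at a given optimizer step.

    Phases:
      1) linear warmup to max_lr,
      2) cosine cooldown from max_lr to const_lr,
      3) constant plateau at const_lr, which runs indefinitely so new data
         can be continually pre-trained on without re-warming from zero.
    """
    if step < warmup_steps:
        return max_lr * step / max(1, warmup_steps)
    if step < warmup_steps + cooldown_steps:
        progress = (step - warmup_steps) / cooldown_steps
        return const_lr + 0.5 * (max_lr - const_lr) * (1 + math.cos(math.pi * progress))
    return const_lr  # constant phase: continual pre-training continues here


def anneal(step, anneal_start, anneal_steps, const_lr=1e-4, min_lr=1e-5):
    """Optional final annealing from const_lr down to min_lr (exponential)."""
    progress = min(1.0, max(0.0, (step - anneal_start) / anneal_steps))
    return const_lr * (min_lr / const_lr) ** progress


if __name__ == "__main__":
    for s in [0, 500, 1000, 3000, 5000, 50_000]:
        print(s, round(infinite_lr(s), 6))
```

By contrast, a cosine schedule decays to its minimum over a fixed horizon, so continuing pre-training on new data requires re-warming the learning rate, which is the source of the forgetting discussed in the abstract.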
Submission Number: 105