Keywords: language models, continual learning, benchmark, temporal adaptation
TL;DR: We introduce a benchmark for training language models continually over months and years.
Abstract: Large language models (LLMs) are trained on data crawled from the web over many years. We investigate how quickly LLMs become outdated as the world evolves over time and how best to update them with newer data. Specifically, we simulate a world where the latest dump of Common Crawl (CC), the most prominent public source of pre-training data, is used every month to *continually* train an LLM. We design various dynamic evaluations from the CC data, Wikipedia, StackExchange, and code documentation to measure continual learning metrics such as forgetting and forward transfer. Notably, our TiC-CC training data is more than 100 times larger than that of prior continual learning benchmarks for language modeling. We discover that recent DataComp-LM models trained on data from before 2023 have already become outdated, incurring up to 45% higher noun perplexity on 2024 Wikipedia articles than on pre-2023 articles. Further, we use our setup to evaluate the effectiveness of several large-scale continual learning methods and find that replaying older data is most effective for combating forgetting: on previously seen CC dumps, it can reduce the regret on held-out loss by 60% compared to other optimizer- and loss-based interventions. However, some domains evolve more quickly than others, favoring different trade-offs between mixing old and new data.
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12130