TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models

30 May 2024 (modified: 13 Nov 2024) · Submitted to the NeurIPS 2024 Datasets and Benchmarks Track · CC BY 4.0
Keywords: large language model, continual learning, benchmark
Abstract: Aligned large language models (LLMs) demonstrate exceptional capabilities in task solving, instruction following, and safety. However, the continual learning behavior of these aligned LLMs has been largely overlooked. Existing continual learning benchmarks pose little challenge to leading aligned LLMs because of their homogeneous task types and low task complexity. To bridge this gap, we introduce TRACE, a benchmark designed to rigorously assess continual learning in LLMs. TRACE comprises eight challenging tasks spanning domain-specific knowledge, multilingual capabilities, code generation, and mathematical reasoning. In systematic experiments on TRACE with six aligned models ranging from 7B to 70B parameters, we observe significant declines in both general performance and instruction-following ability. For example, the accuracy of LLaMA2-Chat 13B on the GSM8K dataset dropped from 43.14% to 2.12% after training on our datasets. This highlights the difficulty of trading off performance on specific tasks against preserving the original capabilities of LLMs. Our results show that integrating task-specific cues with meta-rationales significantly reduces catastrophic forgetting and improves task convergence, offering a viable strategy for enhancing the adaptability of LLMs in dynamic environments.
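To make the sequential-training evaluation concrete, below is a minimal sketch of how forgetting in this kind of setup is commonly quantified: evaluate every task after each stage of sequential fine-tuning, then summarize the resulting accuracy matrix. The accuracy matrix, metric names, and numbers here are illustrative assumptions, not TRACE's released evaluation code.

```python
import numpy as np

def continual_learning_metrics(acc: np.ndarray):
    """Summarize a continual-learning run from an accuracy matrix.

    acc[i, j] = accuracy on task j after sequentially training on tasks 0..i.
    Shape: (T, T) for a sequence of T tasks.
    """
    T = acc.shape[0]
    # Overall performance: mean accuracy across all tasks after the final stage.
    overall = acc[T - 1].mean()
    # Backward transfer: average change on earlier tasks between the moment
    # each was learned and the end of training (negative values = forgetting).
    bwt = np.mean([acc[T - 1, j] - acc[j, j] for j in range(T - 1)])
    return overall, bwt

if __name__ == "__main__":
    # Hypothetical 3-task run in which the last column (a general-ability
    # probe) collapses as training progresses, mirroring the kind of drop
    # reported in the abstract.
    acc = np.array([
        [0.80, 0.30, 0.43],  # after training on task 0
        [0.55, 0.85, 0.20],  # after training on task 1
        [0.40, 0.60, 0.02],  # after training on task 2
    ])
    op, bwt = continual_learning_metrics(acc)
    print(f"overall performance: {op:.3f}, backward transfer: {bwt:.3f}")
```

A strongly negative backward transfer, as in this toy matrix, is the quantitative signature of the catastrophic forgetting the abstract describes.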
Supplementary Material: pdf
Submission Number: 2421
