Keywords: large language model, continual learning, benchmark
Abstract: Aligned large language models (LLMs) demonstrate exceptional capabilities in task solving, instruction following, and safety. However, the continual learning aspect of these aligned LLMs has been largely overlooked. Existing continual learning benchmarks pose insufficient challenge for leading aligned LLMs because of their homogeneous task types and low task complexity. To bridge this gap, we introduce TRACE, a benchmark designed to rigorously assess continual learning capabilities in LLMs. TRACE comprises eight challenging tasks spanning domain-specific tasks, multilingual capabilities, code generation, and mathematical reasoning. Through systematic experiments on TRACE with six aligned models ranging in size from 7B to 70B parameters, we observe significant declines in both general performance and instruction-following ability. For example, the accuracy of llama2-chat 13B on the gsm8k dataset drops precipitously from 43.14\% to 2.12\% after training on our datasets. This highlights the challenge of balancing performance on specific tasks against preserving the original prowess of LLMs. Our results demonstrate that integrating task-specific cues with meta-rationales significantly reduces catastrophic forgetting and improves task convergence, offering a viable strategy for enhancing the adaptability of LLMs in dynamic environments.
Supplementary Material: pdf
Submission Number: 2421