TL;DR: We propose a new, principled yet practical continual-learning method that combines the complementary benefits of function regularization, weight regularization, and experience replay.
Abstract: Regularization and experience replay are two popular continual-learning strategies with complementary strengths: regularization requires less memory, while replay can more accurately mimic batch training. But can we combine them to get provably better methods? Despite the simplicity of the question, little is known about optimal ways to combine them, and little has been done to find combinations that give provable improvements. In this paper, we present such a method by using a recently proposed principle of adaptation that relies on a faithful reconstruction of the gradients of the past data. Using this principle, we design a prior that combines two types of replay methods with a quadratic Bayesian weight-regularizer and achieves provably better gradient reconstructions. The combination improves performance on standard benchmarks such as Split CIFAR, Split TinyImageNet, and ImageNet-1000, often achieving >80% of the batch performance while using a memory of <10% of the past data. Our work shows that a good combination of replay and regularization-based methods can be very effective in reducing forgetting, and can sometimes even eliminate it completely.