Camel: Managing Data for Efficient Stream Learning

Yiming Li, Yanyan Shen, Lei Chen

Published: 2022, Last Modified: 12 May 2023, SIGMOD Conference 2022, Readers: Everyone

Abstract: Many real-world applications rely on predictive models that are incrementally learned online. Specifically, in a typical stream learning framework, models are updated with a single pass over continuously arriving data batches. However, this framework has three shortcomings: high training cost, low data effectiveness, and catastrophic forgetting. We describe Camel, a system that addresses these issues. Camel includes two independent data management components: coreset selection and buffer update. To accelerate model training, Camel selects a coreset from each streaming data batch for model update. Selecting a coreset with worst-case guarantees is NP-hard. To solve this problem, we reformulate coreset selection as a submodular maximization problem by deriving an upper bound on the objective function. To mitigate catastrophic forgetting, Camel maintains a buffer of past representative samples as new data arrive. Moreover, Camel quantizes numerical data in the buffer via a quantile sketch to reduce the memory footprint. Finally, extensive experiments validate the effectiveness and efficiency of Camel. In particular, our coreset selection algorithm can achieve a linear speedup with a marginal accuracy loss on redundant datasets. Furthermore, our buffer update algorithms can outperform state-of-the-art anti-forgetting methods on various data distributions.
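To make the submodular-maximization framing concrete, here is a minimal sketch of greedy coreset selection over one data batch. It is not Camel's algorithm: the abstract does not state the objective or the upper bound it derives, so this sketch substitutes a standard facility-location objective, and the function name `greedy_facility_location` and the shifted cosine similarity are assumptions made for illustration only.

```python
import numpy as np


def greedy_facility_location(features, k):
    """Greedily pick up to k row indices of `features` as a coreset.

    Assumed objective (not taken from the paper):
        f(S) = sum_i max_{j in S} sim(i, j)
    which is monotone submodular when sim >= 0.
    """
    features = np.asarray(features, dtype=float)
    # Cosine similarity shifted into [0, 1] so every gain is non-negative.
    norms = np.maximum(np.linalg.norm(features, axis=1, keepdims=True), 1e-12)
    normed = features / norms
    sim = (normed @ normed.T + 1.0) / 2.0

    n = sim.shape[0]
    best_cover = np.zeros(n)          # max similarity of each point to the chosen set
    chosen = np.zeros(n, dtype=bool)  # marks indices already selected
    selected = []

    for _ in range(min(k, n)):
        # Marginal gain of candidate j: total improvement in coverage over all points i.
        gains = np.maximum(sim - best_cover[:, None], 0.0).sum(axis=0)
        gains[chosen] = -np.inf       # never pick the same index twice
        j = int(np.argmax(gains))
        selected.append(j)
        chosen[j] = True
        best_cover = np.maximum(best_cover, sim[:, j])

    return selected


# A toy batch with two near-duplicate pairs: the coreset keeps one sample per pair.
batch = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
coreset = greedy_facility_location(batch, k=2)
```

On a redundant batch like this one, the greedy step skips near-duplicates because their marginal gain is close to zero once a similar sample has been chosen, which is the intuition behind training on a coreset instead of the full batch.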
