Scene-adaptive Knowledge Distillation for Sequential Recommendation via Differentiable Architecture Search

Published: 28 Oct 2023, Last Modified: 23 Nov 2023, WANT@NeurIPS 2023 Poster
Keywords: Sequential recommendation, Knowledge distillation, Neural architecture search, Earth mover's distance
TL;DR: We propose AdaRec, a knowledge distillation (KD) framework that adaptively compresses the knowledge of a teacher model into a lightweight student model according to the recommendation scene, using differentiable neural architecture search (NAS).
Abstract: Sequential recommender systems (SRS) have become a research hotspot owing to their power in modeling dynamic user interests and sequential behavioral patterns. To maximize expressive ability, the default choice is a larger and deeper network architecture, which, however, often incurs high network latency when generating online recommendations. We therefore argue that compressing heavy recommendation models into middle- or light-weight neural networks, reducing inference latency while maintaining recommendation performance, is of great importance for practical production systems. To this end, we propose AdaRec, a knowledge distillation (KD) framework that compresses the knowledge of a teacher model into a student model adaptively, according to the recommendation scene, via differentiable neural architecture search (NAS). Specifically, we introduce a target-oriented knowledge distillation loss that guides the search for the student network architecture, together with a cost-sensitive loss that constrains model size, achieving a superior trade-off between recommendation effectiveness and efficiency. In addition, we leverage the earth mover's distance (EMD) to realize many-to-many layer mapping during knowledge distillation, which enables each intermediate student layer to learn adaptively from the intermediate teacher layers. Extensive experiments on three real-world recommendation datasets demonstrate that our model achieves significantly better accuracy, with notable inference speedup, than strong counterparts, while discovering diverse architectures for sequential recommendation models under different recommendation scenes.
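
To make the search objective concrete, below is a minimal PyTorch sketch of how a target-oriented KD loss could be combined with a cost-sensitive size penalty over DARTS-style architecture weights. The exact loss forms and all names (`alpha`, `op_costs`, `lam`) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumed, not the paper's code): a soft-label KD loss
# plus an expected-model-size penalty, differentiable in the DARTS-style
# architecture weights `alpha`.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, tau=2.0):
    # Soft-label distillation: KL between temperature-scaled distributions.
    p_t = F.softmax(teacher_logits / tau, dim=-1)
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * tau ** 2

def expected_cost(alpha, op_costs):
    # Cost-sensitive term: per-op size, weighted by the softmaxed
    # architecture distribution on each edge of the search space.
    weights = F.softmax(alpha, dim=-1)            # [num_edges, num_ops]
    return (weights * op_costs).sum()

def search_loss(student_logits, teacher_logits, alpha, op_costs, lam=1e-3):
    return kd_loss(student_logits, teacher_logits) + lam * expected_cost(alpha, op_costs)

# Toy usage: 4 edges, 3 candidate ops with relative sizes (hypothetical).
alpha = torch.zeros(4, 3, requires_grad=True)
op_costs = torch.tensor([1.0, 2.5, 0.1]).expand(4, 3)
s, t = torch.randn(8, 100), torch.randn(8, 100)
search_loss(s, t, alpha, op_costs).backward()
```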
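Similarly, the EMD-based many-to-many layer mapping can be sketched as an optimal-transport plan over pairwise layer distances. The Sinkhorn iteration below is an entropy-regularized stand-in for an exact EMD solver, and the uniform layer marginals and equal student/teacher hidden sizes are simplifying assumptions, not details taken from the paper.

```python
# Hedged sketch: pairwise distances between pooled student and teacher
# hidden states form a cost matrix; a transport plan over that matrix
# weights each student-teacher layer pair (many-to-many mapping).
import torch

def sinkhorn(cost, a, b, eps=0.1, iters=100):
    # Entropy-regularized optimal transport between marginals a, b.
    K = torch.exp(-cost / eps)
    u = torch.ones_like(a)
    for _ in range(iters):
        v = b / (K.t() @ u)
        u = a / (K @ v)
    return u.unsqueeze(1) * K * v.unsqueeze(0)    # transport plan

def emd_layer_loss(student_hiddens, teacher_hiddens):
    # Mean-pool each layer's [batch, seq, dim] states to a [dim] vector;
    # assumes equal hidden sizes (a linear projection would be needed otherwise).
    S = torch.stack([h.mean(dim=(0, 1)) for h in student_hiddens])   # [Ls, d]
    T = torch.stack([h.mean(dim=(0, 1)) for h in teacher_hiddens])   # [Lt, d]
    cost = torch.cdist(S, T, p=2)                                    # [Ls, Lt]
    a = torch.full((len(student_hiddens),), 1.0 / len(student_hiddens))
    b = torch.full((len(teacher_hiddens),), 1.0 / len(teacher_hiddens))
    # Normalize the (detached) cost so the Sinkhorn kernel stays well scaled.
    plan = sinkhorn(cost.detach() / cost.detach().max(), a, b)
    return (plan * cost).sum()    # transport-weighted many-to-many loss

# Toy usage: 3 student layers learn from 6 teacher layers, hidden dim 64.
student_hiddens = [torch.randn(8, 20, 64, requires_grad=True) for _ in range(3)]
teacher_hiddens = [torch.randn(8, 20, 64) for _ in range(6)]
emd_layer_loss(student_hiddens, teacher_hiddens).backward()
```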
Submission Number: 3