Learning Large-scale Universal User Representation with Sparse Mixture of ExpertsDownload PDF

26 May 2022, 20:09 (modified: 23 Jul 2022, 02:24)ICML 2022 Pre-training WorkshopReaders: Everyone
Keywords: user representation, mixture of expert, transformer
Abstract: Learning user sequence behaviour embedding is very sophisticated and challenging due to the complicate feature interaction over time and high dimension of user features. Recent emerging foundation models \textit{e}.\textit{g}. BERT and its variants, encourage a large body of researchers to investigate in this field. However, unlike natural language processing(NLP) tasks, the parameters of user behaviour model comes mostly from user embedding layer which makes most existing works fail to train an universal user embedding at large scale. Furthermore, user representations are learned from multiple downstream tasks, the past research did not address the seesaw phenomenon.In this paper, we propose SUPERMOE, a generic framework for obtain high quality user representation from multiple tasks. Specifically, the user behaviour sequences are encoded by MoE transformer, thus we can improve the model capacity to billions of parameters even trillions. In order to deal with seesaw phenomenon when learning across multiple tasks, we design a new loss function with task indicators. We perform extensive offline experiments on public datasets and online experiments on private real world business scenarios. Our approach achieves best performance over state-of-art models, the results demonstrate the effectiveness of our user behaviour representation framework using MOE transformer.
0 Replies