Embedding User-Generated Content using Structural Supervision and Generative Models

Published: 13 Dec 2023, Last Modified: 09 Aug 2024. Efficient Natural Language and Speech Processing (ENLSP-III) workshop at the 37th Conference on Neural Information Processing Systems (NeurIPS 2023). License: CC BY 4.0
Abstract: One well-studied way to avoid the need for vast amounts of human-labeled data is to use self-supervised training objectives during pre-training, which enables learning from completely unlabeled examples. Especially for larger models such as LLMs, these pre-training procedures have demonstrated benefits [Devlin et al., 2018]. In this work we focus on training LLMs to produce semantically expressive sentence embeddings for User-Generated Content (UGC) in comment-style mediums. We introduce a novel self-supervised training paradigm that leverages the structure of comment data, and we also demonstrate the efficacy of LLM generation for producing quality training data. Through empirical evaluation, we show improvements over existing baseline methods on several downstream tasks.
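The abstract does not specify the training objective, but one plausible instantiation of "leveraging the structure of comment data" is a contrastive objective in which a parent comment and its reply form a positive pair and other in-batch replies act as negatives. The sketch below is an illustrative assumption, not the paper's method; the encoder outputs, batch construction, and temperature are hypothetical placeholders.

# Minimal sketch (assumed, not from the paper): InfoNCE-style loss over
# parent-reply comment pairs, with in-batch negatives.
import torch
import torch.nn.functional as F

def structural_contrastive_loss(parent_emb: torch.Tensor,
                                reply_emb: torch.Tensor,
                                temperature: float = 0.05) -> torch.Tensor:
    """parent_emb, reply_emb: [batch, dim] embeddings of paired comments."""
    parent = F.normalize(parent_emb, dim=-1)
    reply = F.normalize(reply_emb, dim=-1)
    # Similarity of every parent against every reply in the batch.
    logits = parent @ reply.t() / temperature      # [batch, batch]
    targets = torch.arange(logits.size(0))         # diagonal entries are true pairs
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors standing in for an LLM encoder's outputs.
parents = torch.randn(8, 768)
replies = torch.randn(8, 768)
print(structural_contrastive_loss(parents, replies).item())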