Abstract: One well-studied solution to avoid the need for vast amounts of human-labeled
data is to use self-supervised training objectives during pre-training, which enables
learning on completely unlabeled examples. Especially in the case of larger
models such as LLMs, these pre-training procedures have demonstrated benefits
[Devlin et al., 2018]. In this work, we focus on training LLMs to produce
semantically expressive sentence embeddings for User-Generated Content (UGC)
in comment-style media. We propose a novel self-supervised training paradigm
that leverages the structure of comment data, and we demonstrate the efficacy of
LLM generation for producing high-quality training data. Through empirical evaluation,
we show improvements over existing baseline methods on several downstream
tasks.