Student Lead Author Indication: No
Keywords: conversational AI, large language models, natural conversation, conversation analysis
TL;DR: We demonstrate a method for scoring documents from 0 to 5 on how conversational they are, so that natural conversation data can be found at scale.
Abstract: In principle, large language models could converse more like humans naturally do if they were trained on data containing the interaction patterns of human conversation. However, one challenge in training a conversational model is that natural conversation data are relatively difficult to find. In this paper, we demonstrate a method for annotating documents at scale with a 0-5 conversation score. We first use a large language model to score a sample of documents for how conversational they are. Using the annotated sample, we then train Snowflake-arctic-embed with a classification head that outputs a single regression score from 0 to 5 for conversation rating. When converted to a binary classifier using a score threshold of 4, the model achieves a precision of 94%. Our conversation score approach has significant implications for data preparation in generative AI, particularly for data annotation, filtering, and quality control.
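The thresholding step described in the abstract can be illustrated with a minimal sketch. It assumes that a document counts as conversational when its score is greater than or equal to the threshold (the abstract does not say whether the comparison is inclusive), and the function names `to_binary` and `precision` as well as the toy scores and labels are hypothetical, not taken from the paper.

```python
from typing import List, Sequence


def to_binary(scores: Sequence[float], threshold: float = 4.0) -> List[bool]:
    """Turn 0-5 regression scores into binary 'conversational' predictions.

    Assumption: a score at or above the threshold counts as positive.
    """
    return [score >= threshold for score in scores]


def precision(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Precision = true positives / (true positives + false positives)."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    false_pos = sum(1 for p, a in zip(predicted, actual) if p and not a)
    if true_pos + false_pos == 0:
        return 0.0  # no positive predictions, so precision is undefined; report 0
    return true_pos / (true_pos + false_pos)


# Toy data for illustration only (not results from the paper).
toy_scores = [4.6, 4.1, 3.9, 0.7, 4.3, 2.2]
toy_labels = [True, True, True, False, False, False]

toy_predictions = to_binary(toy_scores, threshold=4.0)
toy_precision = precision(toy_predictions, toy_labels)
```

A higher threshold typically trades recall for precision, which suits data filtering, where keeping a non-conversational document is usually costlier than discarding a conversational one.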
Submission Number: 4