FineWeb-Conv: A Method for Finding Good Conversation Data

Published: 13 Dec 2024, Last Modified: 19 Feb 2025
Venue: Good-Data
License: CC BY 4.0
Student Lead Author Indication: No
Keywords: conversational AI, large language models, natural conversation, conversation analysis
TL;DR: This paper demonstrates a method for finding good conversation data.
Abstract: In principle, large language models could converse more like humans if they were trained on data containing the interaction patterns of human conversation. One challenge in training such a model, however, is that natural conversation data are relatively difficult to find. In this paper, we demonstrate a method for annotating documents at scale with a conversation score from 0 to 5. We first use a large language model to score a sample of documents for how conversational they are. Using the annotated sample, we then train Snowflake-arctic-embed with a classification head that outputs a single regression score from 0 to 5 as the conversation rating. When converted to a binary classifier using a score threshold of 4, the model achieves a precision of 94%. This conversation-score approach has implications for data preparation in generative AI, particularly for data annotation, filtering, and quality control.
Submission Number: 4
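
The thresholding step described in the abstract can be illustrated with a minimal sketch. It assumes the regression output is clipped to the 0-5 range and that a document counts as conversational when its score is at or above the threshold; the abstract does not say whether the comparison is inclusive, so that choice is an assumption. The function names and the toy scores and labels below are hypothetical and are not the paper's data or results.

```python
def binarize(scores, threshold=4.0):
    """Turn 0-5 regression scores into binary 'conversational' predictions.

    Assumption: scores are clipped to [0, 5] and the threshold is inclusive.
    """
    clipped = [min(5.0, max(0.0, s)) for s in scores]
    return [s >= threshold for s in clipped]


def precision(predicted, actual):
    """Precision = true positives / (true positives + false positives)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    if tp + fp == 0:
        return None  # undefined when nothing is predicted positive
    return tp / (tp + fp)


# Toy values for illustration only (not taken from the paper).
model_scores = [4.6, 3.9, 4.1, 0.7, 4.0, 2.5]
reference_labels = [True, False, False, False, True, False]

predicted_labels = binarize(model_scores, threshold=4.0)
toy_precision = precision(predicted_labels, reference_labels)
print(predicted_labels, toy_precision)
```

Raising the threshold generally trades recall for precision, which is why a high cut-off such as 4 suits filtering tasks where false positives are costly.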