QuRating: Selecting High-Quality Data for Training Language Models

ICLR 2024 Workshop DMLR Submission 66 Authors

Published: 04 Mar 2024, Last Modified: 02 May 2024. Venue: DMLR @ ICLR 2024. License: CC BY 4.0
Keywords: language models, data selection
TL;DR: We select LM training data based on qualitative criteria of text.
Abstract: Selecting high-quality pre-training data is important for creating capable language models, but existing methods rely on simple heuristics. We introduce QuRating, a method for selecting pre-training data that captures the abstract qualities of texts that humans intuitively perceive. In this paper, we investigate four qualities: *writing style*, *required expertise*, *facts & trivia*, and *educational value*. We employ LLMs to discern these qualities and obtain more reliable judgments by prompting for pairwise comparisons between texts. We train a QuRater model to learn scalar ratings from pairwise judgments, and use it to annotate a 260B-token training corpus with fine-grained quality ratings. In our experiments, we sample 30B tokens according to the different quality ratings and train 1.3B-parameter language models on the selected data. We find that it is important to balance quality and diversity when selecting data, and with appropriate sampling, our models achieve lower perplexity and stronger in-context learning performance than baselines. Beyond data selection, we use quality ratings to construct curricula that improve performance without changing the training dataset. We present an extensive analysis of the characteristics and biases of the quality ratings. To encourage further research, we release our prompts, models, and annotated data (QuRatedPajama).
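The abstract describes two core mechanisms: learning scalar quality ratings from pairwise LLM judgments, and sampling training documents according to those ratings while balancing quality and diversity. The sketch below illustrates both with a Bradley-Terry-style pairwise objective and temperature-controlled sampling; the function names, loss form, and sampling details are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def pairwise_rating_loss(score_a: torch.Tensor,
                         score_b: torch.Tensor,
                         prob_a_preferred: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style objective (assumed form): the sigmoid of the
    scalar score difference should match the LLM's soft preference for
    text A over text B from the pairwise comparison prompt."""
    return F.binary_cross_entropy_with_logits(score_a - score_b, prob_a_preferred)

def sample_by_quality(ratings: torch.Tensor, num_docs: int,
                      temperature: float = 1.0) -> torch.Tensor:
    """Quality-weighted sampling (assumed form): draw documents with
    probability proportional to exp(rating / temperature). Higher
    temperatures trade top-rated quality for greater diversity."""
    probs = torch.softmax(ratings / temperature, dim=0)
    return torch.multinomial(probs, num_docs, replacement=False)

# Toy usage: ratings for five documents, select two of them without replacement.
if __name__ == "__main__":
    doc_ratings = torch.tensor([0.2, 1.5, -0.3, 0.9, 2.1])
    selected = sample_by_quality(doc_ratings, num_docs=2, temperature=1.0)
    print(selected)
```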
Primary Subject Area: Role of data in foundation models: pre-training, prompting, fine-tuning
Paper Type: Research paper: up to 8 pages
Participation Mode: In-person
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Submission Number: 66