Rule-Based Rating and Selection of LLM Training Data

Published: 06 Mar 2025 · Last Modified: 07 Mar 2025 · ICLR 2025 Workshop on Data Problems (Poster) · CC BY 4.0
Keywords: Data selection, data quality, LLM-as-a-judge, determinantal point process
TL;DR: Using determinantal point processes to select orthogonal rules for rule-based data selection
Abstract: The quality of training data is crucial for the performance of large language models (LLMs). Recent studies utilize LLMs to rate and select data based on scores from a small set of human-designed metrics (rules). However, existing rule-based methods often rely too heavily on human heuristics, lack robust metrics for rule evaluation, and exhibit limited adaptability to new tasks. In this paper, we propose a novel rule-based framework that leverages the orthogonality of score vectors corresponding to rules as a unique metric for rule evaluation. Our method employs an automated pipeline that first uses LLMs to generate a rule set covering a wide range of data quality aspects. It then rates a batch of data according to these rules and applies the determinantal point process (DPP) from random matrix theory to select the most independent (orthogonal) rules. These rules are then applied to rate all data, and the samples with the highest average scores are selected for downstream tasks such as LLM fine-tuning. We validate our method through two experimental setups: 1) comparing against ground-truth ratings and 2) benchmarking LLMs trained with the selected data. Our extensive experiments span various settings, including fine-tuning LLMs on the IMDB, Medical, Math, and Code domains. The results show that our DPP rule-based rating method consistently outperforms other baselines in terms of both rating accuracy and benchmark performance.
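The selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes rule scores are already available as a matrix (in the paper they would come from an LLM judge), builds a cosine-similarity kernel over the rules' score vectors, and uses a standard greedy MAP approximation to the DPP to pick a near-orthogonal rule subset; the chosen rules' average scores then rank the data. All names and sizes here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 10 candidate rules, each rating 100 samples in [0, 1].
# In the paper's pipeline these scores would be produced by an LLM judge;
# here they are random stand-ins.
n_rules, n_samples = 10, 100
scores = rng.random((n_rules, n_samples))

# Normalize each rule's score vector so kernel entries are cosine similarities.
V = scores / np.linalg.norm(scores, axis=1, keepdims=True)
L = V @ V.T  # PSD DPP kernel: det(L[S, S]) is large when rows are near-orthogonal


def greedy_dpp_select(L, k):
    """Greedy MAP approximation: grow the subset S to maximize det(L[S, S])."""
    selected = []
    for _ in range(k):
        best, best_det = None, -np.inf
        for j in range(L.shape[0]):
            if j in selected:
                continue
            idx = selected + [j]
            d = np.linalg.det(L[np.ix_(idx, idx)])
            if d > best_det:
                best, best_det = j, d
        selected.append(best)
    return selected


# Select 4 near-orthogonal rules, then rank all samples by their average score.
rules = greedy_dpp_select(L, k=4)
avg = scores[rules].mean(axis=0)
top = np.argsort(avg)[::-1][:20]  # indices of the 20 best-rated samples
```

The greedy determinant maximization is a common stand-in for exact DPP MAP inference, which is NP-hard; in practice faster incremental Cholesky updates replace the repeated `det` calls.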
Submission Number: 6
