Keywords: large language models, data selection, data rating
Abstract: The quality of training data is crucial for the performance of large language models (LLMs). Recent studies utilize LLMs to rate and select data based on scores from a small set of human-designed metrics (rules). However, existing rule-based methods often rely heavily on human heuristics, lack robust metrics for rule evaluation, and exhibit limited adaptability to new tasks. We propose a novel rule-based framework that leverages the orthogonality of the score vectors corresponding to rules as a unique metric for rule evaluation. Our method employs an automated pipeline that first uses LLMs to generate a diverse set of rules covering a wide range of rating aspects. It then rates a batch of data according to these rules and applies a determinantal point process (DPP) from random matrix theory to select the most orthogonal score vectors, effectively isolating a subset of independent rules. These rules are then applied to rate all data, and the samples with the highest average scores are selected for downstream tasks such as LLM training. We validate our method through two experimental setups: 1) comparison against ground-truth ratings and 2) benchmarking LLMs trained with the selected data. Our extensive experiments span various settings, including general pre-training and domain-specific fine-tuning in domains such as IMDB, Medical, Math, and Code. The results show that our DPP rule-based rating method consistently outperforms other methods, such as rating without rules, uniform sampling, importance resampling, and QuRating, in terms of both rating accuracy and model performance.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4110