Keywords: large language models, data selection, data rating
Abstract: The quality of training data is crucial for the performance of large language models (LLMs). Recent studies utilize LLMs to rate and select data based on scores from a small set of human-designed metrics (rules). However, existing rule-based methods often rely heavily on human heuristics, lack robust metrics for rule evaluation, and exhibit limited adaptability to new tasks. We propose a novel rule-based framework that leverages the orthogonality of the score vectors corresponding to rules as a unique metric for rule evaluation. Our method employs an automated pipeline that first uses LLMs to generate a diverse set of rules covering a wide range of rating aspects. It then rates a batch of data according to these rules and applies a determinantal point process (DPP) from random matrix theory to select the most orthogonal score vectors, effectively isolating a subset of independent rules. These rules are then applied to rate all data, and the samples with the highest average scores are selected for downstream tasks such as LLM training. We validate our method through two experimental setups: 1) comparison against ground-truth ratings and 2) benchmarking LLMs trained with the selected data. Our extensive experiments span various settings, including general pre-training and domain-specific fine-tuning in domains such as IMDB, Medical, Math, and Code. The results show that our DPP rule-based rating method consistently outperforms other methods, such as rating without rules, uniform sampling, importance resampling, and QuRating, in terms of both rating accuracy and model performance.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4110