TL;DR: We leverage policy gradients to directly optimize non-differentiable evaluation metrics when selecting instruction-tuning data for LLMs.
Abstract: Curating data for instruction tuning is crucial for enhancing the performance of large language models (LLMs). This work aims to select instruction-tuning data that improves LLM performance on specific target tasks. Existing methods often rely on next-token prediction (NTP) loss as a proxy for target-task performance because performance evaluation metrics are non-differentiable; they select the training data points that are most helpful in reducing validation loss. However, minimizing NTP loss and maximizing performance (e.g., pass rate in code generation) are not the same objective. To remedy this discrepancy, we introduce Non-differentiable evaluation metric-based InfluenCe Estimation (NICE), a novel method that leverages policy gradients to select the training data that improves performance. Moreover, NICE can perform data selection in the absence of labels (ground-truth responses) whenever the evaluation metric does not require them (e.g., a reward model can output reward scores without supervision from labels). Experimental results show that our approach outperforms existing data selection baselines that use NTP loss across diverse and realistic scenarios. Notably, subsets selected by NICE often produce models that outperform those trained on the full dataset. Our code is available at https://github.com/JTWang2000/NICE.
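The core idea admits a short sketch. The snippet below is a minimal illustration only (not the released implementation; see the linked repository for that), assuming a HuggingFace causal LM and a hypothetical `metric_fn` that scores a (prompt, response) pair, e.g., a code pass-rate checker or a reward model. It scores a candidate training example by how well a gradient step on its NTP loss aligns with a REINFORCE-style policy-gradient estimate of the evaluation metric's gradient on validation prompts:

```python
# Minimal sketch (assumptions: HuggingFace causal LM, hypothetical `metric_fn`
# that scores a (prompt, response) pair). Not the authors' implementation.
import torch
import torch.nn.functional as F


def response_log_prob(model, prompt_ids, response_ids):
    """Log-probability of a sampled response under the current policy,
    recomputed with teacher forcing so gradients can flow."""
    full = torch.cat([prompt_ids, response_ids], dim=1)
    logits = model(full).logits
    # Logits at position t predict token t+1; keep those covering the response.
    shift = logits[:, prompt_ids.shape[1] - 1 : -1, :]
    token_logps = F.log_softmax(shift, dim=-1).gather(
        -1, response_ids.unsqueeze(-1)
    ).squeeze(-1)
    return token_logps.sum()


def influence_score(model, tokenizer, train_text, val_prompts, metric_fn,
                    max_new_tokens=128):
    params = [p for p in model.parameters() if p.requires_grad]

    # (1) Gradient of the NTP loss on the candidate training example.
    enc = tokenizer(train_text, return_tensors="pt")
    ntp_loss = model(**enc, labels=enc["input_ids"]).loss
    train_grads = torch.autograd.grad(ntp_loss, params)

    # (2) Policy-gradient estimate on validation prompts (REINFORCE):
    #     grad E[R] = E[ R * grad log pi(y | x) ].
    #     In practice a baseline would reduce the variance of this estimate.
    pg_loss = 0.0
    for prompt in val_prompts:
        p_ids = tokenizer(prompt, return_tensors="pt").input_ids
        with torch.no_grad():  # sampling itself needs no gradients
            gen = model.generate(p_ids, do_sample=True,
                                 max_new_tokens=max_new_tokens)
        r_ids = gen[:, p_ids.shape[1]:]
        reward = metric_fn(prompt, tokenizer.decode(r_ids[0]))
        pg_loss = pg_loss - reward * response_log_prob(model, p_ids, r_ids)
    pg_grads = torch.autograd.grad(pg_loss / len(val_prompts), params)

    # (3) A descent step on this example (theta -= eta * train_grad) changes
    #     E[R] by about eta * <train_grad, pg_grad>; select the top scorers.
    return sum((gt * gp).sum() for gt, gp in zip(train_grads, pg_grads)).item()
```

The log-probabilities are recomputed with a teacher-forced forward pass because `generate` runs without gradient tracking; this is a standard pattern, not a detail taken from the paper.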
Lay Summary: Training large language models (LLMs) to follow instructions well depends on selecting the right training data. Existing methods often rely on proxy losses such as next-token prediction loss (how well a model predicts the next token). However, loss is not necessarily strongly correlated with performance on real-world tasks, like writing working code.
We introduce a new method called NICE that selects training data based on how much it improves actual task performance, as measured by evaluation metrics. It uses policy gradients (from reinforcement learning) to estimate which examples are most helpful for improving performance rather than merely reducing loss. Unlike prior methods, NICE can work even when labeled data isn't available, as long as the evaluation metric doesn't need ground-truth labels.
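As a concrete, hypothetical example of such a label-free metric, a pretrained reward model can play the role of the scorer: it rates a model's response to a prompt without needing a reference answer. The sketch below assumes the OpenAssistant DeBERTa reward model from the HuggingFace Hub; the model choice and the `reward_metric` name are illustrative, not part of NICE.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed (illustrative) reward model; any sequence-classification reward
# model that scores (prompt, response) pairs would work the same way.
rm_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
rm_tok = AutoTokenizer.from_pretrained(rm_name)
rm = AutoModelForSequenceClassification.from_pretrained(rm_name)


def reward_metric(prompt, response):
    # The reward model scores the pair directly; no ground-truth label needed.
    inputs = rm_tok(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return rm(**inputs).logits[0, 0].item()
```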
Experiments show that NICE consistently outperforms existing approaches and often produces better models with less data, demonstrating that smart data selection matters.
Primary Area: Deep Learning->Large Language Models
Keywords: data selection, data curation, instruction tuning, large language models
Submission Number: 11197