Data Synthesis with Influence Rewarded Models

Published: 28 Apr 2026, Last Modified: 28 Apr 2026 | MSLD 2026 Poster | Everyone | CC BY 4.0
Keywords: data synthesis, reinforcement learning
TL;DR: We train models with GRPO using influence functions as rewards to improve the sample efficiency of data synthesis pipelines.
Abstract: Data quality is the new bottleneck in developing capable, competitive models. We identify two issues. First, there are no clear guidelines for generating good-quality data: is it a matter of format, topics covered, length, or non-redundancy? Previous works use rejection sampling, generating a large pool of samples and then filtering out the low-quality ones, which is wasteful because many generated samples are discarded. Second, previous works rely on larger, closed-source models to extract model weaknesses, necessary skills, or a curriculum on which to base data generation. This is uninterpretable, because nothing explains why the larger model generated a particular sample. Influence functions provide a model-centric signal of data quality: they estimate the concrete impact a sample will have on model learning. To improve the sample efficiency of generating high-quality data, we propose training models to generate such data using influence rewards during GRPO fine-tuning. This is still a work in progress, so we provide only preliminary results and analysis. So far, we see that models are able to generate samples that look good on the surface but contain minor mistakes that need to be fixed.
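As a rough illustration of how an influence signal could serve as a GRPO reward, the sketch below scores each candidate sample and then normalizes the scores within a group. It is not the authors' implementation: the abstract does not specify an influence estimator, so this assumes a first-order proxy (the dot product between a candidate's loss gradient and the mean validation-loss gradient, with the Hessian treated as the identity) on a toy logistic-regression model. All function names and the toy data are hypothetical.

```python
import math

def grad_logistic(w, x, y):
    """Gradient of the logistic loss for one example (x, y), with y in {0, 1}."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * xi for xi in x]

def influence_reward(w, candidate, val_set):
    """First-order influence proxy (assumption: Hessian ~ identity).

    Returns the dot product of the candidate's gradient with the mean
    validation gradient. A positive value means a gradient step on the
    candidate is expected to reduce the validation loss.
    """
    g_c = grad_logistic(w, *candidate)
    g_v = [0.0] * len(w)
    for x, y in val_set:
        g = grad_logistic(w, x, y)
        g_v = [a + b / len(val_set) for a, b in zip(g_v, g)]
    return sum(a * b for a, b in zip(g_c, g_v))

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# Toy setup: the validation set says "positive first feature => label 1".
w = [0.0, 0.0]
val_set = [([1.0, 0.0], 1), ([-1.0, 0.0], 0)]

# One "group" of generated candidates for the same prompt (hypothetical data).
group = [
    ([1.0, 0.0], 1),  # agrees with the validation signal
    ([1.0, 0.0], 0),  # contradicts it (a mislabeled sample)
    ([0.0, 1.0], 1),  # orthogonal to it
]

rewards = [influence_reward(w, c, val_set) for c in group]
advantages = group_relative_advantages(rewards)
```

In this toy group, the sample that agrees with the validation signal receives a positive advantage and the mislabeled one a negative advantage, so a policy-gradient update would shift the generator toward samples that are expected to help the downstream model.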
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 49