Abstract: Self-learning agents in open-world environments often face data scarcity due to the diversity and unpredictability of real-world conditions. Traditional datasets over-represent common cases while under-representing rare but critical events, leading to biased learning and poor generalization. To address this, we propose dataset shaping. Dataset shaping uses generative models, such as multi-task diffusion models (MTDMs), to generate and refine synthetic data-label pairs through a feedback loop. By dynamically adjusting the data composition according to the performance of the self-learning agent's downstream task model (DSTM), generative models expose agents to diverse and challenging scenarios, which increases robustness and adaptability. Consequently, dataset shaping enhances generalization, particularly in applications such as autonomous driving and robotics, where reliable performance in novel conditions is essential.
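The feedback loop sketched in the abstract can be illustrated with a toy simulation. This is a minimal sketch, not the authors' method: the scenario names, the difficulty values, and the functions `generate_batch`, `evaluate_dstm`, and `shape_weights` are all hypothetical stand-ins for the MTDM, the DSTM evaluation, and the shaping step, respectively.

```python
import random

# Hypothetical scenarios with a hidden difficulty; in this toy model the
# "DSTM" accuracy on a scenario improves with how often it has been seen.
SCENARIOS = ["clear_road", "night_rain", "sensor_glare"]
DIFFICULTY = {"clear_road": 0.1, "night_rain": 0.6, "sensor_glare": 0.8}

def generate_batch(weights, batch_size=30):
    """Stand-in for the MTDM: draw synthetic samples according to the current mix."""
    return random.choices(SCENARIOS,
                          weights=[weights[s] for s in SCENARIOS],
                          k=batch_size)

def evaluate_dstm(seen_counts):
    """Toy DSTM accuracy per scenario: more exposure -> higher accuracy."""
    return {s: min(1.0, (1 - DIFFICULTY[s]) + 0.02 * seen_counts[s])
            for s in SCENARIOS}

def shape_weights(accuracy):
    """Feedback step: up-weight scenarios where the DSTM still underperforms."""
    err = {s: 1.0 - accuracy[s] + 1e-6 for s in SCENARIOS}
    total = sum(err.values())
    return {s: err[s] / total for s in SCENARIOS}

random.seed(0)
weights = {s: 1 / len(SCENARIOS) for s in SCENARIOS}
seen = {s: 0 for s in SCENARIOS}
for _ in range(5):  # a few shaping rounds
    for s in generate_batch(weights):
        seen[s] += 1
    weights = shape_weights(evaluate_dstm(seen))

# Rare/hard scenarios end up over-sampled relative to the easy one.
assert seen["night_rain"] + seen["sensor_glare"] > seen["clear_road"]
```

The design point is the closed loop: the generator's sampling distribution is re-derived each round from the downstream model's per-scenario error, so hard or rare conditions are automatically emphasized as long as the agent underperforms on them.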
External IDs: dblp:conf/arcs/NolteFPT25