Abstract: We propose TAROT, a targeted data selection framework grounded in Optimal Transport theory. Previous targeted data selection methods primarily rely on influence-based greedy heuristics to enhance domain-specific performance. While effective on limited, unimodal data (i.e., data following a single pattern), these methods struggle as target data complexity increases. Specifically, in multimodal distributions, such heuristics fail to account for multiple inherent patterns, leading to suboptimal data selection. This work identifies two primary limitations: (i) the disproportionate impact of dominant feature components in high-dimensional influence estimation, and (ii) the restrictive linear additive assumptions in greedy selection strategies. To address these challenges, TAROT incorporates whitened feature distance to mitigate dominant feature bias, offering a more reliable measure of data influence. Building on this, TAROT leverages whitened feature distance to quantify and minimize the optimal transport distance between selected data and target domains. Notably, this minimization also facilitates the estimation of optimal selection ratios. We evaluate TAROT across multiple tasks, including semantic segmentation, motion prediction, and instruction tuning. Results consistently show that TAROT outperforms state-of-the-art methods, demonstrating its versatility across various deep learning tasks. Code is available at: https://github.com/vita-epfl/TAROT.
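The two ingredients named in the abstract, a whitened feature distance and an optimal transport distance between candidate and target data, can be illustrated with a minimal NumPy sketch. This is not the TAROT implementation (see the linked repository for that). The function names, the ZCA-style whitening, the entropic (Sinkhorn) approximation of optimal transport, and the random toy features are all assumptions chosen for illustration.

```python
import numpy as np

def fit_whitening(features, eps=1e-6):
    """Estimate a ZCA-style whitening transform (illustrative assumption)."""
    mean = features.mean(axis=0, keepdims=True)
    centered = features - mean
    cov = centered.T @ centered / max(len(features) - 1, 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rescale each principal direction to unit variance.
    transform = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return mean, transform

def whitened_distance(candidates, targets, mean, transform):
    """Pairwise Euclidean distance after whitening both sets."""
    a = (candidates - mean) @ transform
    b = (targets - mean) @ transform
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def sinkhorn_cost(cost, reg=0.05, n_iters=200):
    """Entropic approximation of the OT cost between two uniform point sets.

    `reg` is relative to the largest entry of `cost` (hypothetical default).
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass on candidates
    b = np.full(m, 1.0 / m)  # uniform mass on targets
    kernel = np.exp(-(cost / cost.max()) / reg)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    plan = u[:, None] * kernel * v[None, :]
    return float((plan * cost).sum()), plan

# Toy features in which the first dimension dominates raw distances.
rng = np.random.default_rng(0)
scales = np.array([100.0, 1.0, 1.0])
pool = rng.normal(size=(200, 3)) * scales
target = rng.normal(size=(20, 3)) * scales + np.array([0.0, 2.0, 0.0])

mean, transform = fit_whitening(pool)
dist = whitened_distance(pool, target, mean, transform)
ot_cost, plan = sinkhorn_cost(dist)
print(f"approximate OT cost in whitened space: {ot_cost:.3f}")
```

Whitening gives every principal direction of the candidate pool unit variance, so the high-variance first dimension no longer dominates the distance. The transport plan then spreads target mass over the whole pool instead of scoring candidates one at a time.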
Lay Summary: Machine learning models are trained on enormous amounts of data, but not all data is equally useful: some examples matter far more than others for helping a model learn a specific task well. Our work introduces a new method called TAROT that chooses the right training data so that models can be trained more efficiently and effectively.
Imagine trying to teach someone to drive in a new city. Instead of showing them every possible street from every city, it’s smarter to show them the most relevant examples that reflect the roads and conditions in that particular city. TAROT does exactly this for machine learning: it selects the most relevant examples from a large pool to match a specific target task or domain.
What makes TAROT unique is that it uses a mathematical tool called “optimal transport” to better understand which data examples are most aligned with the target task. This approach avoids common pitfalls of previous methods that often missed important patterns, especially when the target data is complex.
We tested TAROT on a variety of problems, ranging from understanding images and predicting vehicle movements to fine-tuning large language models, and found that it not only improved performance but also reduced the amount of data needed. This could make machine learning faster, more efficient, and more accessible to those with limited resources.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/vita-epfl/TAROT
Primary Area: Applications
Keywords: Data Attribution, Data Selection, Optimal Transport
Submission Number: 12283