Abstract: While norm-based and leverage-score-based methods have been extensively studied for identifying "important" data points in linear models, analogous tools for nonlinear models remain significantly underdeveloped. By introducing the concept of the adjoint operator of a nonlinear map, we address this gap and generalize norm-based and leverage-score-based importance sampling to nonlinear settings. We demonstrate that sampling based on these generalized notions of norm and leverage scores provides approximation guarantees for the underlying nonlinear mapping, similar to linear subspace embeddings. As direct applications, these nonlinear scores not only reduce the computational complexity of training nonlinear models by enabling efficient sampling over large datasets but also offer a novel mechanism for model explainability and outlier detection. Our contributions are supported by both theoretical analyses and experimental results across a variety of supervised learning scenarios.
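The abstract builds on the classical linear baseline: sampling rows of a design matrix with probability proportional to their leverage scores yields a small sketch whose least-squares solution approximates the full one. As a hedged illustration of that baseline (not the paper's nonlinear method), a minimal sketch in NumPy; the function names `leverage_scores` and `sample_rows` are our own for illustration:

```python
import numpy as np

def leverage_scores(A):
    # Row leverage scores: squared row norms of the left singular vectors of A.
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return np.sum(U**2, axis=1)

def sample_rows(A, b, m, rng=None):
    # Sample m rows with probability proportional to leverage scores,
    # rescaling each kept row by 1/sqrt(m * p_i) so the sketch is unbiased.
    rng = np.random.default_rng(rng)
    p = leverage_scores(A)
    p = p / p.sum()
    idx = rng.choice(A.shape[0], size=m, replace=True, p=p)
    scale = 1.0 / np.sqrt(m * p[idx])
    return A[idx] * scale[:, None], b[idx] * scale

# Compare the least-squares solution on the full data vs. a 10x-smaller sketch.
rng = np.random.default_rng(0)
A = rng.standard_normal((5000, 10))
b = A @ rng.standard_normal(10) + 0.1 * rng.standard_normal(5000)
x_full, *_ = np.linalg.lstsq(A, b, rcond=None)
A_s, b_s = sample_rows(A, b, 500, rng=1)
x_sketch, *_ = np.linalg.lstsq(A_s, b_s, rcond=None)
print(np.linalg.norm(x_full - x_sketch))  # small: sketch preserves the solution
```

The paper's contribution is to extend this kind of guarantee beyond the linear map `x ↦ Ax`, via an adjoint operator for nonlinear maps.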
Lay Summary: Modern AI systems often rely on enormous datasets, which can dramatically increase computational demands—leading to higher costs, more energy use, and technical challenges. One way to tackle this is by shrinking these datasets: keeping only the most "important" data points to reduce cost and speed up training. The problem? Existing methods for finding these key samples only work well for simple models—not for the complex, nonlinear models like neural networks that power today’s AI in areas like image recognition, language processing, and healthcare.
We’ve developed a new mathematical framework that brings these sampling ideas to nonlinear models, and backs them up with strong theoretical guarantees. Think of it like summarizing a long book by keeping only the chapters that carry the core message—our method does this with data, even for models that learn complicated patterns. By examining how each data point influences the model’s learning, we can identify the ones that matter most, even in challenging scenarios like rare disease detection or high-resolution image analysis.
This advancement means AI systems can train faster and more cheaply while still performing at top levels. It can help cut the cost of data labeling, lower energy usage, and even improve how we understand model decisions—making AI systems more transparent and trustworthy. By focusing on data quality over quantity, this work takes a major step toward making AI more accessible, efficient, and sustainable.
Link To Code: https://github.com/prprakash02/Importance-Sampling-for-Nonlinear-Models
Primary Area: General Machine Learning
Keywords: Importance Sampling, Nonlinear Adjoint, Leverage Scores, Active Learning
Submission Number: 5630