Abstract: Parameter estimation is central to scientific inference, yet standard data collection practices, such as random sampling, often yield inefficient or suboptimal results when data are noisy, imbalanced, or expensive to obtain. In such settings, not all samples contribute equally to inference, which motivates principled methods for identifying and prioritizing the most informative data. We propose a data valuation framework based on Fisher information that quantifies each sample's contribution to the precision of parameter estimates. Unlike active learning driven by predictive performance, our method explicitly targets inference precision rather than predictive generalization. By incorporating an adjusted Fisher information metric, the framework naturally accounts for measurement noise and heteroscedasticity, assigning higher value to samples that most effectively reduce estimator variance. We provide theoretical guarantees for both linear and logistic regression, demonstrating faster convergence than the CoreSet and BAIT approaches, with gains that scale logarithmically with the unlabeled pool size. Extensions to multivariate and non-Gaussian settings further show that parameter-focused data valuation offers a principled, efficient strategy for subset selection, prioritizing the most informative observations under realistic, high-noise scientific conditions.
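To make the idea of noise-aware, Fisher-information-based valuation concrete, the following is a minimal sketch for heteroscedastic linear regression. It is not the paper's method: the abstract does not define the adjusted Fisher information metric, so the sketch assumes the textbook per-sample information x_i x_i^T / sigma_i^2 and scores each candidate by its log-determinant (D-optimal) gain. The names `sample_values`, `greedy_select`, `noise_var`, and `ridge` are hypothetical and chosen for illustration only.

```python
import numpy as np


def sample_values(X, noise_var, info_inv):
    """Log-det gain of adding each sample to the current information matrix.

    Assumption (not from the paper): sample i carries Fisher information
    x_i x_i^T / noise_var[i], so its marginal D-optimal gain is
    log(1 + x_i^T A^{-1} x_i / noise_var[i]), where A is the current
    information matrix and info_inv is its inverse.
    """
    quad = np.einsum("ij,jk,ik->i", X, info_inv, X)  # x_i^T A^{-1} x_i per row
    return np.log1p(quad / noise_var)


def greedy_select(X, noise_var, budget, ridge=1e-3):
    """Greedily pick `budget` samples with the largest marginal gain."""
    n, d = X.shape
    info_inv = np.eye(d) / ridge  # inverse of the ridge prior A = ridge * I
    available = np.ones(n, dtype=bool)
    chosen = []
    for _ in range(budget):
        values = sample_values(X, noise_var, info_inv)
        values[~available] = -np.inf  # never pick the same sample twice
        best = int(np.argmax(values))
        chosen.append(best)
        available[best] = False
        # Sherman-Morrison update of A^{-1} after adding x x^T / sigma^2
        x = X[best]
        Ax = info_inv @ x
        info_inv = info_inv - np.outer(Ax, Ax) / (noise_var[best] + x @ Ax)
    return chosen


# Illustrative pool: 200 samples in 3 dimensions, half of them very noisy.
rng = np.random.default_rng(0)
X_pool = rng.normal(size=(200, 3))
pool_noise_var = np.where(np.arange(200) < 100, 0.1, 10.0)
selected = greedy_select(X_pool, pool_noise_var, budget=10)
```

Under these assumptions, a sample with large measurement variance receives a small value even when its features are informative, which is the qualitative behavior the abstract describes. Other criteria (for example, trace of the inverse information) could be substituted in `sample_values` without changing the greedy loop.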
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Tom_Rainforth1
Submission Number: 6475