A Unified Framework for Generalization Error Analysis of Learning with Arbitrary Discrete Weak Features

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: This paper presents a unified framework for learning with low-quality discrete features and analyzes the theoretical interplay between feature estimation performance and downstream predictive performance.
Abstract: In many real-world applications, predictive tasks inevitably involve low-quality input features (Weak Features; WFs) which arise due to factors such as misobservations, missingness, or partial observations. While several methods have been proposed to estimate the true values of specific types of WFs and to solve a downstream task, a unified theoretical framework that comprehensively addresses these methods remains underdeveloped. In this paper, we propose a unified framework called Weak Features Learning (WFL), which accommodates arbitrary discrete WFs and a broad range of learning algorithms, and we demonstrate its validity. Furthermore, we introduce a class of algorithms that learn both the estimation model for WFs and the predictive model for a downstream task and perform a generalization error analysis under finite-sample conditions. Our results elucidate the interdependencies between the estimation errors of WFs and the prediction error of a downstream task, as well as the theoretical conditions necessary for the learning approach to achieve consistency. This work establishes a unified theoretical foundation, providing generalization error analysis and performance guarantees, even in scenarios where WFs manifest in diverse forms.
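The class of algorithms the abstract describes learns two models: an estimation model that recovers the true values of the weak features, and a predictive model for the downstream task trained on the corrected inputs. A minimal sketch of that two-stage pipeline is below. It is illustrative only, not the paper's actual algorithm: the synthetic noisy-channel corruption, the small labeled auxiliary set used to fit the estimation model, and the least-squares downstream learner are all assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup (an assumption of this sketch, not the paper's construction):
# a binary true feature is observed through a noisy discrete channel that
# flips its value with probability 0.6, yielding a weak feature (WF).
n = 1000
z_true = rng.integers(0, 2, size=n)                          # latent true discrete feature
z_weak = np.where(rng.random(n) < 0.6, 1 - z_true, z_true)   # corrupted (weak) copy
x = rng.normal(size=(n, 2))                                  # clean auxiliary features
y = ((z_true + (x[:, 0] > 0)) % 2).astype(float)             # downstream label

# Stage 1: estimation model for the WF. Here a count-based plug-in estimate
# of P(z_true | z_weak) is fit on a small auxiliary set where the true value
# is known (the availability of such a set is assumed for illustration).
aux = slice(0, 200)
p_true_given_weak = np.array([
    z_true[aux][z_weak[aux] == v].mean() for v in (0, 1)
])
z_hat = (p_true_given_weak[z_weak] > 0.5).astype(float)      # MAP-corrected feature

# Stage 2: downstream predictive model trained on the corrected feature
# (ordinary least squares as a stand-in for an arbitrary learning algorithm).
design = np.column_stack([np.ones(n), z_hat, x])
w, *_ = np.linalg.lstsq(design[200:], y[200:], rcond=None)
pred = (design @ w > 0.5).astype(float)
acc = (pred == y).mean()
```

The paper's generalization error analysis concerns exactly this coupling: how the Stage 1 estimation error of `z_hat` propagates into the Stage 2 prediction error, and under what conditions the combined procedure is consistent.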
Lay Summary: How does the quality of input data affect the performance of predictive models trained with machine learning? We examine the relationship between the quality of input data and the performance of predictive models trained on inputs whose quality has been improved by estimation. We show that input quality substantially influences the number of training samples required to achieve a target level of predictive accuracy. Our findings provide a foundation for theoretical investigations into the interplay between data quality and learning performance.
Link To Code: https://github.com/KOHsEMP/discrete_WFL
Primary Area: Theory->Learning Theory
Keywords: weak features learning, impute-then-regress, complementary features learning, missing value, weak supervised learning
Submission Number: 5278