Keywords: Automatic Feature Transformation, Tabular Data, Federated Learning
Abstract: Tabular data plays a crucial role in numerous real-world decision-making applications, but extracting valuable insights often requires sophisticated feature transformations. These transformations apply mathematical operations to raw features, significantly improving predictive performance. In practice, tabular datasets are frequently fragmented across multiple clients due to widespread data distribution, privacy constraints, and data silos, making it challenging to derive unified and generalized insights. To address these issues, we propose a novel Federated Feature Transformation (FEDFT) framework that enables collaborative learning while preserving data privacy. In this framework, each local client independently computes feature transformation sequences and evaluates the corresponding model performance. Instead of exchanging sensitive original data, clients transmit these transformation sequences and performance metrics to a central global server. The server then compresses and encodes the aggregated knowledge into a unified embedding space, facilitating the identification of optimal feature transformation sequences. To ensure accurate and unbiased aggregation, we employ a sample-aware weighting strategy that assigns higher weights to clients with larger, more diverse, and numerically stable datasets, as their performance metrics are statistically more reliable and representative. We also incorporate a server-side calibration mechanism that adaptively refines the unified embedding space, mitigating bias from outlier data distributions. Furthermore, to obtain optimal transformation sequences at both global and local scales, the globally optimal sequences are disseminated back to the local clients. We then develop a sequence fusion strategy that blends these globally optimal features with essential non-overlapping local transformations critical for local predictions. Extensive experiments demonstrate the efficiency, effectiveness, and robustness of our framework. Code and data are publicly available.
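The sample-aware weighting described in the abstract can be sketched as follows. This is an illustrative assumption of how client weights might combine sample count, feature diversity, and numerical stability; the function names and the specific formula are hypothetical and not taken from the paper.

```python
import numpy as np

def client_weight(X, eps=1e-8):
    """Hypothetical sample-aware weight: larger, more diverse, and
    numerically stable client datasets receive higher weight.
    The exact formula is an illustrative sketch, not FEDFT's rule."""
    n_samples = X.shape[0]
    diversity = float(np.nanmean(np.nanstd(X, axis=0)))   # spread across features
    bad_fraction = float(np.mean(~np.isfinite(X)))         # NaN/inf share
    stability = 1.0 / (1.0 + bad_fraction)
    return n_samples * (diversity + eps) * stability

def aggregate_performance(client_datasets, client_scores):
    """Weighted aggregation of client-reported performance metrics."""
    weights = np.array([client_weight(X) for X in client_datasets])
    weights = weights / weights.sum()
    return float(np.dot(weights, np.asarray(client_scores, dtype=float)))
```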
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 14959