Self-Imputation and Cross-Variable Learning Improve Water Quality Prediction with Sparse Data

Published: 09 Jun 2025, Last Modified: 09 Jun 2025, Venue: FMSD @ ICML 2025, Readers: Everyone, License: CC BY 4.0
Keywords: tabular foundation model, imputation, water quality, missing data, cross-variable, time series
Abstract: Accurate water quality prediction is essential for effective environmental management, yet infrequent sampling results in severe data sparsity, which makes traditional deep learning models difficult to train. To address this, we propose a novel two-stage framework that leverages a tabular foundation model for multivariate time series prediction under sparse data conditions. In the first stage, the model self-imputes missing water quality values using hydroclimatic and calendar-based features. In the second stage, the imputed time series of all other water quality variables serve as augmented inputs to further improve prediction for each target variable. Evaluated on a continental-scale dataset, our proposed solution significantly outperforms both direct application of the foundation model and traditional deep learning baselines. We also demonstrate that explicit self-imputation of missing data yields more accurate predictions than relying on the model's internal missing-value handling. To the best of our knowledge, this is the first study to demonstrate the effectiveness of tabular foundation models for sparse environmental time series prediction, providing a reliable and data-efficient alternative to traditional deep sequence models.
Submission Number: 79
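
The two-stage framework described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the tabular foundation model, so a small pure-Python k-nearest-neighbour regressor (`KNNRegressor`) stands in for it, and the function names (`self_impute`, `two_stage_predict`), the variable names (`nitrate`, `phosphate`), and the toy data are all hypothetical. Missing measurements are assumed to be encoded as `None`.

```python
import math


class KNNRegressor:
    """Hypothetical stand-in for the tabular foundation model (fit/predict only)."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X, self.y = [list(r) for r in X], list(y)
        return self

    def predict(self, X):
        preds = []
        for row in X:
            # Average the targets of the k closest training rows.
            ranked = sorted(zip(self.X, self.y), key=lambda p: math.dist(p[0], row))
            nearest = ranked[: self.k]
            preds.append(sum(t for _, t in nearest) / len(nearest))
        return preds


def self_impute(features, series, make_model):
    """Stage 1: fill None entries of one variable from exogenous features."""
    observed = [i for i, v in enumerate(series) if v is not None]
    model = make_model().fit([features[i] for i in observed],
                             [series[i] for i in observed])
    filled = model.predict(features)
    # Keep real measurements; use model output only where data is missing.
    return [v if v is not None else p for v, p in zip(series, filled)]


def two_stage_predict(features, variables, target, query_rows, make_model=KNNRegressor):
    """Stage 2: predict `target` from features plus the imputed other variables."""
    others = sorted(name for name in variables if name != target)
    imputed = {n: self_impute(features, variables[n], make_model) for n in others}
    # Augment each row with the gap-free series of every non-target variable.
    augmented = [list(f) + [imputed[n][i] for n in others]
                 for i, f in enumerate(features)]
    held_out = set(query_rows)
    # Train only on rows where the target was measured and is not being queried.
    train = [i for i, v in enumerate(variables[target])
             if v is not None and i not in held_out]
    model = make_model().fit([augmented[i] for i in train],
                             [variables[target][i] for i in train])
    return model.predict([augmented[i] for i in query_rows])


# Toy data (assumed): one exogenous feature per row and two sparse variables.
features = [[float(t)] for t in range(10)]
variables = {
    "nitrate": [0.0, None, 4.0, None, 8.0, 10.0, None, 14.0, None, 18.0],
    "phosphate": [1.0, 2.0, None, 4.0, None, 6.0, 7.0, None, 9.0, None],
}
print(two_stage_predict(features, variables, "nitrate", query_rows=[5]))
```

In this sketch, stage 1 only ever sees the exogenous features, while stage 2 sees those features plus the filled-in series of every non-target variable, which is the cross-variable step. Swapping in a real tabular foundation model would, under these assumptions, only require an object exposing the same `fit` and `predict` methods.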