TL;DR: A novel masked autoencoding imputation method that learns from real missingness patterns and feature context
Abstract: We present CACTI, a masked autoencoding approach for tabular data imputation that leverages the structure in missingness patterns and contextual information. Our approach employs a novel median truncated copy masking training strategy that encourages the model to learn from empirical patterns of missingness, while incorporating semantic relationships between features (captured by column names and text descriptions) to better represent feature dependence. These dual sources of inductive bias enable CACTI to outperform state-of-the-art methods, with an average $R^2$ gain of 7.8% over the next best method (13.4%, 6.1%, and 5.3% under missing not at random, missing at random, and missing completely at random, respectively) across a diverse range of datasets and missingness conditions. Our results highlight the value of leveraging dataset-specific contextual information and missingness patterns to enhance imputation performance.
Lay Summary: Imagine trying to complete a puzzle where some pieces are missing; this is what data scientists face daily when working with incomplete datasets. Missing information in medical records, survey responses, or business data can lead to flawed analyses and poor decisions. Current methods for filling these gaps treat all missing data the same way, as if puzzle pieces disappeared randomly. We created CACTI, a new machine learning approach that recognizes that data often goes missing in patterns. CACTI learns these real-world patterns by reusing observed missingness structures to improve its predictions for filling in missing data. CACTI also reads column descriptions to understand relationships between different types of information, much like understanding that "blood pressure" and "heart rate" are related health measurements. When tested on real datasets, CACTI outperformed the best existing methods by an average of 7.8%, reaching up to a 13.4% improvement in the most complex cases. This means researchers and organizations can extract more accurate insights, more reliable findings, and better downstream decisions from the same imperfect datasets they already have.
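The core training idea described above, reusing observed missingness structures as training masks, can be illustrated with a minimal sketch. This is a hypothetical simplification of basic copy masking (the `copy_mask` helper and its exact behavior are assumptions, not the paper's median truncated variant): each row borrows the missingness pattern of another randomly chosen row, and the values hidden by the borrowed pattern become reconstruction targets.

```python
import numpy as np

def copy_mask(X, seed=None):
    """Illustrative copy masking for an autoencoder's training step.

    For each row of X, borrow the empirical missingness pattern of a
    randomly chosen donor row and hide those entries as well. Returns the
    additionally masked matrix and a boolean matrix of positions that
    were observed but are now hidden (the reconstruction targets).
    NOTE: a hypothetical sketch, not CACTI's exact median truncated scheme.
    """
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(X)                 # True where a value is present
    n_rows = X.shape[0]
    donors = rng.integers(0, n_rows, size=n_rows)  # one donor row per row
    borrowed_missing = ~observed[donors]    # donor rows' missingness patterns
    targets = borrowed_missing & observed   # observed values we will hide
    X_masked = X.copy()
    X_masked[targets] = np.nan              # model must reconstruct these
    return X_masked, targets
```

Because the masks are drawn from the dataset's own missingness indicators, the training objective reflects the empirical (possibly non-random) pattern of what tends to be missing together, rather than a uniform random mask.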
Link To Code: https://github.com/sriramlab/CACTI
Primary Area: General Machine Learning->Unsupervised and Semi-supervised Learning
Keywords: Tabular data imputation, Inductive bias, Context awareness, Data-driven masking
Submission Number: 13310