Abstract: Score-based generative models are widely used across machine learning applications and achieve strong generation quality. First-order scores have proven effective for modeling and generating synthetic data, and high-order scores (higher-order derivatives of the log data density) offer deeper insight into data distributions and open up new applications. However, learning these scores typically requires complete data, which is often unavailable in domains such as healthcare and finance due to data corruption, acquisition constraints, or incomplete records. To tackle this challenge, we introduce MissScore, a novel framework for estimating high-order scores in the presence of missing data. We derive objective functions for estimating high-order scores under different missing data mechanisms and propose a new algorithm specifically designed to handle missing data effectively. Our empirical results demonstrate that MissScore accurately and efficiently learns high-order scores from incomplete data and generates high-quality samples, resulting in strong performance across a range of downstream tasks.
Lay Summary: Machine learning models often need a deep understanding of data patterns to generate realistic samples or make accurate predictions. One powerful approach uses *score functions*, which describe how the likelihood of data changes as input values vary. These scores are especially useful in advanced models such as diffusion-based generators. However, calculating them typically assumes that the data is fully observed, which is often not the case in real-world domains like healthcare or finance where missing values are common.
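As a concrete illustration of the score functions described above, the block below writes out the standard first- and second-order scores and evaluates them for a Gaussian. The symbols $p$, $x$, $\mu$, and $\Sigma$ are chosen here for illustration and are not necessarily the paper's own notation.

```latex
% First-order score: gradient of the log-density with respect to the input x.
s_1(x) = \nabla_x \log p(x)

% Second-order score: Hessian of the log-density with respect to x.
s_2(x) = \nabla_x^2 \log p(x)

% Worked example (illustrative assumption): p(x) = \mathcal{N}(x;\, \mu, \Sigma).
\log p(x) = -\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu) + \text{const}
s_1(x) = -\Sigma^{-1}(x - \mu)
s_2(x) = -\Sigma^{-1}
```

In this Gaussian case the first-order score points back toward the mean, while the second-order score is constant and encodes the covariance structure. This gives a sense of the extra information that higher-order scores can carry.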
We developed **MissScore**, a new method that estimates not only first-order but also higher-order score functions directly from incomplete datasets. Unlike methods that rely on imputation or auxiliary models, MissScore works directly with observed data and is designed to handle a range of realistic missing data scenarios.
Our results show that MissScore generates high-quality synthetic data, improves sampling efficiency, and helps uncover causal relationships, all without requiring complete datasets. This opens the door to more reliable, efficient, and trustworthy machine learning in messy, real-world scenarios.
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: Generative models; Score matching; Tabular data
Submission Number: 5176