Probabilistic Missing Value Imputation for Mixed Categorical and Ordered Data

Yuxuan Zhao; Alex Townsend; Madeleine Udell

Published: 31 Oct 2022, Last Modified: 08 Oct 2022, NeurIPS 2022 Accept, Readers: Everyone

Keywords: Categorical data, missing value imputation, mixed data

TL;DR: New distribution model and imputation algorithms for mixed data containing categorical variables

Abstract: Many real-world datasets contain missing entries and mixed data types, including categorical and ordered (e.g. continuous and ordinal) variables. Imputing the missing entries is necessary, since many data analysis pipelines require complete data, but it is challenging, especially for mixed data. This paper proposes a probabilistic imputation method, based on an extended Gaussian copula model, that supports both single and multiple imputation. The method models mixed categorical and ordered data using a latent Gaussian distribution. The unordered nature of categorical variables is modeled explicitly using the argmax operator. The method makes no assumptions about the data marginals and requires no hyperparameter tuning. Experimental results on synthetic and real datasets show that imputation with the extended Gaussian copula outperforms the current state of the art for both categorical and ordered variables in mixed data.
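
The latent-variable idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation and does not perform imputation; it only shows how a categorical variable can arise as the argmax of a latent Gaussian vector while an ordered variable arises as a monotone transform of a correlated latent coordinate. The function name `sample_mixed`, the mean shifts `mu`, the correlation value 0.6, and the ordinal cut points are all assumptions chosen for illustration.

```python
import numpy as np


def sample_mixed(n, seed=0):
    """Draw n rows of (categorical, continuous, ordinal) from a toy latent Gaussian model."""
    rng = np.random.default_rng(seed)
    k = 3  # hypothetical number of categories

    # Assumed latent correlation matrix: coordinates 0..k-1 drive the
    # categorical variable, coordinate k drives the ordered variables.
    sigma = np.eye(k + 1)
    sigma[0, k] = sigma[k, 0] = 0.6  # illustrative dependence, not from the paper

    # Hypothetical mean shifts that make some categories more likely than others.
    mu = np.array([0.5, 0.0, -0.5])

    z = rng.multivariate_normal(np.zeros(k + 1), sigma, size=n)

    # Categorical variable: index of the largest shifted latent coordinate.
    # No ordering of the categories is implied by this construction.
    categorical = np.argmax(z[:, :k] + mu, axis=1)

    # Ordered variables: monotone transforms of one latent coordinate.
    continuous = np.exp(z[:, k])                  # strictly increasing map
    ordinal = np.digitize(z[:, k], [-0.5, 0.5])   # levels 0, 1, 2 by cut points

    return categorical, continuous, ordinal


categorical, continuous, ordinal = sample_mixed(1000)
```

Because the categorical and ordered variables share a correlated latent Gaussian, observing one carries information about the other, which is the kind of dependence a copula-based imputation method can exploit.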

Supplementary Material: pdf
