Sufficient Representations for Categorical Variables

Jonathan Johannemann, Vitor Hadad, Susan Athey, Stefan Wager

14 Sept 2020 | OpenReview Archive Direct Upload | Readers: Everyone

Abstract: Many learning algorithms require categorical data to be transformed into real vectors before it can be used as input. Often, categorical variables are encoded as one-hot or dummy vectors. However, this mode of representation can be wasteful since it adds many low-signal regressors, especially when the number of unique categories is large. In this paper, we investigate simple alternative solutions for universally consistent estimators that rely on lower-dimensional real-valued representations of categorical variables that are sufficient in the sense that no predictive information is lost. We then compare preexisting and proposed methods on simulated and observational datasets.
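To make the dimensionality contrast concrete, the following minimal sketch compares a one-hot encoding with one possible lower-dimensional encoding, in which each category is replaced by the within-category mean of the continuous covariates. The choice of this "means" encoding, the function names `one_hot` and `means_encode`, and the simulated sizes are assumptions made for illustration only; the abstract does not specify which representations the paper proposes, and this sketch does not by itself demonstrate sufficiency.

```python
import numpy as np


def one_hot(codes, n_categories):
    """Encode integer category codes as an (n, n_categories) indicator matrix."""
    out = np.zeros((codes.shape[0], n_categories))
    out[np.arange(codes.shape[0]), codes] = 1.0
    return out


def means_encode(codes, X, n_categories):
    """Replace each category by the mean of X over rows in that category.

    Hypothetical helper: returns an (n, p) matrix, where p = X.shape[1],
    regardless of how many distinct categories there are.
    """
    sums = np.zeros((n_categories, X.shape[1]))
    np.add.at(sums, codes, X)  # unbuffered add: accumulates per-category sums
    counts = np.bincount(codes, minlength=n_categories).astype(float)
    counts[counts == 0] = 1.0  # avoid division by zero for unseen categories
    group_means = sums / counts[:, None]
    return group_means[codes]


# Simulated data (assumed sizes): 1000 rows, 200 categories, 3 covariates.
rng = np.random.default_rng(0)
n, k, p = 1000, 200, 3
codes = rng.integers(0, k, size=n)
X = rng.normal(size=(n, p))

wide = one_hot(codes, k)            # one column per category
narrow = means_encode(codes, X, k)  # one column per continuous covariate

print(wide.shape, narrow.shape)  # (1000, 200) (1000, 3)
```

Here the one-hot matrix grows with the number of categories, while the means-based encoding has a fixed width equal to the number of continuous covariates.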
