Conditioning on "and nothing else": Simple Models of Missing Data between Naive Bayes and Logistic Regression

Jun 10, 2020 Submission readers: everyone
  • TL;DR: We introduce a model that is more general than, but as simple as, naive Bayes and logistic regression, and that conditions on the fact that something was not reported.
  • Keywords: generative model, missing data, naive bayes, logistic regression, regression
  • Abstract: When people report information in a free-form way, we need to condition on the fact that someone did not report something. Although unreported statements must be taken into account, there are often too many statements that could have been reported to consider each one; we only want to reason about those that were reported. In this paper we start with two simple, common models, naive Bayes and logistic regression, which are equivalent models but are trained differently with respect to how missing data is handled. Naive Bayes is traditionally trained generatively, to make optimal predictions assuming only one value is observed (and making independence assumptions for the rest), whereas logistic regression is traditionally trained discriminatively, assuming no data is missing. It is generally assumed that these are qualitatively different, but in this paper we show that there is a continuum between them. In particular, we present a model that is more general than both, yet still simple, and that can be trained to condition on missing data. Specifically, it conditions on "and nothing else [was reported]", enabling us to avoid reasoning about the myriad things that were not reported while still taking them into account.
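
The equivalence mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's model. It assumes binary features, two classes, and fully observed inputs, and it only shows the standard fact that a naive Bayes posterior can be rewritten as a logistic function of a linear score. The function names and the parameter values are hypothetical, chosen purely for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nb_posterior(x, prior, p1, p0):
    """P(y=1 | x) by Bayes' rule under naive Bayes with binary features.

    prior  = P(y=1)
    p1[i]  = P(x_i=1 | y=1)
    p0[i]  = P(x_i=1 | y=0)
    """
    like1, like0 = prior, 1.0 - prior
    for xi, a, b in zip(x, p1, p0):
        like1 *= a if xi else 1.0 - a
        like0 *= b if xi else 1.0 - b
    return like1 / (like1 + like0)

def nb_to_logistic(prior, p1, p0):
    """Convert naive Bayes parameters into a logistic-regression bias and weights."""
    bias = math.log(prior / (1.0 - prior))
    weights = []
    for a, b in zip(p1, p0):
        # Contribution of x_i = 0 is folded into the bias.
        bias += math.log((1.0 - a) / (1.0 - b))
        # Weight is the difference of the log-odds of x_i = 1 in each class.
        weights.append(math.log(a / (1.0 - a)) - math.log(b / (1.0 - b)))
    return bias, weights

def logistic_posterior(x, bias, weights):
    """P(y=1 | x) as a sigmoid of a linear function of x."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Illustrative (made-up) parameters for three binary features.
prior, p1, p0 = 0.3, [0.9, 0.2, 0.6], [0.4, 0.5, 0.1]
bias, weights = nb_to_logistic(prior, p1, p0)
```

For any fully observed binary input, `nb_posterior(x, prior, p1, p0)` and `logistic_posterior(x, bias, weights)` agree up to floating-point error. The two models differ in how the parameters are fitted and in how missing inputs are treated, which is the distinction the abstract draws.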