Keywords: calibration, distribution shift, ensembles
TL;DR: Robustness interventions (such as removing spurious correlations) improve OOD accuracy at the cost of decreasing ID accuracy - we show that calibrated ensembles are a simple and effective solution to this problem
Abstract: We often see undesirable tradeoffs in robust machine learning where out-of-distribution (OOD) accuracy is at odds with in-distribution (ID) accuracy. A ‘robust’ classifier obtained via specialized techniques like removing spurious features has better OOD but worse ID accuracy compared to a ‘standard’ classifier trained via vanilla ERM. On six distribution shift datasets, we find that simply ensembling the standard and robust models is a strong baseline---we match the ID accuracy of a standard model with only a small drop in OOD accuracy compared to the robust model. However, calibrating these models in-domain surprisingly improves the OOD accuracy of the ensemble and completely eliminates the tradeoff and we achieve the best of both ID and OOD accuracy over the original models.