Keywords: segmentation, reliability, robustness
TL;DR: We built a framework to evaluate different reliability-oriented output layers for semantic segmentation. We find that replacing the deterministic output layer with a Gaussian process (GP) layer improves model robustness across multiple types of dataset shift.
Abstract: Recent work has shown the importance of reliability, where model performance is assessed under stress conditions pervasive in real-world deployment. In this work, we examine reliability tasks in the setting of semantic segmentation, a dense output problem that has typically only been evaluated using in-distribution predictive performance---for example, the mean intersection over union score on the Cityscapes validation set. To reduce the gap toward reliable deployment in the real world, we compile a benchmark involving existing (and newly constructed) distribution shifts and metrics. We evaluate current models and several baselines to determine how well segmentation models make robust predictions across multiple types of distribution shift and flag when they don’t know.
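As a rough illustration of the TL;DR's core idea, the sketch below swaps a plain linear per-pixel classification head for a random-Fourier-feature approximation of a GP output layer (in the spirit of last-layer GP methods such as SNGP). This is a minimal NumPy sketch under assumed names and shapes, not the paper's implementation; `gp_head`, the feature dimensions, and the random (untrained) output weights are all illustrative.

```python
# Hedged sketch: replacing a deterministic per-pixel classification head with a
# random-feature GP layer. All names/shapes are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

def gp_head(features, num_classes, num_rff=256):
    """Approximate GP logits via random Fourier features (RFF).

    features: (H*W, D) per-pixel backbone features.
    Returns (H*W, num_classes) logits from a GP approximation instead of a
    plain linear map, enabling distance-aware uncertainty at the output layer.
    """
    d = features.shape[-1]
    # Fixed random projection and phase define the RFF map phi(x) = cos(xW + b),
    # approximating an RBF-kernel GP prior over the logits.
    W = rng.normal(size=(d, num_rff))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_rff)
    phi = np.sqrt(2.0 / num_rff) * np.cos(features @ W + b)
    # Output weights would be learned in practice; random here for illustration.
    beta = rng.normal(size=(num_rff, num_classes))
    return phi @ beta

# Example: a flattened 4x4 feature map with 8-dim features, 3 classes.
feats = rng.normal(size=(16, 8))
logits = gp_head(feats, num_classes=3)
print(logits.shape)  # (16, 3)
```

In a full model the RFF projection replaces only the final layer; the backbone is unchanged, which is what makes this a drop-in swap for the deterministic head.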