Abstract: Scope of Reproducibility. The authors claim that, given an ensemble of deep neural networks, their proposed method can capture the ensemble's uncertainty estimation and decomposition capabilities in a single model. They also claim that this results in only a small reduction in classification performance compared to the ensemble. We examine these claims by reproducing most of the authors' experiments on the CIFAR-10 dataset. Methodology. The proposed method was re-implemented in tf.keras. The surrounding data pipelines, pre-processing, and experimentation code were also re-implemented. As in the original paper, the models were based on VGG-16 networks trained from scratch with random initialization. Training and evaluation were done on two consumer-grade GPUs, for a total of 273 hours. Results. Our findings support the authors' central claims. In terms of uncertainty estimation, our Ensemble Distribution Distillation (EnDD) model achieved $(99 \pm 1)\%$ of the AUC-ROC of our ensemble on the out-of-distribution (OOD) detection task. The corresponding value in the original paper was $(100 \pm 1)\%$. In terms of classification, our EnDD had $(16 \pm 1)\%$ higher error than our ensemble. The corresponding value in the original paper was $(11 \pm 6)\%$. Other metrics showed similar agreement, but, significantly, in the OOD-detection task our Ensemble Distillation (EnD) model performed at least as well as our EnDD. This is in stark contrast with the original paper. We also took a novel approach to visualizing the uncertainty decomposition by plotting the resulting distributions on a simplex. This offers a visual explanation for some surprising results in the original paper, while mostly supporting the authors' intuitive justifications for the model. What was easy. The original paper features a thorough mathematical formulation of the method, aiding conceptual understanding. The datasets used by the authors are publicly available. The use of the simpler datasets also meant that it was computationally feasible for us to reproduce these results.
The base model is well known, with several implementations available, allowing us to focus on the novel aspects of the method. What was difficult. While the theoretical explanations of the method are excellent, we initially found it hard to translate them into an implementation. Our difficulty was likely caused by our inexperience with the subject matter. Nonetheless, pseudocode, such as that which we have provided, would have simplified the re-implementation. We were not able to reproduce the results on some of the datasets due to limited computational resources. Communication with original authors. We did not contact the original authors directly, but we did refer to a public GitHub repository and blog post created by one of the authors. At the same time as submitting this report to the ML Reproducibility Challenge 2020, we also sent a copy to the authors and asked for their feedback.
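For readers unfamiliar with the term, the "uncertainty decomposition" mentioned above can be illustrated with a minimal NumPy sketch. It assumes the common entropy-based split for an ensemble of classifiers: total uncertainty is the entropy of the averaged prediction, data uncertainty is the average entropy of the members, and knowledge uncertainty is their difference (the mutual information). The function name `decompose_uncertainty` and the `eps` smoothing constant are hypothetical choices for illustration; this is not taken from the report's or the original authors' code.

```python
import numpy as np


def decompose_uncertainty(member_probs, eps=1e-12):
    """Split ensemble uncertainty for one input (illustrative sketch).

    member_probs: array of shape (n_members, n_classes); each row is one
    ensemble member's predicted class distribution for the same input.
    Returns (total, data, knowledge) uncertainties in nats.
    """
    member_probs = np.asarray(member_probs, dtype=float)

    # Total uncertainty: entropy of the ensemble-averaged prediction.
    mean_probs = member_probs.mean(axis=0)
    total = -np.sum(mean_probs * np.log(mean_probs + eps))

    # Data uncertainty: average entropy of the individual members.
    member_entropies = -np.sum(member_probs * np.log(member_probs + eps), axis=1)
    data = member_entropies.mean()

    # Knowledge uncertainty: disagreement between members (mutual information).
    knowledge = total - data
    return total, data, knowledge


# Members agree on an ambiguous input: uncertainty is attributed to the data.
agree = decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]])

# Members are individually confident but disagree: uncertainty is
# attributed to lack of knowledge, as one would expect for OOD inputs.
disagree = decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]])
```

In this sketch, a high knowledge uncertainty is the kind of score one might threshold for OOD detection, which is the task the AUC-ROC figures above refer to.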
Paper Url: https://openreview.net/forum?id=BygSP6Vtvr
Supplementary Material: zip