Keywords: Nondeterminism, Variability, Instability, Randomness.
Abstract: Scope of Reproducibility: The original paper makes three claims: (1) all sources of nondeterminism, surprisingly, produce a similar degree of variability in the performance of a neural network throughout training; (2) model instability during training is the key factor behind this phenomenon; and (3) two approaches, Accelerated Ensembling and Test-Time Data Augmentation, reduce run-to-run variability without incurring additional training cost. The original experiments covered two tasks, image classification and language modelling. Because of the intensive training and time each experiment requires, we test all three claims on image classification only.
Methodology: Our investigation of the paper's claims has three parts: (1) Replication: we took the publicly available code and adapted it to our experimental environment, with some modifications, to replicate the results; (2) Ablation study: we varied the parameters, reducing the total running time to less than half that of the original study while keeping the central claim intact; (3) Generalization: we tested the authors' claims on a much more complex dataset and architecture to see whether the conclusions carry over. All experiments required extensive training: a single experiment alone took $490$ hours on $2$ Nvidia Tesla V100 16GB GPUs (i.e., $700$ trained models).
Result: Our results confirm that all individual and combined sources of nondeterminism have similar effects on model variability, and that instability in neural network optimization is the main reason for this phenomenon.
However, our results show some discrepancies in the reduction of variability achieved by test-time data augmentation (TTA) and accelerated ensembling (claim 3 above). Like the original study, we find that these approaches reduce variability, but the original study reports a reduction of $61\%$, whereas the highest reduction we obtain is $51\%$. Despite this difference, the third claim holds and we support it.
What was easy: The authors have made the source code publicly available in a GitLab repository. Even without extensive documentation, reimplementing the experiments was straightforward and required little effort. The clearly presented details in the paper also significantly reduced the effort needed to set up the experimental configurations. The use of standard neural network training and widely used datasets made the implementation even easier to follow. This allowed us to explore other aspects of the method.
What was difficult: Although the implementation was intuitive and easy to understand with the resources provided, validating some baselines was computationally intensive and time-consuming because it required multiple runs. In particular, the variability analysis required training $100$ models for $500$ epochs each to verify the role of a single source of nondeterminism. We managed to keep the original settings, but we could not repeat the experiments to gain more confidence in the results.
Communication with original authors: We contacted the original authors once, at the beginning of our reproducibility study. They answered our basic questions about the experimental settings, which laid the foundation for the rest of our experiments. We also referred to their post and answers on the OpenReview portal.
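The comparison of reduction percentages above can be made concrete with a minimal sketch. It assumes that variability is summarized as the sample standard deviation of test accuracy across independent training runs, which this abstract does not state explicitly. The function names and the accuracy values are hypothetical and are not results from either study.

```python
import statistics


def run_to_run_std(accuracies):
    """Sample standard deviation of a metric across independent training runs."""
    return statistics.stdev(accuracies)


def relative_reduction(baseline_std, mitigated_std):
    """Percentage by which a mitigation shrinks the run-to-run standard deviation."""
    return 100.0 * (baseline_std - mitigated_std) / baseline_std


# Hypothetical test accuracies (%) from five runs each; illustrative numbers only.
baseline_runs = [93.0, 94.0, 95.0, 96.0, 97.0]   # e.g. plain training
mitigated_runs = [94.0, 94.5, 95.0, 95.5, 96.0]  # e.g. with TTA or ensembling

baseline_std = run_to_run_std(baseline_runs)
mitigated_std = run_to_run_std(mitigated_runs)
reduction = relative_reduction(baseline_std, mitigated_std)

print(f"baseline std:  {baseline_std:.3f}")
print(f"mitigated std: {mitigated_std:.3f}")
print(f"reduction:     {reduction:.1f}%")
```

Under this assumed definition, a reported reduction of $61\%$ versus $51\%$ would correspond to the mitigated standard deviation being $39\%$ versus $49\%$ of the baseline value.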
Paper Url: https://paperswithcode.com/paper/nondeterminism-and-instability-in-neural
Paper Venue: ICML 2021
Supplementary Material: zip