Replication: Fairness without demographics through Adversarially Reweighted Learning

31 Jan 2021 (modified: 05 May 2023) | ML Reproducibility Challenge 2020 Blind Submission | Readers: Everyone
Abstract:

Scope of Reproducibility: We test the claim that Adversarially Reweighted Learning (ARL) improves Rawlsian Max-Min fairness in supervised classification, compared to previous methods and simple baselines, when demographic data are missing.

Methodology: We re-implemented all models and training routines from scratch in PyTorch, using the paper and the published code as references. We compared our implementation against the one provided by the authors and then reproduced the hyperparameter search described in the paper using our implementation. In addition, we applied the method to image data to test how well it generalizes across modalities. Because the models involved are simple, implementing the codebase from scratch and running the experiments was straightforward. Parts of the experiments were run on a computing cluster with 12 CPUs and parts on Google Colab (GPU). Producing the results took 4 weeks, with four people working on it. A complete grid search took 4 hours, and producing the final results with fixed hyperparameters took 5 hours; a single training run of one model on one dataset took about 2 minutes.

Results: We could not replicate the advantage of ARL over the investigated baselines. This appears to be mainly due to better baseline performance than reported in the paper: our baseline results are on average 2.615 standard deviations higher than the authors'. Our ARL results do not deviate significantly from the paper's; they are on average 0.841 standard deviations higher.

What was easy: ARL itself was very easy to implement, and we were able to run the code provided by the authors with little trouble. Running the experiments required few computational resources because the datasets and models are small.

What was difficult: Pre-processing the data took time because the notebooks provided by the authors contained errors that we needed to debug. To finish the replication of the grid search for hyperparameter optimization of all models on all datasets on time, we had to cap training at 5,000 steps.

Communication with original authors: We asked the authors about details of their training procedure. The authors provided the missing details and adapted their GitHub repository in response to our communication.
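For context, the core of ARL is an alternating minimax update: a learner minimizes a per-example-weighted loss while an adversary, which sees only the features and labels (no demographic attributes), produces the weights so as to maximize that loss. Below is a minimal PyTorch sketch of one such update for binary classification; the class names, architectures, and normalization details here are our own illustrative choices, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Learner(nn.Module):
    """Small MLP classifier emitting one logit per example (illustrative)."""
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)  # logits, shape (n,)

class Adversary(nn.Module):
    """Linear adversary over (x, y) producing per-example weights."""
    def __init__(self, n_features):
        super().__init__()
        self.lin = nn.Linear(n_features + 1, 1)

    def forward(self, x, y):
        s = torch.sigmoid(
            self.lin(torch.cat([x, y.unsqueeze(-1)], dim=-1))).squeeze(-1)
        # Normalize so every weight is >= 1 and the weights sum to 2n,
        # keeping the adversary from zeroing out or exploding the loss.
        return 1.0 + len(s) * s / s.sum()

def arl_step(learner, adversary, opt_l, opt_a, x, y):
    """One alternating minimax update on a batch (x, y) with y in {0, 1}."""
    ce = F.binary_cross_entropy_with_logits(learner(x), y, reduction="none")

    # Learner minimizes the weighted loss; weights are treated as constants.
    opt_l.zero_grad()
    (adversary(x, y).detach() * ce).mean().backward()
    opt_l.step()

    # Adversary maximizes the same weighted loss; per-example losses are
    # treated as constants, so only the adversary's parameters get gradients.
    opt_a.zero_grad()
    (-(adversary(x, y) * ce.detach()).mean()).backward()
    opt_a.step()
```

Note that the adversary never receives protected attributes; it can only upweight regions of the input space where the learner's loss is high, which is what makes the method applicable when demographic data are missing.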
Paper Url: https://openreview.net/forum?id=SiHVX35sDT
Supplementary Material: zip
