Keywords: reproducibility, fairness, gender, race, LIC, LIC score, integrated gradients, MSCOCO, PyTorch, Captum, bias amplification, encoders, captioning models, LSTM, BERT
TL;DR: Our aim was to reproduce and extend the paper "Quantifying Societal Bias Amplification in Image Captioning" by Hirota et al., which we did successfully.
Abstract: Scope of Reproducibility — The main objective of this paper is to reproduce and verify the following claims made in the original paper: (1) according to the LIC metric, all evaluated image captioning models amplify gender and racial bias, (2) the proposed LIC metric is robust against encoders, and (3) the captioning model NIC+Equalizer amplifies gender bias beyond the baseline.
Methodology — We reproduced the results of the original authors with only minor modifications to the code they made available. We extend their research by highlighting a noteworthy limitation in the data split they used and by proposing an integrated gradients method, implemented with the Captum library for PyTorch, that increases explainability and helps users understand predictions. As for the computational requirements, all experiments were run on a cluster with an NVIDIA Titan RTX GPU, and running a total of 720 models took approximately 98 hours.
Results — The results we obtained showed the same patterns as the original authors' work. All our results were within ±1 LIC score unit of the original work, which supports the claims on gender and racial bias amplification, robustness against encoders, and amplification by NIC+Equalizer beyond the baseline. As for our contributions, we show that the attribution scores obtained with integrated gradients follow similar patterns in terms of gender amplification for all evaluated language models, providing additional support for the proposed LIC metric. During data set analysis, we observed leakage in the original data split: identical captions occur multiple times in both the training and test sets. Removing captions already seen during training from the test set reduced its size by 62.4% on average and lowered LIC_M scores by approximately 5 units.
What was easy — Reproducing the results with the originally provided code posed no difficulties.
What was difficult — Finding a useful angle of contribution to the paper proved to be challenging. After we had decided upon using our selected explainability method, implementing and modifying existing code was more work than expected.
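The leakage check described under Results can be illustrated with a minimal sketch. It is not the code used in the report: the helper names (`normalize`, `remove_seen_captions`, `reduction_percent`), the lowercase and whitespace normalization, and the toy captions are all assumptions made for illustration, and the example data is not taken from MSCOCO.

```python
def normalize(caption: str) -> str:
    """Lowercase and collapse whitespace (an assumed notion of 'identical')."""
    return " ".join(caption.lower().split())


def remove_seen_captions(train_captions, test_captions):
    """Keep only test captions whose normalized text never occurs in training."""
    seen = {normalize(c) for c in train_captions}
    return [c for c in test_captions if normalize(c) not in seen]


def reduction_percent(original_size: int, filtered_size: int) -> float:
    """Percentage by which filtering shrank the test set."""
    if original_size == 0:
        return 0.0
    return 100.0 * (original_size - filtered_size) / original_size


# Toy data, invented for this example.
train = ["A man riding a bike.", "A woman holding an umbrella."]
test = [
    "a man riding a bike.",           # duplicate of a training caption
    "A person walking a dog.",        # unseen during training
    "A woman  holding an umbrella.",  # duplicate up to whitespace
]

filtered = remove_seen_captions(train, test)
shrink = reduction_percent(len(test), len(filtered))
print(filtered, f"{shrink:.1f}% removed")
```

Building the set of training captions once keeps the filter linear in the number of captions. Whether near-duplicates should count as leakage is a design choice; exact string matching would be the stricter reading of "identical captions".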
Paper Url: https://arxiv.org/abs/2203.15395
Paper Venue: CVPR 2022
Confirmation: The report PDF is generated from the provided camera-ready Google Colab script, The report metadata is verified from the camera-ready Google Colab script, The report contains correct author information, The report contains link to code and SWH metadata, The report follows the ReScience LaTeX style guides as in the Reproducibility Report Template (https://paperswithcode.com/rc2022/registration), The report contains the Reproducibility Summary in the first page, The LaTeX .zip file is verified from the camera-ready Google Colab script
Paper Review Url: https://openreview.net/forum?id=N9Wn91tE7D0
Journal: ReScience Volume 9 Issue 2 Article 21