\def \codeURL{https://github.com/martentyrk/mlrc2022hirota}
\def \codeDOI{}
\def \dataURL{}
\def \dataDOI{}
\def \editorNAME{}
\def \editorORCID{}
\def \reviewerINAME{}
\def \reviewerIORCID{}
\def \reviewerIINAME{}
\def \reviewerIIORCID{}
\def \dateRECEIVED{03 February 2023}
\def \dateACCEPTED{}
\def \datePUBLISHED{}
\def \articleTITLE{Exploring the Explainability of Bias in Image Captioning Models}
\def \articleTYPE{Replication / ML Reproducibility Challenge 2022}
\def \articleDOMAIN{}
\def \articleBIBLIOGRAPHY{bibliography.bib}
\def \articleYEAR{2023}
\def \reviewURL{https://openreview.net/forum?id=N9Wn91tE7D0}
\def \articleABSTRACT{
\subsubsection*{Scope of Reproducibility}

The main objective of this paper is to reproduce and verify the following claims made in the original paper: (1) according to the LIC metric, all evaluated image captioning models amplify gender and racial bias, (2) the proposed LIC metric is robust against encoders, and (3) the captioning model NIC+Equalizer amplifies gender bias beyond the baseline.

\subsubsection*{Methodology}
We reproduced the results of the original authors with only minor modifications to the code they made available. We contribute to their research by highlighting a noteworthy limitation in the data split used and by proposing an integrated gradients method, implemented with the Captum library for PyTorch, that increases explainability and helps users understand predictions better. As for the computational requirements, all experiments were run on a cluster with an NVIDIA Titan RTX GPU, and the time required to run a total of 720 models was $\sim$98 hours.

\subsubsection*{Results}

The results we obtained showed the same patterns as in the original authors' work. All our results were within $\pm1$ LIC score units of the original work, which supports the claims on gender and racial bias amplification, robustness against encoders, and amplification by NIC+Equalizer beyond the baseline. As for our contributions, we show that the attribution scores obtained with integrated gradients follow similar patterns in terms of gender amplification for all evaluated language models, providing additional support for the proposed LIC metric.\\
During data set analysis, we observed leakage in the original data split: identical captions occur multiple times in both the training and test sets. Removing captions already seen during training from the test set reduced its size by 62.4\% on average and caused a decline in $LIC_M$ scores of approximately 5 units.

\subsubsection*{What was easy}
Reproducing the results with the code provided by the original authors was straightforward.

\subsubsection*{What was difficult}
Finding a useful angle from which to contribute to the paper proved challenging. Once we had settled on an explainability method, implementing it and modifying the existing code took more work than expected.
}
\def \replicationCITE{}
\def \replicationBIB{}
\def \replicationURL{}
\def \replicationDOI{}
\def \contactNAME{}
\def \contactEMAIL{}
\def \articleKEYWORDS{rescience c, rescience x}
\def \journalNAME{ReScience C}
\def \journalVOLUME{9}
\def \journalISSUE{2}
\def \articleNUMBER{}
\def \articleDOI{}
\def \authorsFULL{Anonymous Authors}
\def \authorsABBRV{Anonymous}
\def \authorsSHORT{Anonymous}
\title{\articleTITLE}
\date{}
\author[1,\orcid{0000-0000-0000-0000}]{Anonymous}
\affil[1]{Anonymous Institution}

