\section{Attention Maps on Different Resolutions}
\label{sec:layers}
\input{figures/layers}
As shown in \figureref{fig:layers}, models that have not learned a proper alignment between the image and text modalities can benefit from selecting only the innermost attention layers (\textit{i.e.}, the layers with the lowest resolution).
However, this benefit arises mainly because low-resolution maps are coarse and are therefore naturally closer to an activation area than fine-grained features are.
Inspecting all layers reveals that such models have not, in fact, learned a proper alignment between the two modalities and instead focus on irrelevant details such as the ribcage.
In contrast, CXR-BERT has learned a proper alignment, so we can simply average over all layers to get our results.
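
To make the two aggregation strategies concrete, one possible formalization is sketched below. The notation is introduced here only for illustration and is an assumption, not a description of the exact implementation: $A^{(\ell)}$ denotes the attention map of layer $\ell$ with spatial size $h_\ell \times w_\ell$, $U(\cdot)$ is an assumed upsampling operator (\textit{e.g.}, bilinear interpolation) to a common resolution, and $\mathcal{S}$ is the set of layers being aggregated.
\[
  \bar{A}_{\mathcal{S}} \;=\; \frac{1}{|\mathcal{S}|} \sum_{\ell \in \mathcal{S}} U\!\left(A^{(\ell)}\right)
\]
Under this sketch, averaging over all layers corresponds to choosing $\mathcal{S}$ as the full set of attention layers, whereas selecting only the innermost layers corresponds to restricting $\mathcal{S}$ to those $\ell$ with the smallest $h_\ell \times w_\ell$.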