- Abstract: While neural networks with attention mechanisms have achieved superior performance on many natural language processing tasks, it remains unclear to what extent learned attention resembles human visual attention. We study the similarity between human visual and neural attention and analyze whether neural attention-based methods perform better when they mimic human attention. To this end, we compare state-of-the-art networks based on long short-term memory (LSTM), convolutional neural network (CNN), and XLNet Transformer architectures on a question answering task. We evaluate all methods on a novel 23-participant dataset of eye tracking data recorded while participants read movie plots. We find that while similarity to human attention correlates significantly with performance for the LSTM and CNN, this does not hold for the XLNet, even though the XLNet performs best on this challenging task. Our work shows not only that different architectures seem to learn rather different neural attention, but also that similarity of neural to human attention is not necessarily helpful, and hence not necessarily desirable.