TL;DR: We propose a new sparse and structured attention mechanism, TVmax, which promotes sparsity and encourages the weight of related adjacent locations to be the same.
Abstract: Visual attention mechanisms have been widely used in image captioning models. In this paper, to better link the image structure with the generated text, we replace the traditional softmax attention mechanism by two alternative sparsity-promoting transformations: sparsemax and Total-Variation Sparse Attention (TVmax). With sparsemax, we obtain sparse attention weights, selecting relevant features. In order to promote sparsity and encourage fusing of the related adjacent spatial locations, we propose TVmax. By selecting relevant groups of features, the TVmax transformation improves interpretability. We present results in the Microsoft COCO and Flickr30k datasets, obtaining gains in comparison to softmax. TVmax outperforms the other compared attention mechanisms in terms of human-rated caption quality and attention relevance.
Keywords: Sparsity, attention, structured attention, total variation, fused lasso, image captioning
Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 1 code implementation](https://www.catalyzex.com/paper/arxiv:2002.05556/code)
Original Pdf: pdf
7 Replies
Loading