Abstract: An image captioning method combined with an attention mechanism generally performs better than one without it. However, such methods still fall short of manually produced (human) image captions. For example, they may misrecognize the salient objects in an image, understand the objects incorrectly or only superficially, and mistakenly ignore background information, which leads to error propagation and accumulation when generating sentences. One way to alleviate these problems is to adopt a deliberate attention mechanism in image captioning. This paper proposes a deliberate multi-attention network (DMAN) model with a multi-modal mechanism. The mechanism generates image captions by combining textual information, a visual sentinel, and attention information. The textual information is obtained from words through an LSTM, while the attention information is derived by another LSTM from extracted image features. The visual sentinel is derived from the first deliberate attention loop and then integrated into the context vector, which is fused in the multi-modal generation step. The visual sentinel mechanism adjusts the relative weighting of textual and visual information in the generation step. With deliberate multi-attention, the mechanism recognizes and understands the salient image objects more accurately, and thus mitigates error propagation and accumulation. Experimental results on the MSCOCO dataset show that the proposed DMAN model outperforms the Bottom-Up-Top-Down model overall on four quantitative metrics. Compared with the previous deliberate attention model, the DMAN model attends to finer-grained parts of the image and generates better captions.
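To make the sentinel-gated fusion step concrete, the sketch below shows one plausible form of the mechanism described above: attention is computed jointly over image region features and a visual sentinel, the sentinel weight acts as a gate between visual and textual evidence, and the gated context is fused to produce word logits. The dimensions, layer names, and the exact fusion formula are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (PyTorch) of sentinel-gated multi-modal fusion for one decoding step.
# All hyperparameters and the fusion form are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentinelFusion(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.att_v = nn.Linear(feat_dim, hidden_dim)     # project region features
        self.att_h = nn.Linear(hidden_dim, hidden_dim)   # project language-LSTM hidden state
        self.att_s = nn.Linear(hidden_dim, hidden_dim)   # project visual sentinel
        self.att_w = nn.Linear(hidden_dim, 1)            # attention scores
        self.fuse  = nn.Linear(feat_dim + hidden_dim, hidden_dim)
        self.out   = nn.Linear(hidden_dim, vocab_size)

    def forward(self, regions, h_lang, sentinel):
        # regions:  (B, R, feat_dim) image region features from the attention-LSTM branch
        # h_lang:   (B, hidden_dim)  hidden state of the language (text) LSTM
        # sentinel: (B, hidden_dim)  visual sentinel from the first deliberate attention loop
        scores_v = self.att_w(torch.tanh(self.att_v(regions) +
                                         self.att_h(h_lang).unsqueeze(1)))       # (B, R, 1)
        score_s  = self.att_w(torch.tanh(self.att_s(sentinel) +
                                         self.att_h(h_lang))).unsqueeze(1)       # (B, 1, 1)
        alpha = F.softmax(torch.cat([scores_v, score_s], dim=1), dim=1)          # (B, R+1, 1)
        beta  = alpha[:, -1]                        # sentinel weight: lean on text vs. vision
        ctx_v = (alpha[:, :-1] * regions).sum(1)    # attended visual context    (B, feat_dim)
        # beta adjusts the ratio between textual and visual information in the fused vector
        context = self.fuse(torch.cat([(1 - beta) * ctx_v,
                                       beta * sentinel + (1 - beta) * h_lang], dim=-1))
        return self.out(torch.tanh(context))        # word logits for this time step

# Example usage: 2 images, 36 detected regions each
fusion = SentinelFusion()
logits = fusion(torch.randn(2, 36, 2048), torch.randn(2, 512), torch.randn(2, 512))
```

A larger sentinel weight shifts the generation step toward textual information (useful for function words), while a smaller weight emphasizes the attended visual context; a second deliberate pass would refine this attention before the final word is emitted.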