Abstract: Highlights
• We propose a fully-attentive and iterative network for controllable image captioning.
• We design novel attention operators that can deal with region-based control signals.
• We introduce a decoder which explicitly focuses on each part of the control signal.
• State-of-the-art performance on both image and video controllable captioning.