Abstract: Highlights •A new dual temporal model is proposed for image captioning.•Word-conditional semantic attention is proposed for functional-words•generation.•A self-balancing model is exploited to balance the visual and semantic attention.
Loading