Abstract: Highlights
• A more effective two-stage pre-training strategy is used for video description.
• A visual and language representation enhancing method is proposed.
• A visual sequential mean pooling method is proposed to further improve performance.