Abstract: Highlights•A new task of video-based instruction generation that requires commonsense knowledge.•A new model using implicit and explicit commonsense to enhance sentence prediction.•We analyze the contributions of knowledge made toward improved caption quality.•Our model also achieves state-of-the-art performance on dense video captioning tasks.
Loading