Abstract: Sports video captioning in real-world applications requires captions that identify specific player entities and game scenes, yet such fine-grained information is difficult to extract from the video content alone. This paper introduces EIKA, an Explicit & Implicit Knowledge-Augmented Network for Entity-Aware Sports Video Captioning, which leverages both explicit game-related knowledge (i.e., the set of involved player entities) and implicit visual scene knowledge extracted from the training set. The Entity-Video Interaction Module (EVIM) and the Video-Knowledge Interaction Module (VKIM) enhance the extraction of entity-related and scene-specific video features, respectively, while the Spatial-Temporal Modeling Module (STMM) encodes the spatio-temporal information in the video. Finally, the Scene-To-Entity (STE) decoder exploits both kinds of knowledge to generate informative captions through a distributed decoding approach. Extensive evaluations on the VC-NBA-2022, Goal, and NSVA datasets demonstrate that our method outperforms existing approaches.
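The following is a minimal PyTorch-style sketch of how the named modules could compose into a forward pass. The module names follow the abstract, but all internals (feature dimensions, the use of cross-attention for EVIM/VKIM, and the single-pass decoder stub) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class EIKASketch(nn.Module):
    """Hypothetical high-level pipeline: STMM -> EVIM/VKIM -> STE decoder stub."""

    def __init__(self, d_model=512, n_heads=8, vocab_size=10000):
        super().__init__()
        # STMM: encodes spatio-temporal structure of the video features
        # (sketched here as a single Transformer encoder layer).
        self.stmm = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # EVIM: cross-attention from video features to explicit entity
        # (player) embeddings -- the explicit game-related knowledge.
        self.evim = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # VKIM: cross-attention from video features to implicit scene
        # knowledge retrieved from the training set.
        self.vkim = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # STE decoder stub: fuses scene- and entity-aware features and
        # predicts per-step token logits (the real decoder is autoregressive).
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, video, entities, scenes):
        v = self.stmm(video)                             # (B, T, d)
        ent_feat, _ = self.evim(v, entities, entities)   # entity-aware features
        scn_feat, _ = self.vkim(v, scenes, scenes)       # scene-aware features
        h = self.fuse(torch.cat([ent_feat, scn_feat], dim=-1))
        return self.head(h)                              # (B, T, vocab) logits

# Toy usage with random tensors standing in for video, entity, and scene features.
B, T, d = 2, 16, 512
model = EIKASketch(d_model=d)
logits = model(torch.randn(B, T, d), torch.randn(B, 5, d), torch.randn(B, 8, d))
print(logits.shape)  # torch.Size([2, 16, 10000])
```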