Abstract: Highlights
• A knowledge encoder offers high-level representations for video summarization.
• The multimodal fusion module exploits the complementary features across modalities.
• Experimental results show that the proposed method outperforms the baselines.