Abstract: Annotation of multimedia data by humans is time-consuming and costly, while reliable automatic generation of semantic metadata is a major challenge. We propose a framework to extract semantic metadata solely from automatically generated video captions. As metadata, we consider entities, the entities’ properties, relations between entities, and the video category. Our framework combines automatic video captioning models with natural language processing (NLP) methods. We use state-of-the-art dense video captioning models with masked transformer (MT) and parallel decoding (PVDC) to generate captions for videos of the ActivityNet Captions dataset. We analyze the output of the video captioning models using NLP methods. We evaluate the performance of our framework for each metadata type, while varying the amount of information the video captioning model provides. Our experiments show that it is possible to extract high-quality entities, their properties, and relations between entities. For categorizing a video based on the generated captions, however, the results leave room for improvement. We observe that the quality of the extracted information is mainly influenced by the dense video captioning model’s capability to locate events in the video and to generate the event captions.
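The extraction step described above (entities, properties, and relations from a generated caption) can be illustrated with a minimal, hypothetical sketch. The real framework applies full NLP pipelines (e.g. part-of-speech tagging and dependency parsing); this stand-in uses a tiny hand-written lexicon purely to show the shape of the output, and all word lists and the example caption are assumptions, not taken from the paper.

```python
# Hypothetical, simplified metadata extractor for a single generated caption.
# Entities are nouns, properties are adjectives attached to the following noun,
# and relations are (subject, verb, object) triples.

ADJECTIVES = {"young", "red", "small"}   # toy lexicon (assumption)
VERBS = {"rides", "holds", "kicks"}      # toy lexicon (assumption)
STOPWORDS = {"a", "an", "the"}

def extract_metadata(caption: str):
    tokens = [t.strip(".,").lower() for t in caption.split()]
    tokens = [t for t in tokens if t not in STOPWORDS]
    entities, properties, relations = [], {}, []
    pending_props, last_entity, last_verb = [], None, None
    for tok in tokens:
        if tok in ADJECTIVES:
            pending_props.append(tok)          # hold until the next noun
        elif tok in VERBS:
            last_verb = tok                    # hold until the object noun
        else:                                  # treat the rest as entity nouns
            entities.append(tok)
            if pending_props:
                properties[tok] = pending_props
                pending_props = []
            if last_entity is not None and last_verb is not None:
                relations.append((last_entity, last_verb, tok))
                last_verb = None
            last_entity = tok
    return entities, properties, relations

entities, properties, relations = extract_metadata("A young man rides a red bike.")
# entities   -> ['man', 'bike']
# properties -> {'man': ['young'], 'bike': ['red']}
# relations  -> [('man', 'rides', 'bike')]
```

A real implementation would replace the lexicons with a tagger and parser, but the returned structure (entity list, property map, relation triples) matches the metadata types the framework targets.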
DOI: 10.1007/978-3-031-40837-3_17