Fine-Grained Captioning of Long Videos through Scene Graph Consolidation

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose a novel framework for long video captioning based on graph consolidation.
Abstract: Recent advances in vision-language models have led to impressive progress in caption generation for images and short video clips. However, these models remain constrained by their limited temporal receptive fields, making it difficult to produce coherent and comprehensive captions for long videos. While several methods have been proposed to aggregate information across video segments, they often rely on supervised fine-tuning or incur significant computational overhead. To address these challenges, we introduce a novel framework for long video captioning based on graph consolidation. Our approach first generates segment-level captions, corresponding to individual frames or short video intervals, using off-the-shelf visual captioning models. These captions are then parsed into individual scene graphs, which are subsequently consolidated into a unified graph representation that preserves both holistic context and fine-grained details throughout the video. A lightweight graph-to-text decoder then produces the final video-level caption. This framework effectively extends the temporal understanding capabilities of existing models without requiring any additional fine-tuning on long video datasets. Experimental results show that our method significantly outperforms existing LLM-based consolidation approaches, achieving strong zero-shot performance while substantially reducing computational costs.
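For illustration, the sketch below outlines the pipeline described in the abstract: segment-level captioning, scene graph parsing, graph consolidation, and graph-to-text decoding. The function and class names (`caption_long_video`, `consolidate_graphs`, `SceneGraph`, and the `captioner`/`parser`/`decoder` callables) are hypothetical placeholders assumed for this sketch, not the authors' actual implementation, and the consolidation step is simplified to a union of entities and relation triplets.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    nodes: set = field(default_factory=set)   # entities, e.g. "person", "dog"
    edges: set = field(default_factory=set)   # (subject, relation, object) triplets

def consolidate_graphs(graphs):
    """Merge segment-level scene graphs into a single video-level graph
    by unioning entities and relation triplets (a simplification of the
    consolidation step described in the abstract)."""
    merged = SceneGraph()
    for g in graphs:
        merged.nodes |= g.nodes
        merged.edges |= g.edges
    return merged

def caption_long_video(segments, captioner, parser, decoder):
    # 1) Segment-level captions from an off-the-shelf visual captioning model.
    captions = [captioner(seg) for seg in segments]
    # 2) Parse each caption into a scene graph.
    graphs = [parser(c) for c in captions]
    # 3) Consolidate into a unified graph that keeps holistic context and details.
    video_graph = consolidate_graphs(graphs)
    # 4) Lightweight graph-to-text decoding into the final video-level caption.
    return decoder(video_graph)
```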
Lay Summary: Recent advances in AI models have led to impressive progress in caption generation for images and short video clips; however, captioning long videos remains challenging because these models have limited temporal understanding. To address this challenge, we introduce a novel framework for long video captioning based on graph consolidation. The proposed framework achieves stronger captioning performance at lower computational cost, without requiring additional training on long videos.
Primary Area: Applications->Computer Vision
Keywords: long video captioning, zero-shot video captioning, scene graph
Submission Number: 15674