Video Content Summarization with Large Language-Vision Models

Published: 01 Jan 2024, Last Modified: 19 May 2025 · IEEE Big Data 2024 · CC BY-SA 4.0
Abstract: We present a modular pipeline for summarizing broadcast news videos using large language and vision models, integrating Whisper for ASR, TransNetV2 for shot segmentation, LLaVA for image captioning, and LLaMA for generating structured summaries. The pipeline is implemented on the CLAMS platform, with components interoperating through the Multimedia Interchange Format (MMIF), and it combines ASR transcripts with image captions to enhance metadata extraction. We evaluate the pipeline with automated metrics computed against user-generated YouTube video descriptions as well as with human assessments. Our analysis highlights the limitations of automated metrics and emphasizes the value of human evaluation for nuanced assessment. This work demonstrates the effectiveness of multimodal summarization for video metadata extraction and paves the way for improved video accessibility.
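
The composition described in the abstract can be illustrated with a minimal sketch of the data flow. The snippet below is not the authors' implementation: it assumes the openai-whisper package and the TransNetV2 reference inference code, and the caption_keyframes and summarize helpers are hypothetical stand-ins for the LLaVA and LLaMA stages (and for the MMIF plumbing that CLAMS provides).

# Minimal sketch of the pipeline's data flow; not the authors' implementation.
# Assumes the openai-whisper package and the TransNetV2 inference code from
# its reference repository; import paths may differ in practice.
import whisper                      # ASR over the audio track
from transnetv2 import TransNetV2   # shot-boundary detection (assumed import path)


def transcribe(video_path: str) -> str:
    """Transcribe the video's audio with Whisper."""
    model = whisper.load_model("base")
    return model.transcribe(video_path)["text"]


def detect_shots(video_path: str) -> list[tuple[int, int]]:
    """Return (start_frame, end_frame) pairs for each detected shot."""
    model = TransNetV2()
    _, single_frame_preds, _ = model.predict_video(video_path)
    return [tuple(scene) for scene in model.predictions_to_scenes(single_frame_preds)]


def caption_keyframes(video_path: str, shots: list[tuple[int, int]]) -> list[str]:
    """Hypothetical helper: caption one keyframe per shot with a LLaVA checkpoint."""
    raise NotImplementedError("e.g. wrap transformers' LlavaForConditionalGeneration")


def summarize(transcript: str, captions: list[str]) -> str:
    """Hypothetical helper: prompt a LLaMA model with transcript and captions."""
    raise NotImplementedError("e.g. a LLaMA chat model with a structured-summary prompt")


if __name__ == "__main__":
    video = "newscast.mp4"                    # placeholder input
    transcript = transcribe(video)
    shots = detect_shots(video)
    captions = caption_keyframes(video, shots)
    print(summarize(transcript, captions))

In the paper's actual pipeline, each stage exchanges its annotations as MMIF documents on the CLAMS platform rather than passing Python objects directly, as sketched above.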