VIBE: Annotation-Free Video-to-Text Information Bottleneck Evaluation for TL;DR

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Human-AI Interaction, Video-to-Text Evaluation, Information Bottleneck
TL;DR: This paper proposes VIBE, an annotation-free method that selects video summaries for human decision-making by scoring task relevance and visual grounding without retraining.
Abstract: Many decision-making tasks, where both accuracy and efficiency matter, still require human supervision. For example, tasks like traffic officers reviewing hour-long dashcam footage or researchers screening conference videos can benefit from concise summaries that reduce cognitive load and save time. Yet current vision-language models (VLMs) often produce verbose, redundant outputs that hinder task performance. Existing video caption evaluation depends on costly human annotations and overlooks the summaries' utility in downstream tasks. We address these gaps with $\underline{\textbf{V}}$ideo-to-text $\underline{\textbf{I}}$nformation $\underline{\textbf{B}}$ottleneck $\underline{\textbf{E}}$valuation (VIBE), an annotation-free method that scores VLM outputs using two metrics: $\textit{grounding}$ (how well the summary aligns with visual content) and $\textit{utility}$ (how informative it is for the task). VIBE selects from randomly sampled VLM outputs by ranking them according to the two scores to support effective human decision-making. Human studies on $\texttt{LearningPaper24}$, $\texttt{SUTD-TrafficQA}$, and $\texttt{LongVideoBench}$ show that summaries selected by VIBE consistently improve performance—boosting task accuracy by up to $61.23$% and reducing response time by $75.77$% compared to naive VLM summaries or raw video.
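The abstract describes selection as ranking randomly sampled VLM summaries by the two scores. Below is a minimal sketch of that selection step, not the authors' implementation: the scoring functions `grounding_score` and `utility_score` are hypothetical stand-ins for the paper's metrics, and the additive combination is an assumption, since the abstract does not specify how the two scores are combined into a ranking.

```python
# Minimal sketch of VIBE-style selection (assumed interfaces, not the authors' code).
from typing import Callable, List


def select_summary(
    candidates: List[str],
    grounding_score: Callable[[str], float],  # alignment with visual content (assumed signature)
    utility_score: Callable[[str], float],    # informativeness for the downstream task (assumed signature)
) -> str:
    """Rank sampled VLM summaries by the two scores and return the top-ranked one."""
    # The exact combination rule (weighting, Pareto filtering, etc.) is not given
    # in the abstract; a simple sum is assumed here purely for illustration.
    return max(candidates, key=lambda s: grounding_score(s) + utility_score(s))
```

In practice, `candidates` would be several summaries sampled from the same VLM for one video, and the selected summary is the one shown to the human decision-maker.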
Primary Area: Evaluation (e.g., methodology, meta studies, replicability and validity, human-in-the-loop)
Submission Number: 13020