Abstract: In this work, we propose the use of "aligned video captions" as an intelligent mechanism for integrating video content into retrieval augmented generation (RAG) based AI assistant systems. These captions serve as an efficient representation layer between videos and large language models (LLMs), describing both visual and audio content while requiring significantly less context window space compared to traditional frame sampling approaches. We demonstrate how this representation enables more effective agent-based retrieval and generation capabilities, with captions that can be dynamically adapted through targeted prompting or fine-tuning of the underlying models. Our empirical evaluation across multiple LLM configurations shows that this approach achieves comparable performance to direct video processing while being more computationally efficient and easier to reason about in downstream tasks. Notably, the approach shows particular strength in procedural content like How-To videos, where aligned captions significantly outperform speech-only baselines.