Keywords: agent, video intelligence, large language model
Abstract: Recent breakthroughs in visual AI have largely treated video tasks in isolation, with specialized models excelling at generation, editing, segmentation, or understanding, but not all at once. We introduce \textbf{UniVA}, a multi-agent framework for universal video intelligence that unifies video understanding, segmentation, editing, and generation within complex workflows. UniVA employs a Plan-and-Act dual-agent architecture: a planner agent decomposes high-level user requests into a sequence of video-processing steps, and executor agents carry out these steps using specialized modular tool servers (for video analysis, generation, editing, object tracking, \textit{etc.}). Through a multi-level memory design (global knowledge, task context, and user-specific memory), UniVA supports long-horizon reasoning and inter-agent communication while maintaining full traceability of each action.
This design enables iterative and composite video workflows (\textit{e.g.}, image $\rightarrow$ video generation $\rightarrow$ video editing $\rightarrow$ object segmentation $\rightarrow$ content composition) that were previously cumbersome to achieve with single-purpose models or monolithic video-language models. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation, to rigorously evaluate such agentic video systems. Both UniVA and UniVA-Bench are open-sourced to the community, with the aim of catalyzing next-generation video intelligence research.
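To make the Plan-and-Act description above concrete, the following is a minimal, hypothetical sketch of a planner/executor loop with the three memory levels the abstract mentions; all class names, tool names, and method signatures here are illustrative assumptions, not the actual UniVA API.

```python
# Hypothetical sketch of a Plan-and-Act loop with three memory levels,
# loosely following the abstract's description. Names are illustrative.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Memory:
    """Three-level memory: global knowledge, task context, user-specific memory."""
    global_knowledge: Dict[str, str] = field(default_factory=dict)
    task_context: List[str] = field(default_factory=list)
    user_profile: Dict[str, str] = field(default_factory=dict)


@dataclass
class Step:
    tool: str          # name of a tool server, e.g. "generate", "edit"
    instruction: str   # natural-language instruction for the executor


class Planner:
    """Decomposes a high-level request into an ordered list of steps."""

    def plan(self, request: str) -> List[Step]:
        # A real planner would query an LLM; here one pipeline is hard-coded.
        return [
            Step("generate", f"Generate a video for: {request}"),
            Step("edit", "Apply the requested style edits"),
            Step("segment", "Segment and track the main object"),
            Step("compose", "Composite the segmented object into the final cut"),
        ]


class Executor:
    """Dispatches each step to the matching tool server and logs the result."""

    def __init__(self, tools: Dict[str, Callable[[str], str]], memory: Memory):
        self.tools = tools
        self.memory = memory

    def run(self, steps: List[Step]) -> List[str]:
        outputs = []
        for step in steps:
            result = self.tools[step.tool](step.instruction)
            # Record every action in task context for full traceability.
            self.memory.task_context.append(f"{step.tool}: {result}")
            outputs.append(result)
        return outputs


if __name__ == "__main__":
    # Stub tool servers standing in for video generation/editing/segmentation.
    tools = {name: (lambda inst, n=name: f"[{n} done] {inst}")
             for name in ("generate", "edit", "segment", "compose")}
    memory = Memory(user_profile={"style": "cinematic"})
    steps = Planner().plan("a time-lapse of a city at dusk")
    for line in Executor(tools, memory).run(steps):
        print(line)
```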
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4064