MAViS2: A Multi-Agent Framework for Interactive and Adaptive Long-Sequence Video Storytelling

ACL ARR 2026 January Submission3352 Authors

04 Jan 2026 (modified: 20 Mar 2026) | ACL ARR 2026 January Submission | Readers: Everyone | License: CC BY 4.0
Keywords: Video Generation, Multi-modal Language Model Agent, Long-Sequence Video Storytelling, Video Agent, Human-in-the-loop
Abstract: Existing long-sequence video generation frameworks often overlook scriptwriting, rely on a fixed T2I+I2V paradigm, lack post-production support, and offer limited user interaction, resulting in a poor viewing experience and limited user controllability. To address these limitations, we propose MAViS2, a multi-agent framework for interactive and adaptive long-sequence video storytelling. MAViS2 decomposes the video creation process into three coordinated stages: scriptwriting, video clip generation, and post-production, each handled by multiple specialized agents. In the scriptwriting stage, we propose a Scriptwriting Workflow that progressively improves the expressiveness of the scripts. In the video generation stage, MAViS2 uses Adaptive Generation Planning to select an appropriate generation strategy for each shot and dynamically adjusts it based on the global memory and the evaluation of generated results, thereby significantly increasing visual diversity while reducing constraints on scriptwriting. In the post-production stage, MAViS2 integrates basic video editing, voice-over synthesis, background music matching, and subtitle composition to improve the completeness of the final output. Moreover, MAViS2 natively supports Fine-grained Human-in-the-loop Control, allowing users to intervene and make fine-grained adjustments at any stage, thereby flexibly exploring diverse visual narratives and creative directions. Experimental results show that MAViS2 outperforms existing methods in terms of visual quality, narrative coherence, and overall viewing experience. MAViS2 offers a novel solution for long-sequence video storytelling and is capable of performing a wide range of storytelling tasks, including end-to-end generation, video understanding, Wikipedia-based external knowledge retrieval, prequel and sequel creation, and video remaking.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: NLP Applications, Generation, Dialogue and Interactive Systems
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 3352
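
The abstract describes Adaptive Generation Planning as choosing a generation strategy per shot and adjusting it using a global memory and an evaluation of the generated result. The following is a minimal sketch of that control loop, not the authors' implementation. All names here are hypothetical and chosen for illustration: the strategy labels in `STRATEGIES`, the `Shot` and `GlobalMemory` classes, the `generate` and `evaluate` callbacks, and the `threshold` value. The ordering heuristic (prefer the strategy with the best average score so far) is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical strategy labels. The abstract only names T2I+I2V as the
# fixed baseline paradigm; "t2v" is an assumed alternative.
STRATEGIES = ["t2i+i2v", "t2v"]


@dataclass
class Shot:
    shot_id: int
    description: str


@dataclass
class GlobalMemory:
    # One (shot_id, strategy, score) entry per shot, for the clip that was kept.
    history: List[Tuple[int, str, float]] = field(default_factory=list)


def order_strategies(memory: GlobalMemory) -> List[str]:
    """Rank strategies by their average kept score so far (assumed heuristic)."""
    def average(strategy: str) -> float:
        scores = [s for (_, name, s) in memory.history if name == strategy]
        return sum(scores) / len(scores) if scores else 0.0

    # sorted() is stable, so ties keep the default order in STRATEGIES.
    return sorted(STRATEGIES, key=lambda name: -average(name))


def plan_and_generate(
    shots: List[Shot],
    generate: Callable[[Shot, str, GlobalMemory], str],
    evaluate: Callable[[Shot, str], float],
    threshold: float = 0.7,
) -> GlobalMemory:
    """Try strategies per shot until one clip scores at least `threshold`."""
    memory = GlobalMemory()
    for shot in shots:
        best = None  # (strategy, score) of the best attempt for this shot
        for strategy in order_strategies(memory):
            clip = generate(shot, strategy, memory)
            score = evaluate(shot, clip)
            if best is None or score > best[1]:
                best = (strategy, score)
            if score >= threshold:
                break  # good enough, stop trying other strategies
        memory.history.append((shot.shot_id, best[0], best[1]))
    return memory
```

In this sketch, `generate` and `evaluate` stand in for the generation and evaluation agents; a caller would supply real model calls there. If no strategy reaches the threshold, the best-scoring attempt is kept so that every shot still yields a clip.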