Paper2Video: Automatic Video Generation from Scientific Papers

Published: 28 Sept 2025, Last Modified: 09 Oct 2025
SEA @ NeurIPS 2025 Poster
License: CC BY 4.0
Keywords: AI for Research; Benchmark; Multi-Agent; Video Generation;
TL;DR: Paper2Video -- Automatic Video Generation from Scientific Papers via Multi-Agent System
Abstract: Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2–10 minute video. Unlike natural video, presentation video generation involves distinctive challenges: long-context inputs from research papers, dense multimodal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and human talkers. To address these challenges, we introduce Paper2Video, the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We further design three tailored evaluation metrics (Meta Similarity, PresentArena, and PresentQuiz) to measure presentation engagement and knowledge conveyance. Building on this foundation, we propose PaperTalker, the first multi-agent framework for academic presentation video generation. It integrates Beamer slide generation with layout refinement by Monte Carlo tree search, cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments on Paper2Video demonstrate that the presentation videos produced by our approach are more faithful and informative than those of existing baselines, establishing a practical step toward automated and ready-to-use academic video generation. Our dataset, agent, and code will be fully open-sourced to power the community.
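The abstract describes a slide-wise pipeline whose per-slide stages (subtitling, speech synthesis, talking-head rendering) are independent and can run in parallel. The sketch below illustrates only that parallelization pattern; it is not the authors' released code, and the helper names (make_subtitle, synthesize_speech, render_talking_head, compose_clip) are hypothetical placeholders for the corresponding stages.

```python
# Minimal sketch of parallel, slide-wise generation as described in the abstract.
# All helpers below are hypothetical stand-ins, not the PaperTalker implementation.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Slide:
    index: int
    content: str  # e.g., rendered Beamer slide text or its source


def make_subtitle(slide: Slide) -> str:
    # Placeholder for subtitle generation from the slide content.
    return f"Narration for slide {slide.index}: {slide.content[:40]}"


def synthesize_speech(subtitle: str) -> bytes:
    # Placeholder for a text-to-speech call.
    return subtitle.encode()


def render_talking_head(audio: bytes) -> bytes:
    # Placeholder for a talking-head renderer driven by the audio.
    return audio


def compose_clip(slide: Slide) -> dict:
    # Each slide's clip depends only on that slide, so slides can be processed in parallel.
    subtitle = make_subtitle(slide)
    audio = synthesize_speech(subtitle)
    video = render_talking_head(audio)
    return {"slide": slide.index, "subtitle": subtitle, "audio": audio, "video": video}


def generate_presentation(slides: list[Slide]) -> list[dict]:
    # Fan out per-slide work, then restore slide order before concatenation.
    with ThreadPoolExecutor() as pool:
        clips = list(pool.map(compose_clip, slides))
    return sorted(clips, key=lambda c: c["slide"])


if __name__ == "__main__":
    demo = [Slide(i, f"Section {i} of the paper") for i in range(1, 4)]
    print([c["slide"] for c in generate_presentation(demo)])
```

In practice the per-slide stages would call external models, so a process pool or an async task queue may be a better fit than threads; the ordering step matters because parallel completion order need not match slide order.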
Archival Option: The authors of this submission do *not* want it to appear in the archival proceedings.
Submission Number: 44