Paper2Video: Automatic Video Generation from Scientific Papers

Zeyu Zhu; Kevin Qinghong Lin; Mike Zheng Shou

Paper2Video: Automatic Video Generation from Scientific Papers

Zeyu Zhu, Kevin Qinghong Lin, Mike Zheng Shou

15 Sept 2025 (modified: 11 Feb 2026)Submitted to ICLR 2026EveryoneRevisionsBibTeXCC BY 4.0

Keywords: AI for Research; Benchmark; Multi-Agent; Video Generation;

TL;DR: Paper2Video -- Automatic Video Generation from Scientific Papers via Multi-Agent System

Abstract: Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2 to 10 minutes video. Unlike natural video, presentation video generation involves distinctive challenges: long-context inputs from research papers, dense multi-modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and human talker. To address these challenges, we introduce Bench, the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We further design four tailored evaluation metrics—Meta Similarity, PresentArena, PresentQuiz, and IP Memory—to measure how videos convey the paper's information to the audience. Building on this foundation, we propose PaperTalker, the first multi-agent framework for academic presentation video generation. It integrates slide generation with effective layout refinement by a novel Tree Search Visual Choice, cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments on Bench demonstrate that the presentation videos produced by our approach are more faithful and informative than existing baselines, establishing a practical step toward automated and ready-to-use academic video generation.

Supplementary Material: zip

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 5550

Loading