Paper2Slide: A Multi-Agent Framework for Automatic Scientific Slide Generation

07 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: multi-agent, large vision-language models
Abstract: Generating academic slides from scientific papers is challenging: it requires reasoning over long contexts and careful layout planning. Most prior work, however, treats the task as text summarization, overlooking the inherent complexity of intra-slide visual design. To tackle this challenge, we propose \textbf{SlideGen}, a modular, visual-in-the-loop agentic pipeline for paper-to-slide generation in which six VLM workers collaborate: it plans the outline (Outliner), matches figures, tables, and equations to outline bullets (Mapper/Formulizer), lays out pages via template selection (Arranger), writes speaker notes (Speaker), and refines slides through merging and emphasis (Refiner). To better evaluate the quality of generated slides, we further release the \textbf{Paper2Slide Benchmark} of paper–slide pairs together with automated evaluation protocols: \textit{(i)} Visual Aesthetics -- a geometry-aware density score for layout balance and spacing; \textit{(ii)} Holistic Assessment -- VLM-as-judge criteria covering content, design, and coherence, enabling reliable, end-to-end assessment; \textit{(iii)} Communication Effectiveness -- SlideQA, a question-answering task that measures how well presentation slides convey information; and \textit{(iv)} Textual Coherence -- a measure of textual fluency. Against a diverse set of strong baselines, \textbf{SlideGen} achieves strong results on all four metrics and outperforms competing methods, approaching human-level slide-making quality. Our framework identifies promising directions for building the next generation of end-to-end slide generators. The code is available for full reproducibility at \href{https://github.com/anaymoysuser/SlideGen}{Anonymous GitHub}.
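The staged worker pipeline described in the abstract can be pictured roughly as follows. This is an illustrative sketch only: the data structures, function signature, and all stand-in heuristics are assumptions made for exposition, not SlideGen's actual implementation; only the six role names (Outliner, Mapper/Formulizer, Arranger, Speaker, Refiner) come from the abstract.

```python
from dataclasses import dataclass, field

@dataclass
class Slide:
    bullets: list
    assets: list = field(default_factory=list)  # matched figures/tables/equations
    template: str = "default"                   # chosen layout template
    notes: str = ""                             # speaker notes

def generate_slides(paper_sections, figures, templates):
    """Run the six hypothetical workers in the order the abstract describes."""
    # Outliner: one slide of bullets per paper section (stand-in logic).
    deck = [Slide(bullets=[s["title"]] + s["points"]) for s in paper_sections]
    # Mapper/Formulizer: attach each asset to the slide for its source section.
    for fig in figures:
        for slide in deck:
            if fig["section"] == slide.bullets[0]:
                slide.assets.append(fig["id"])
    # Arranger: select a template based on how much content the slide holds.
    for slide in deck:
        slide.template = templates["rich"] if slide.assets else templates["text"]
    # Speaker: draft notes from the bullets (a VLM call in the real pipeline).
    for slide in deck:
        slide.notes = " ".join(slide.bullets)
    # Refiner: merge near-empty slides into their predecessor.
    refined = []
    for slide in deck:
        if refined and len(slide.bullets) <= 1 and not slide.assets:
            refined[-1].bullets += slide.bullets
        else:
            refined.append(slide)
    return refined
```

In the actual system each stage would be a VLM worker operating on rendered slide images (the "visual-in-the-loop" part), rather than the rule-based stand-ins above; the sketch only fixes the stage ordering and hand-off structure.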
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2854