Multimodal Content Alignment with LLM for Visual Presentation of Papers

Huiying Hu, Zhicheng He, Yixiao Zhou, Tongwei Zhang, Xiaoqing Lyu

Published: 01 Jan 2026, Last Modified: 16 Jan 2026, Crossref, CC BY-SA 4.0
Abstract: The rapid growth of scientific literature creates a pressing need for tools that can efficiently distill complex research papers into accessible formats. Existing approaches to automated slide generation from papers focus primarily on textual content but overlook the critical role of visual elements in scientific communication. We propose Paper2PPT, a novel framework that generates visual presentations of scientific papers, prioritizing the integration of visual elements and their explanatory contexts through systematic cross-modal alignment. Our approach addresses key challenges in aligning visual elements with their associated text, including robust caption localization, morphology-aware OCR consolidation, and caption-anchored visual-semantic reasoning. By leveraging document structure heuristics and spatial reasoning, the framework establishes precise figure-text relationships without requiring multimodal training data. Experiments show that the proposed method generates more accurate and comprehensive visual presentations of scientific papers.
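To make the idea of caption localization by document structure heuristics and spatial reasoning concrete, the sketch below pairs each figure region with the nearest caption-like text block. It is an illustrative assumption, not the Paper2PPT implementation: the `Box` type, the caption regular expression, and the helpers `is_caption`, `vertical_gap`, and `match_captions` are all hypothetical names, and the page coordinates are assumed to grow downward.

```python
import re
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box on a page (hypothetical layout type)."""
    x0: float
    y0: float
    x1: float
    y1: float


# Assumed heuristic: captions start with "Figure 3:", "Fig. 3.", or "Table 3:".
CAPTION_RE = re.compile(r"^(Fig(?:ure)?\.?|Table)\s*(\d+)\s*[:.]", re.IGNORECASE)


def is_caption(text: str) -> bool:
    """Return True when a text block looks like a figure or table caption."""
    return CAPTION_RE.match(text.strip()) is not None


def horizontal_overlap(a: Box, b: Box) -> float:
    """Width of the horizontal overlap between two boxes (0 if disjoint)."""
    return max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))


def vertical_gap(a: Box, b: Box) -> float:
    """Vertical distance between two boxes (0 if they overlap vertically)."""
    return max(0.0, max(a.y0, b.y0) - min(a.y1, b.y1))


def match_captions(figures: dict, text_blocks: list) -> dict:
    """Map each figure name to the closest caption sharing its column.

    figures: {name: Box}; text_blocks: [(text, Box), ...].
    """
    captions = [(text, box) for text, box in text_blocks if is_caption(text)]
    matches = {}
    for name, fig_box in figures.items():
        # Keep only captions that overlap the figure horizontally.
        candidates = [
            (vertical_gap(fig_box, cap_box), text)
            for text, cap_box in captions
            if horizontal_overlap(fig_box, cap_box) > 0
        ]
        if candidates:
            matches[name] = min(candidates, key=lambda c: c[0])[1]
    return matches


figures = {"img1": Box(50, 100, 300, 300), "img2": Box(50, 400, 300, 600)}
text_blocks = [
    ("Figure 1: Overview of the pipeline.", Box(50, 305, 300, 320)),
    ("We propose a method.", Box(50, 330, 300, 390)),
    ("Fig. 2. Results.", Box(50, 605, 300, 620)),
]
print(match_captions(figures, text_blocks))
```

A rule-based matcher of this kind needs no training data, which is in the spirit of the abstract's claim, but a real system would also have to handle captions above figures, multi-column layouts, and OCR noise.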