VLM-SlideEval: Evaluating VLMs on Structured Comprehension and Perturbation Sensitivity in PPT

Published: 24 Sept 2025 · Last Modified: 24 Sept 2025 · NeurIPS 2025 LLM Evaluation Workshop Poster · CC BY 4.0
Keywords: Vision-language models; Multimodal evaluation; Document understanding; Slide presentations (PowerPoint); Agentic evaluation; Critic-in-the-loop; LLM-as-a-judge; Robustness & controlled perturbations; Pixel-accurate extraction; Layout & typography; Evaluation framework
TL;DR: We test VLMs on slides, from pixel-level extraction and perturbation sensitivity to deck-level narrative, revealing gaps that call for calibrated critic-in-the-loop evaluators.
Abstract: Vision-language models (VLMs) are increasingly used to evaluate multimodal content, including presentation slides, yet their slide-specific understanding remains underexplored \emph{despite their growing role as critics in agentic, model-forward pipelines}. We introduce \textbf{VLM-SlideEval}, an evaluation framework that probes VLMs along three axes: (1) element-level extraction from slide images aligned to ground truth; (2) robustness to controlled perturbations in geometry, style, and text; and (3) higher-level comprehension, such as recovering a deck's narrative order from shuffled slides. Using publicly available decks from Zenodo\footnote{\scriptsize{\url{https://zenodo.org}; HF viewer: \hfurl}}, we standardize ground-truth element metadata from PowerPoint XML and live renderings into a unified, verifiable schema. Empirically, VLMs underperform on pixel-accurate extraction and show non-trivial agreement, fidelity, and consistency under controlled perturbations, while performing better on single-slide content understanding; however, they do not reliably capture narrative structure across slides. These results highlight the limits of current VLMs for slide evaluation and motivate calibrated, critic-in-the-loop evaluators that drive iterative refinement and selection in agentic pipelines.
Submission Number: 175
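As an illustration of the ground-truth standardization step described in the abstract, the sketch below flattens per-element geometry and text from a deck's PowerPoint XML into plain records, assuming the python-pptx library; the extract_slide_elements helper and its field names are hypothetical and are not the paper's actual unified schema.

# Minimal sketch: collect per-shape geometry and text for every slide.
from pptx import Presentation
from pptx.util import Emu

def extract_slide_elements(pptx_path):
    """Return one dict per shape with position/size (in points) and text."""
    prs = Presentation(pptx_path)
    records = []
    for slide_idx, slide in enumerate(prs.slides):
        for shape in slide.shapes:
            records.append({
                "slide_index": slide_idx,
                "shape_name": shape.name,
                "shape_type": str(shape.shape_type),
                # Offsets and sizes are stored as EMUs; convert to points for readability.
                "left_pt": Emu(shape.left).pt if shape.left is not None else None,
                "top_pt": Emu(shape.top).pt if shape.top is not None else None,
                "width_pt": Emu(shape.width).pt if shape.width is not None else None,
                "height_pt": Emu(shape.height).pt if shape.height is not None else None,
                "text": shape.text_frame.text if shape.has_text_frame else None,
            })
    return records

if __name__ == "__main__":
    for rec in extract_slide_elements("deck.pptx"):
        print(rec)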