Abstract: We introduce OpenLEAF, a benchmark for the open-domain interleaved image-text generation task. This task aims to generate arbitrarily interleaved multimodal content from input queries. It goes beyond single-modality image or text generation, enabling novel applications such as visual storybooks and how-to instructions. Despite the importance of the task, no established benchmark exists, owing to the challenges of defining evaluation scenarios and formulating effective metrics. In this study, we collect a dataset covering queries with various input-output formats and $10$ different application scenarios. We also propose an evaluation pipeline named ``detection-summarization-scoring,'' which decomposes the evaluation into multiple reasoning steps. The pipeline leverages large multimodal models (LMMs) to assess ten aspects of the generated content, whose scores are aggregated into a final rating. With experiments on a proposed agent system, we demonstrate that our evaluation method aligns closely with human judgments, offering a robust benchmark for assessing interleaved image-text generation.
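To make the three-stage pipeline concrete, below is a minimal sketch of what a ``detection-summarization-scoring'' evaluation loop could look like. Everything here is an illustrative assumption: the `query_lmm` helper, the prompt wording, and the placeholder aspect list are hypothetical stand-ins, not the paper's actual prompts, aspect definitions, or aggregation scheme.

```python
# Hypothetical sketch of a "detection-summarization-scoring" evaluation
# pipeline. The helper, prompts, and aspect names are illustrative
# assumptions, not the paper's implementation.

from statistics import mean

# Placeholder subset of the ten evaluated aspects (assumed names).
ASPECTS = ["text quality", "image quality", "image-text coherence"]


def query_lmm(prompt: str) -> str:
    """Placeholder for a call to a large multimodal model (LMM)."""
    raise NotImplementedError("Wire this to an actual LMM API.")


def evaluate(query: str, segments: list[str]) -> float:
    # Step 1: detection -- extract the salient elements (objects,
    # styles, narrative beats) from each image/text segment.
    detections = [
        query_lmm(f"Describe the key elements of this segment: {seg}")
        for seg in segments
    ]

    # Step 2: summarization -- condense the per-segment detections into
    # one account of the interleaved output as a whole.
    summary = query_lmm(
        "Summarize these per-segment descriptions into one description "
        f"of the full interleaved output:\n{detections}"
    )

    # Step 3: scoring -- rate each aspect, then aggregate into a final
    # rating (a simple mean is used here purely for illustration).
    scores = []
    for aspect in ASPECTS:
        reply = query_lmm(
            f"Query: {query}\nSummary: {summary}\n"
            f"Rate the {aspect} from 1 to 10. Reply with a number only."
        )
        scores.append(float(reply.strip()))
    return mean(scores)
```

The decomposition into explicit reasoning steps is the point of the design: rather than asking an LMM for a single holistic score, detection and summarization first ground the judgment in the content before any rating is produced.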
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Data resources
Languages Studied: English