Abstract: We introduce OpenLEAF, a benchmark for the open-domain interleaved image-text generation task. This task aims to generate arbitrarily interleaved multimodal content from input queries. It goes beyond single-modality image or text generation, enabling novel applications such as visual storybooks and how-to instructions. Despite the importance of the task, no established benchmark exists, owing to the challenges of defining evaluation scenarios and formulating effective metrics. In this study, we collect a dataset covering queries with various input-output formats and $10$ different application scenarios. We also propose an evaluation pipeline named ``detection-summarization-scoring,'' which decomposes the evaluation into multiple reasoning steps. The pipeline leverages large multimodal models (LMMs) to assess ten aspects of the generated content, whose scores are aggregated into a final rating. With experiments on a proposed agent system, we demonstrate that our evaluation method aligns closely with human judgments, offering a robust benchmark for assessing interleaved image-text generation.
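To make the three-stage pipeline concrete, below is a minimal sketch of what a ``detection-summarization-scoring'' evaluation loop could look like. Everything here is an illustrative assumption: the `query_lmm` helper, the prompt wording, and the placeholder aspect list are hypothetical stand-ins, not the paper's actual prompts, aspect definitions, or aggregation scheme.

```python
# Hypothetical sketch of a "detection-summarization-scoring" evaluation
# pipeline. The helper, prompts, and aspect names are illustrative
# assumptions, not the paper's implementation.

from statistics import mean

# Placeholder subset of the ten evaluated aspects (assumed names).
ASPECTS = ["text quality", "image quality", "image-text coherence"]


def query_lmm(prompt: str) -> str:
    """Placeholder for a call to a large multimodal model (LMM)."""
    raise NotImplementedError("Wire this to an actual LMM API.")


def evaluate(query: str, segments: list[str]) -> float:
    # Step 1: detection -- extract the salient elements (objects,
    # styles, narrative beats) from each image/text segment.
    detections = [
        query_lmm(f"Describe the key elements of this segment: {seg}")
        for seg in segments
    ]

    # Step 2: summarization -- condense the per-segment detections into
    # one account of the interleaved output as a whole.
    summary = query_lmm(
        "Summarize these per-segment descriptions into one description "
        f"of the full interleaved output:\n{detections}"
    )

    # Step 3: scoring -- rate each aspect, then aggregate into a final
    # rating (a simple mean is used here purely for illustration).
    scores = []
    for aspect in ASPECTS:
        reply = query_lmm(
            f"Query: {query}\nSummary: {summary}\n"
            f"Rate the {aspect} from 1 to 10. Reply with a number only."
        )
        scores.append(float(reply.strip()))
    return mean(scores)
```

The decomposition into explicit reasoning steps is the point of the design: rather than asking an LMM for a single holistic score, detection and summarization first ground the judgment in the content before any rating is produced.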
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Data resources
Languages Studied: English