Abstract: We introduce OpenLEAF, a benchmark for the emerging task of open-domain interleaved image-text generation, which aims to produce arbitrarily interleaved multimodal content from input queries. The task goes beyond the commonly studied single-modality generation of images or text, enabling novel applications such as visual storybooks and how-to instructions. Despite its importance, no established benchmark exists, owing to the challenges of defining evaluation scenarios and formulating effective metrics. To facilitate this new task, we create a dataset covering queries with diverse input-output formats across ten application scenarios. We also propose a novel evaluation pipeline, "detection-summarization-scoring," which decomposes evaluation into multiple reasoning steps: it leverages large multimodal models (LMMs) to assess ten aspects of the generated content and aggregates them into a final rating. Through experiments on a proposed agent system, we show that our evaluation method aligns closely with human judgments; together with the dataset, it offers the research community a valuable benchmark for exploring interleaved image-text generation.
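To make the three-stage pipeline concrete, the following minimal Python sketch illustrates one way detection, summarization, and scoring could be wired together over an interleaved sequence. It is an assumption-laden illustration, not the paper's implementation: `call_lmm`, `InterleavedItem`, and the two aspect names shown are hypothetical placeholders, and a real system would substitute an actual LMM client and the benchmark's full set of ten aspects.

```python
from dataclasses import dataclass


@dataclass
class InterleavedItem:
    """One text segment paired with its generated image (hypothetical type)."""
    text: str
    image_path: str


def call_lmm(prompt: str, image_paths: list[str]) -> str:
    """Stub standing in for a large multimodal model call (assumption).

    Replace with a real LMM client; this stand-in only echoes the prompt
    so the sketch runs end to end.
    """
    return f"[LMM response to: {prompt[:60]}...]"


def detect(item: InterleavedItem) -> str:
    # Stage 1 (detection): ask the LMM to describe the image and flag
    # mismatches with the paired text segment.
    prompt = (
        "Describe the salient entities, style, and any text-image mismatch "
        f"in this image, given the paired text: {item.text!r}"
    )
    return call_lmm(prompt, [item.image_path])


def summarize(detections: list[str]) -> str:
    # Stage 2 (summarization): condense per-item findings into a
    # sequence-level summary of cross-item consistency.
    prompt = "Summarize cross-item consistency issues:\n" + "\n".join(detections)
    return call_lmm(prompt, [])


def score(summary: str, aspects: list[str]) -> dict[str, str]:
    # Stage 3 (scoring): rate each evaluation aspect from the summary,
    # then the per-aspect ratings feed the final rating.
    return {
        aspect: call_lmm(f"Rate (1-5) the aspect '{aspect}' given: {summary}", [])
        for aspect in aspects
    }


if __name__ == "__main__":
    sequence = [
        InterleavedItem("A knight sets out at dawn.", "step1.png"),
        InterleavedItem("He crosses a misty bridge.", "step2.png"),
    ]
    detections = [detect(item) for item in sequence]
    summary = summarize(detections)
    # Two illustrative aspects only; the benchmark evaluates ten.
    ratings = score(summary, ["entity consistency", "style consistency"])
    print(ratings)
```

Decomposing the evaluation this way lets each LMM call handle a narrower reasoning step, which is the stated motivation for the detection-summarization-scoring design.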