MMWebGen: Benchmarking Multimodal Webpage Generation

18 Sept 2025 (modified: 14 Nov 2025) | ICLR 2026 Conference Withdrawn Submission | CC BY 4.0
Keywords: webpage generation, multimodal model
TL;DR: MMWebGen-Bench is a benchmark for evaluating multimodal webpage generation. Our evaluation finds that agent-based methods follow webpage instructions better, while MLLMs handle image consistency better.
Abstract: Multimodal generative models have advanced text-to-image generation and image editing. Recent unified models (UMs) can even craft interleaved images and text. However, the capacity of such models to support more complex, production-level applications remains underexplored. Multimodal webpage generation stands out as a representative, high-value, yet challenging instance—it requires the generation of consistent visual content and renderable HTML code. To this end, this paper introduces MMWebGen to systematically benchmark the multimodal webpage generation capacities of existing models. In particular, MMWebGen focuses on the product showcase scenario, which imposes stringent demands on visual content quality and webpage layout. MMWebGen includes 130 test queries across 13 product categories; each query consists of a source image, a visual content instruction, and a webpage instruction. The task is to generate a product showcase webpage including multiple consistent images in accordance with the source image and instructions. Given the mixed-modality input-output nature of the task, we consider two workflows for evaluation—one uses large language models (LLMs) and image editing models to separately generate HTML code and images (editing-based), while the other relies on UMs for co-generation (UM-based). Empirical results show that editing-based approaches achieve leading results in webpage instruction following and content appeal, while UM-based ones may display more advantages in fulfilling visual content instructions. We also construct a supervised finetuning (SFT) dataset, MMWebGen-1k, with 1,000 groups of real product images and LLM-generated HTML code. We verify its effectiveness on the open-source UM BAGEL. The benchmark and dataset will be publicly available.
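To make the task setup concrete, the sketch below shows one possible way to represent a benchmark query (source image, visual content instruction, webpage instruction) and the two evaluation workflows described in the abstract. This is a minimal illustrative sketch, not the authors' released code; all names (BenchmarkQuery, run_editing_based, run_um_based, and the model callables) are hypothetical.

```python
# Illustrative sketch only: the benchmark's actual data format and APIs may differ.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class BenchmarkQuery:
    """One of the 130 test queries: a source image plus two instructions."""
    source_image_path: str           # product photo that generated images must stay consistent with
    visual_content_instruction: str  # what the derived product images should depict
    webpage_instruction: str         # layout/content requirements for the HTML page


def run_editing_based(
    query: BenchmarkQuery,
    llm_generate_html: Callable[[str], str],
    edit_image: Callable[[str, str], bytes],
    num_images: int = 4,
) -> Tuple[str, List[bytes]]:
    """Editing-based workflow: an LLM writes the HTML while a separate
    image-editing model derives each product image from the source image."""
    html = llm_generate_html(query.webpage_instruction)
    images = [
        edit_image(query.source_image_path, query.visual_content_instruction)
        for _ in range(num_images)
    ]
    return html, images


def run_um_based(
    query: BenchmarkQuery,
    unified_model: Callable[[str, str, str], Tuple[str, List[bytes]]],
) -> Tuple[str, List[bytes]]:
    """UM-based workflow: a unified model co-generates interleaved HTML and
    images from the source image and both instructions in a single pass."""
    return unified_model(
        query.source_image_path,
        query.visual_content_instruction,
        query.webpage_instruction,
    )
```

The split mirrors the abstract's two evaluation settings: the editing-based pipeline decouples code and image generation across specialist models, whereas the UM-based pipeline relies on a single unified model for co-generation.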
Primary Area: datasets and benchmarks
Submission Number: 10437