Keywords: webpage generation, multimodal model
TL;DR: MMWebGen-Bench is a benchmark for evaluating multimodal webpage generation. Our evaluation finds that agent-based methods follow webpage instructions better, while MLLMs handle image consistency better.
Abstract: Multimodal generative models have advanced text-to-image generation and image editing. Recent unified models (UMs) can even craft interleaved images and text. However, the capacity of such models to support more complex, production-level applications remains underexplored. Multimodal webpage generation stands out as a representative, high-value, yet challenging instance: it requires the generation of consistent visual content and renderable HTML code. To this end, this paper introduces MMWebGen to systematically benchmark the multimodal webpage generation capabilities of existing models. In particular, MMWebGen focuses on the product showcase scenario, which imposes stringent demands on visual content quality and webpage layout. MMWebGen includes 130 test queries across 13 product categories; each query consists of a source image, a visual content instruction, and a webpage instruction. The task is to generate a product showcase webpage containing multiple consistent images in accordance with the source image and instructions. Given the mixed-modality input-output nature of the task, we consider two workflows for evaluation: one uses large language models (LLMs) and image editing models to separately generate HTML code and images (editing-based), while the other relies on UMs for co-generation (UM-based). Empirical results show that editing-based approaches achieve leading results in webpage instruction following and content appeal, while UM-based ones may show advantages in fulfilling visual content instructions. We also construct a supervised finetuning (SFT) dataset, MMWebGen-1k, with 1,000 groups of real product images and LLM-generated HTML code. We verify its effectiveness on the open-source UM BAGEL. The benchmark and dataset will be publicly available.
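For concreteness, the following is a minimal sketch (not taken from the paper) of how a MMWebGen test query and the two evaluation workflows described in the abstract could be represented; all field names, function names, and stub outputs are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class WebGenQuery:
    """Hypothetical schema for one MMWebGen test query (names are illustrative)."""
    product_category: str      # one of the 13 product categories
    source_image_path: str     # path to the source product image
    visual_instruction: str    # instruction for the derived, consistent images
    webpage_instruction: str   # instruction for the webpage layout and content

@dataclass
class WebGenOutput:
    """Generated artifacts: renderable HTML plus the produced product images."""
    html: str
    image_paths: List[str]

def editing_based_workflow(query: WebGenQuery) -> WebGenOutput:
    """Editing-based workflow sketch: an image-editing model edits the source
    image and an LLM writes the HTML separately. Both model calls are stubbed."""
    image_paths = [f"edited_{i}.png" for i in range(3)]                # stub: image-editing model
    html = f"<html><body><h1>{query.product_category}</h1></body></html>"  # stub: LLM-generated HTML
    return WebGenOutput(html=html, image_paths=image_paths)

def um_based_workflow(query: WebGenQuery) -> WebGenOutput:
    """UM-based workflow sketch: a unified model co-generates interleaved
    HTML and images in a single pass. The model call is stubbed."""
    html = "<html><body><p>co-generated page</p></body></html>"       # stub: UM output
    return WebGenOutput(html=html, image_paths=["um_gen_0.png"])

if __name__ == "__main__":
    query = WebGenQuery(
        product_category="headphones",
        source_image_path="headphones.jpg",
        visual_instruction="Show the product on a wooden desk under warm light.",
        webpage_instruction="Create a three-section showcase page with a hero banner.",
    )
    print(editing_based_workflow(query).html)
```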
Primary Area: datasets and benchmarks
Submission Number: 10437