Keywords: dataset; unified model; knowledge-based image editing
TL;DR: This paper studies Visual Question–Visual Answering (VQ-VA) and proposes BAGEL-World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction.
Abstract: This paper studies Visual Question–Visual Answering (VQ-VA): generating an image, rather than text, in response to a user’s visual question---an ability that has recently emerged in proprietary systems such as NanoBanana and GPT-Image.
To also bring this capability to open-source models, we introduce BAGEL-World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction.
Leveraging web-scale deployment, this pipeline crawls $\sim$1.8M high-quality, interleaved image–text samples for model training. For evaluation, we further release IntelligentBench, a human-curated benchmark that systematically assesses VQ-VA along the aspects of world knowledge, design knowledge, and reasoning.
Training with BAGEL-World yields strong empirical gains: it helps LightBAGEL attain 45.0 on IntelligentBench, substantially surpassing the best prior open-source baselines (\emph{i.e.}, 6.81@LightBAGEL, 1.94@UniWorld-V1) and significantly narrowing the gap to leading proprietary systems (\emph{e.g.}, 81.67@NanoBanana, 82.64@GPT-Image). By releasing the full suite of model weights, datasets, and pipelines, we hope to facilitate future research on VQ-VA.
Supplementary Material: pdf
Primary Area: datasets and benchmarks
Submission Number: 8802