BAGEL-World: Towards High-Quality Visual Question–Visual Answering

17 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: dataset; unified model; knowledge-based image editing
TL;DR: This paper studies Visual Question–Visual Answering (VQ-VA) and proposes BAGEL-World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction.
Abstract: This paper studies Visual Question–Visual Answering (VQ-VA): generating an image, rather than text, in response to a user’s visual question---an ability that has recently emerged in proprietary systems such as NanoBanana and GPT-Image. To bring this capability to open-source models as well, we introduce BAGEL-World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction. Leveraging web-scale deployment, this pipeline crawls $\sim$1.8M high-quality, interleaved image–text samples for model training. For evaluation, we further release IntelligentBench, a human-curated benchmark that systematically assesses VQ-VA along the aspects of world knowledge, design knowledge, and reasoning. Training with BAGEL-World yields strong empirical gains: it lifts LightBAGEL to 45.0 on IntelligentBench, substantially surpassing the best prior open-source baselines (\emph{i.e.}, 6.81@LightBAGEL, 1.94@UniWorld-V1) and significantly narrowing the gap to leading proprietary systems (\emph{e.g.}, 81.67@NanoBanana, 82.64@GPT-Image). By releasing the full suite of model weights, datasets, and pipelines, we hope to facilitate future research on VQ-VA.
Supplementary Material: pdf
Primary Area: datasets and benchmarks
Submission Number: 8802