Keywords: dataset; unified model; knowledge-based image editing
TL;DR: This paper studies Visual Question–Visual Answering (VQ-VA) and proposes BAGEL-World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction.
Abstract: This paper studies Visual Question–Visual Answering (VQ-VA): generating an image, rather than text, in response to a user’s visual question---an ability that has recently emerged in proprietary systems such as NanoBanana and GPT-Image.
To also bring this capability to open-source models, we introduce BAGEL-World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction.
Leveraging web-scale deployment, this pipeline crawls $\sim$1.8M high-quality, interleaved image–text samples for model training. For evaluation, we further release IntelligentBench, a human-curated benchmark that systematically assesses VQ-VA along the aspects of world knowledge, design knowledge, and reasoning.
Training with BAGEL-World yields strong empirical gains: it helps LightBAGEL attain 45.0 on IntelligentBench, substantially surpassing the best prior open-source baselines (\emph{i.e.}, 6.81@LightBAGEL, 1.94@UniWorld-V1) and significantly narrowing the gap to leading proprietary systems (\emph{e.g.}, 81.67@NanoBanana, 82.64@GPT-Image). By releasing the full suite of model weights, datasets, and pipelines, we hope to facilitate future research on VQ-VA.
Supplementary Material: pdf
Primary Area: datasets and benchmarks
Submission Number: 8802