Keywords: Region Captioning, Vision Language Models
TL;DR: Unique region captioning for any region in an image
Abstract: Region captioning models often struggle to generate descriptions unique to a specific area of interest, instead producing generic labels that could equally apply to other regions within the same image. This ambiguity limits their effectiveness in downstream applications and prevents them from capturing the fine-grained details that distinguish objects. To address this, we introduce the Unique Region Caption Anything (URECA) dataset, a large-scale benchmark designed to enforce caption uniqueness for multi-granularity regions. The URECA dataset is constructed with a novel four-stage automated data pipeline that establishes a one-to-one mapping between each region and its descriptive caption, ensuring that every description uniquely identifies its target. We also propose the URECA model, an architecture built on two innovations for generating unique region captions: a decoupled processing strategy that preserves global context by separating region and image inputs, and dynamic mask modeling that captures fine-grained details regardless of the input image scale.
Code and weights will be publicly released.
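Since the abstract only sketches the architecture, here is a minimal PyTorch illustration of the two stated ideas: a decoupled image/region encoding path and a multi-scale treatment of the region mask. All module names, the fusion via cross-attention, and the specific multi-scale pooling are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledRegionEncoder(nn.Module):
    """Sketch of the decoupled strategy: the full image and the region mask
    are embedded by separate branches, so global context is never cropped away.
    The multi-scale mask loop below is a guessed stand-in for the paper's
    dynamic mask modeling."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Global branch: sees the whole image, unaware of the region.
        self.image_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # Region branch: embeds the binary mask independently of the image.
        self.mask_encoder = nn.Conv2d(1, dim, kernel_size=16, stride=16)
        self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, image: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W); mask: (B, 1, H, W), values in {0, 1}
        img_tokens = self.image_encoder(image).flatten(2).transpose(1, 2)
        # Assumed form of dynamic mask modeling: re-sample the mask at
        # several scales so fine regions survive at any input resolution.
        mask_tokens = []
        for scale in (1.0, 0.5, 0.25):
            m = F.interpolate(mask, scale_factor=scale, mode="nearest")
            m = F.interpolate(m, size=image.shape[-2:], mode="nearest")
            mask_tokens.append(self.mask_encoder(m).flatten(2).transpose(1, 2))
        mask_tokens = torch.cat(mask_tokens, dim=1)
        # Region tokens cross-attend over the untouched global image tokens,
        # preserving whole-image context for a downstream caption decoder.
        fused, _ = self.fuse(mask_tokens, img_tokens, img_tokens)
        return fused
```

For example, `DecoupledRegionEncoder()(torch.rand(1, 3, 224, 224), torch.zeros(1, 1, 224, 224))` yields region-conditioned tokens that a language decoder could consume; the actual URECA model may fuse branches differently.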
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4309