URECA: Unique Region Caption Anything

ACL ARR 2026 January Submission202 Authors

22 Dec 2025 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: Vision Language Model

Abstract: Existing region captioning models often fail to generate descriptions that uniquely identify specific regions of interest, instead producing generic labels that could also apply to other regions within the same image. This ambiguity limits their effectiveness in downstream applications and prevents them from capturing the fine-grained details that distinguish objects. To address this, we introduce the Unique Region Caption Anything (URECA) dataset, a new large-scale benchmark designed to enforce caption uniqueness for multi-granularity regions. The URECA dataset is constructed using a novel four-stage automated data pipeline that establishes a one-to-one mapping between a region and a descriptive caption, ensuring that each description uniquely identifies its target. We also propose the URECA model, an architecture built on two innovations for generating unique region captions: a decoupled processing strategy that preserves global context by separating region and image inputs, and dynamic mask modeling that captures fine-grained details regardless of input image scale. Code and weights will be publicly released.
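
The one-to-one mapping described in the abstract can be illustrated with a minimal sketch. This is not the authors' four-stage pipeline; it only checks, under the simplifying assumption that two captions are ambiguous when they match after case and whitespace normalization, whether every region in one image has a caption no other region shares. The function names and the region ids are hypothetical.

```python
from collections import defaultdict


def find_ambiguous_captions(region_captions):
    """Return captions shared by more than one region of a single image.

    region_captions: dict mapping a (hypothetical) region id to its caption.
    The result maps each normalized, shared caption to the region ids using it.
    """
    regions_by_caption = defaultdict(list)
    for region_id, caption in region_captions.items():
        # Normalize case and whitespace so trivial variants count as duplicates.
        # Assumption: exact match after normalization stands in for ambiguity.
        key = " ".join(caption.lower().split())
        regions_by_caption[key].append(region_id)
    return {c: ids for c, ids in regions_by_caption.items() if len(ids) > 1}


def is_one_to_one(region_captions):
    """True when no two regions share a caption, i.e. each caption is unique."""
    return not find_ambiguous_captions(region_captions)
```

A generic label such as "a dog" assigned to two regions would be reported as ambiguous, whereas captions that add distinguishing detail ("the dog lying on the left sofa cushion") would pass the check.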

Paper Type: Long

Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond

Research Area Keywords: cross-modal application, cross-modal information extraction

Contribution Types: Publicly available software and/or pre-trained models, Data resources

Languages Studied: English

Submission Number: 202