Keywords: Region Captioning, Vision Language Models
TL;DR: Unique region captioning for any region in an image
Abstract: Region captioning models often struggle to generate descriptions unique to a specific area of interest, instead producing generic labels that could equally apply to other regions within the same image. This ambiguity limits their effectiveness in downstream applications and prevents them from capturing the fine-grained details that distinguish objects. To address this, we introduce the Unique Region Caption Anything (URECA) dataset, a large-scale benchmark designed to enforce caption uniqueness for multi-granularity regions. The URECA dataset is constructed with a novel four-stage automated data pipeline that establishes a one-to-one mapping between each region and its descriptive caption, ensuring that every description uniquely identifies its target. We also propose the URECA model, an architecture built on two innovations for generating unique region captions: a decoupled processing strategy that preserves global context by separating region and image inputs, and dynamic mask modeling that captures fine-grained details regardless of the input image scale.
Code and weights will be publicly released.
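Since the abstract only sketches the architecture, here is a minimal PyTorch illustration of the two stated ideas: a decoupled image/region encoding path and a multi-scale treatment of the region mask. All module names, the fusion via cross-attention, and the specific multi-scale pooling are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledRegionEncoder(nn.Module):
    """Sketch of the decoupled strategy: the full image and the region mask
    are embedded by separate branches, so global context is never cropped away.
    The multi-scale mask loop below is a guessed stand-in for the paper's
    dynamic mask modeling."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Global branch: sees the whole image, unaware of the region.
        self.image_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # Region branch: embeds the binary mask independently of the image.
        self.mask_encoder = nn.Conv2d(1, dim, kernel_size=16, stride=16)
        self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, image: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W); mask: (B, 1, H, W), values in {0, 1}
        img_tokens = self.image_encoder(image).flatten(2).transpose(1, 2)
        # Assumed form of dynamic mask modeling: re-sample the mask at
        # several scales so fine regions survive at any input resolution.
        mask_tokens = []
        for scale in (1.0, 0.5, 0.25):
            m = F.interpolate(mask, scale_factor=scale, mode="nearest")
            m = F.interpolate(m, size=image.shape[-2:], mode="nearest")
            mask_tokens.append(self.mask_encoder(m).flatten(2).transpose(1, 2))
        mask_tokens = torch.cat(mask_tokens, dim=1)
        # Region tokens cross-attend over the untouched global image tokens,
        # preserving whole-image context for a downstream caption decoder.
        fused, _ = self.fuse(mask_tokens, img_tokens, img_tokens)
        return fused
```

For example, `DecoupledRegionEncoder()(torch.rand(1, 3, 224, 224), torch.zeros(1, 1, 224, 224))` yields region-conditioned tokens that a language decoder could consume; the actual URECA model may fuse branches differently.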
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4309