GRiT: A Generative Region-to-Text Transformer for Object Understanding

Published: 01 Jan 2024 · Last Modified: 13 Nov 2024 · ECCV 2024 · CC BY-SA 4.0
Abstract: This paper presents a Generative RegIon-to-Text transformer, GRiT, for object understanding. The spirit of GRiT is to formulate object understanding as <region, text> pairs, where the region localizes an object and the text describes it. Specifically, GRiT consists of a visual encoder to extract image features, a foreground object extractor to localize objects, and a text decoder to generate natural-language descriptions for objects. With the same model architecture, GRiT can describe objects not only with simple nouns but also with rich descriptive sentences. We characterize GRiT as open-set object understanding, since the model architecture places no limit on the object descriptions it can output. Experimentally, we apply GRiT to dense captioning and object detection tasks. GRiT achieves superior dense captioning performance (15.5 mAP on Visual Genome) and competitive detection accuracy (60.4 AP on COCO test-dev). Code is available at https://github.com/JialianW/GRiT.
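To make the three-stage design concrete (visual encoder → foreground object extractor → text decoder, emitting a box and a caption per region), here is a minimal PyTorch sketch of such a region-to-text pipeline. All module choices, shapes, and names below are illustrative assumptions, not GRiT's actual implementation; see the linked repository for the authors' code.

```python
import torch
import torch.nn as nn

class RegionToTextSketch(nn.Module):
    """Illustrative <region, text> pipeline in the spirit of GRiT.
    Every component here is a simplified stand-in, not GRiT's real code."""

    def __init__(self, vocab_size=30522, d_model=256, num_queries=100):
        super().__init__()
        # Visual encoder: a patchify convolution as a stand-in for the image backbone.
        self.visual_encoder = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # Foreground object extractor: learned queries attend to image features
        # and yield per-region features, boxes, and objectness scores.
        self.region_queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.region_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.box_head = nn.Linear(d_model, 4)    # (cx, cy, w, h) per region
        self.score_head = nn.Linear(d_model, 1)  # foreground score per region
        # Text decoder: autoregressive decoder conditioned on region features.
        self.token_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.text_decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, text_tokens):
        # images: (B, 3, H, W); text_tokens: (B, Q, T) teacher-forcing inputs per region.
        feats = self.visual_encoder(images).flatten(2).transpose(1, 2)   # (B, N, D)
        queries = self.region_queries.unsqueeze(0).expand(images.size(0), -1, -1)
        region_feats, _ = self.region_attn(queries, feats, feats)        # (B, Q, D)
        boxes = self.box_head(region_feats).sigmoid()                    # region locates objects
        scores = self.score_head(region_feats)                           # foreground scores
        B, Q, T = text_tokens.shape
        tok = self.token_embed(text_tokens.reshape(B * Q, T))
        mem = region_feats.reshape(B * Q, 1, -1)
        dec = self.text_decoder(tok, mem)                                # text describes objects
        logits = self.lm_head(dec).reshape(B, Q, T, -1)                  # per-region token logits
        return boxes, scores, logits

model = RegionToTextSketch()
imgs = torch.randn(2, 3, 224, 224)
toks = torch.randint(0, 30522, (2, 100, 8))
boxes, scores, logits = model(imgs, toks)
print(boxes.shape, scores.shape, logits.shape)
```

Because the text decoder is a generic generator rather than a fixed classifier head, the same architecture can emit a single noun (detection) or a full sentence (dense captioning), which is what makes the output open-set.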