GRiT: A Generative Region-to-Text Transformer for Object Understanding

Published: 01 Jan 2024 · Last Modified: 13 Nov 2024 · ECCV 2024 · CC BY-SA 4.0
Abstract: This paper presents a Generative RegIon-to-Text transformer, GRiT, for object understanding. The spirit of GRiT is to formulate object understanding as <region, text> pairs, where the region localizes an object and the text describes it. Specifically, GRiT consists of a visual encoder to extract image features, a foreground object extractor to localize objects, and a text decoder to generate natural-language descriptions for objects. With the same model architecture, GRiT can describe objects not only with simple nouns but also with rich descriptive sentences. We characterize GRiT as open-set object understanding, since the model architecture places no limit on the object descriptions it can output. Experimentally, we apply GRiT to dense captioning and object detection tasks. GRiT achieves superior dense captioning performance (15.5 mAP on Visual Genome) and competitive detection accuracy (60.4 AP on COCO test-dev). Code is available at https://github.com/JialianW/GRiT.
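To make the three-stage design concrete (visual encoder → foreground object extractor → text decoder, emitting a box and a caption per region), here is a minimal PyTorch sketch of such a region-to-text pipeline. All module choices, shapes, and names below are illustrative assumptions, not GRiT's actual implementation; see the linked repository for the authors' code.

```python
import torch
import torch.nn as nn

class RegionToTextSketch(nn.Module):
    """Illustrative <region, text> pipeline in the spirit of GRiT.
    Every component here is a simplified stand-in, not GRiT's real code."""

    def __init__(self, vocab_size=30522, d_model=256, num_queries=100):
        super().__init__()
        # Visual encoder: a patchify convolution as a stand-in for the image backbone.
        self.visual_encoder = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # Foreground object extractor: learned queries attend to image features
        # and yield per-region features, boxes, and objectness scores.
        self.region_queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.region_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.box_head = nn.Linear(d_model, 4)    # (cx, cy, w, h) per region
        self.score_head = nn.Linear(d_model, 1)  # foreground score per region
        # Text decoder: autoregressive decoder conditioned on region features.
        self.token_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.text_decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, text_tokens):
        # images: (B, 3, H, W); text_tokens: (B, Q, T) teacher-forcing inputs per region.
        feats = self.visual_encoder(images).flatten(2).transpose(1, 2)   # (B, N, D)
        queries = self.region_queries.unsqueeze(0).expand(images.size(0), -1, -1)
        region_feats, _ = self.region_attn(queries, feats, feats)        # (B, Q, D)
        boxes = self.box_head(region_feats).sigmoid()                    # region locates objects
        scores = self.score_head(region_feats)                           # foreground scores
        B, Q, T = text_tokens.shape
        tok = self.token_embed(text_tokens.reshape(B * Q, T))
        mem = region_feats.reshape(B * Q, 1, -1)
        dec = self.text_decoder(tok, mem)                                # text describes objects
        logits = self.lm_head(dec).reshape(B, Q, T, -1)                  # per-region token logits
        return boxes, scores, logits

model = RegionToTextSketch()
imgs = torch.randn(2, 3, 224, 224)
toks = torch.randint(0, 30522, (2, 100, 8))
boxes, scores, logits = model(imgs, toks)
print(boxes.shape, scores.shape, logits.shape)
```

Because the text decoder is a generic generator rather than a fixed classifier head, the same architecture can emit a single noun (detection) or a full sentence (dense captioning), which is what makes the output open-set.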