RelFormer: Advancing contextual relations for transformer-based dense captioning

Published: 01 Jan 2025, Last Modified: 11 Oct 2025
Venue: Comput. Vis. Image Underst. 2025
License: CC BY-SA 4.0
Abstract: Highlights
• A novel transformer-based dense captioning framework, RelFormer.
• Encoding the relations among stuff and objects in the image.
• Using CLIP to extract the multi-modal information of stuff.