Abstract: Highlights
• A novel transformer-based dense captioning framework, RelFormer.
• Encoding the relations among stuff and objects in the image.
• Using CLIP to extract the multi-modal information of stuff.
External IDs: dblp:journals/cviu/JinQSZW25