Abstract

Highlights
• An end-to-end RVOS framework named Fully Transformer-Equipped Architecture (FTEA) is developed entirely upon transformers.
• The stacked attention module captures object-level spatial context, and the stacked Feed-Forward Network reduces the model parameters.
• A diversity loss is imposed on the candidate object kernels to diversify the candidate object masks (a hedged sketch follows this list).
• Experimental results on A2D Sentences, J-HMDB Sentences, and Ref-YouTube-VOS verify the advantages of the proposed approach.
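The highlights do not state the exact form of the diversity loss. One plausible reading is a pairwise cosine-similarity penalty over the candidate object kernels, which pushes the kernels apart so that they yield distinct candidate masks. The sketch below is an assumption for illustration only, not the paper's definition; the function name `diversity_loss` and the kernel shape (K candidates, C channels) are hypothetical.

```python
import torch
import torch.nn.functional as F


def diversity_loss(kernels: torch.Tensor) -> torch.Tensor:
    """Hypothetical diversity penalty over candidate object kernels.

    kernels: (K, C) tensor holding K candidate kernels of C channels each
    (illustrative shape, not the paper's notation).
    Returns the mean absolute off-diagonal cosine similarity, so lower
    values mean more mutually dissimilar kernels.
    """
    k = F.normalize(kernels, dim=-1)                   # unit-normalize each kernel
    sim = k @ k.t()                                    # (K, K) cosine similarities
    num = kernels.shape[0]
    eye = torch.eye(num, device=sim.device, dtype=torch.bool)
    off_diag = sim.masked_fill(eye, 0.0)               # ignore self-similarity
    return off_diag.abs().sum() / (num * (num - 1))


# Usage sketch: penalize similarity among 8 candidate kernels of width 256.
loss = diversity_loss(torch.randn(8, 256))
```

Under this assumed form, the term would simply be added to the segmentation objective with a weighting coefficient, so that minimizing the total loss trades mask accuracy against kernel diversity.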