Abstract

Highlights
• An end-to-end RVOS framework named Fully Transformer-Equipped Architecture (FTEA) is developed entirely upon transformers.
• The stacked attention module captures object-level spatial context, and the stacked Feed-Forward Network reduces the model parameters.
• A diversity loss is imposed on the candidate object kernels to diversify the candidate object masks (a hedged sketch follows this list).
• Experimental results on A2D Sentences, J-HMDB Sentences, and Ref-YouTube-VOS verify the advantages of the proposed approach.
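The highlights do not state the exact form of the diversity loss. One plausible reading is a pairwise cosine-similarity penalty over the candidate object kernels, which pushes the kernels apart so that they yield distinct candidate masks. The sketch below is an assumption for illustration only, not the paper's definition; the function name `diversity_loss` and the kernel shape (K candidates, C channels) are hypothetical.

```python
import torch
import torch.nn.functional as F


def diversity_loss(kernels: torch.Tensor) -> torch.Tensor:
    """Hypothetical diversity penalty over candidate object kernels.

    kernels: (K, C) tensor holding K candidate kernels of C channels each
    (illustrative shape, not the paper's notation).
    Returns the mean absolute off-diagonal cosine similarity, so lower
    values mean more mutually dissimilar kernels.
    """
    k = F.normalize(kernels, dim=-1)                   # unit-normalize each kernel
    sim = k @ k.t()                                    # (K, K) cosine similarities
    num = kernels.shape[0]
    eye = torch.eye(num, device=sim.device, dtype=torch.bool)
    off_diag = sim.masked_fill(eye, 0.0)               # ignore self-similarity
    return off_diag.abs().sum() / (num * (num - 1))


# Usage sketch: penalize similarity among 8 candidate kernels of width 256.
loss = diversity_loss(torch.randn(8, 256))
```

Under this assumed form, the term would simply be added to the segmentation objective with a weighting coefficient, so that minimizing the total loss trades mask accuracy against kernel diversity.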