Abstract: Highlights
• We propose an asymmetric encoder-decoder visual transformer network, called TSVT.
• TSVT adaptively sparsifies tokens to explore global context effectively.
• An IDFM is designed to fuse the differences and consistency of multi-modality tokens.
• TSVT achieves more robust and effective saliency detection performance.