Abstract: Highlights
• We design a Siamese transformer to jointly encode paired video frames.
• We propose a mixture-attention module to mine inter- and intra-frame relationships.
• Our MAST strengthens spatio-temporal learning and improves performance.
External IDs: dblp:journals/artmed/0001YPJXPC025