Abstract: Highlights
• A novel, efficient pose-guided multimodal network is proposed for action recognition.
• The eXpand temporal Shift model is introduced to rival 3D CNNs (X3D) with fewer GFLOPs.
• A pose attention block is proposed to guide the RGB stream to keyframes and key body regions.
• Our multimodal net rivals SoTA on 4 datasets, reducing FLOPs and parameters by 72.8x and 48.6x, respectively.
External IDs: dblp:journals/ijon/AbdelkawyAF25