Sparse but Sharp: Saliency-guided Black-box Attacks Reveal Vulnerabilities in Skeleton-based Human Action Recognition

Jiawei Chu; Xinyu Hong; Youwei Zhou; Tianyang Xu; Xiaojun Wu; Josef Kittler

20 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: Skeleton-based Action Recognition, Black-box Adversarial Attack, Spatiotemporal Sparsity, Saliency-guided Perturbation

TL;DR: We propose a saliency-guided spatio-temporal sparse black-box attack framework that efficiently fools skeleton-based action recognition models under low perturbation, revealing model robustness differences under sparse conditions.

Abstract: Graph Convolutional Networks and Transformers have achieved remarkable performance in skeleton-based human action recognition. However, recent studies reveal their vulnerability to adversarial attacks. We focus on black-box attacks, which hold greater practical relevance, and propose the first saliency-guided spatio-temporal sparse black-box attack framework for skeleton-based recognition. By estimating the contribution of joints and frame segments to recognition accuracy, our solution injects perturbations only into a localised set of joints and frames, thereby enhancing stealth. The framework is compatible with both confidence-based and label-only black-box settings, which gives it broad applicability in real-world scenarios. We conduct a comprehensive evaluation of the proposed attack on public large-scale datasets and compare its performance with SOTA algorithms. The results show that our attack achieves competitive or even superior effectiveness in most settings, while offering better imperceptibility and a favourable balance between query efficiency and attack performance. Importantly, our evaluation reveals significant disparities in robustness across existing action recognition models. Our solution presents a practical paradigm for efficient sparse attack strategies, providing novel insights into the structural robustness of skeleton-based recognition methods.
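To make the idea of saliency-guided sparsity concrete, the following is a minimal sketch and not the authors' implementation. It assumes the confidence-based setting, and it assumes saliency is estimated by occlusion: a joint or frame segment is zeroed out and the drop in the true-class confidence is recorded. The abstract does not specify the estimator, so this choice, along with every name here (`toy_black_box`, `occlusion_saliency`, `sparse_mask`, `perturb`) and the random bounded noise used as the perturbation, is hypothetical.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def toy_black_box(x):
    # Hypothetical stand-in for a recognizer that exposes only class
    # probabilities. Class 0 depends solely on joint 2 in frames 0..3.
    logits = np.array([x[0:4, 2, :].sum(), 0.0])
    return softmax(logits)


def occlusion_saliency(query, x, label, seg_len):
    # x has shape (T frames, J joints, 3 coordinates).
    # Saliency = drop in true-class confidence when a joint (over all
    # frames) or a frame segment (over all joints) is zeroed out.
    # Assumption: trailing frames beyond T // seg_len segments are ignored.
    T, J, _ = x.shape
    base = query(x)[label]
    joint_sal = np.zeros(J)
    for j in range(J):
        x_occ = x.copy()
        x_occ[:, j, :] = 0.0
        joint_sal[j] = base - query(x_occ)[label]
    n_seg = T // seg_len
    seg_sal = np.zeros(n_seg)
    for s in range(n_seg):
        x_occ = x.copy()
        x_occ[s * seg_len:(s + 1) * seg_len] = 0.0
        seg_sal[s] = base - query(x_occ)[label]
    return joint_sal, seg_sal


def sparse_mask(joint_sal, seg_sal, shape, seg_len, k_joints, k_segs):
    # Select the k most salient joints and segments; only their
    # intersection is allowed to be perturbed.
    mask = np.zeros(shape, dtype=bool)
    top_j = np.argsort(joint_sal)[-k_joints:]
    top_s = np.argsort(seg_sal)[-k_segs:]
    for s in top_s:
        mask[s * seg_len:(s + 1) * seg_len, top_j, :] = True
    return mask


def perturb(x, mask, eps, rng):
    # Placeholder perturbation: uniform noise in [-eps, eps], applied
    # only where the mask is True. A real attack would search this region.
    delta = rng.uniform(-eps, eps, size=x.shape) * mask
    return x + delta
```

In this sketch, estimating saliency costs one query per joint plus one per segment, and that query budget is then traded against how few coordinates the mask leaves open to perturbation. The label-only setting would need a different signal than the confidence drop and is not covered here.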

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 23568
