Keywords: privacy protection, action recognition, label-free, unsupervised, zero-shot
TL;DR: We introduce LaF-Privacy, a label-free privacy-preserving framework for training anonymizers that preserve action semantics and support zero-shot recognition with VLMs.
Abstract: Traditional action recognition relies on labeled data and closed-set assumptions, limiting adaptability to novel actions and environments. Vision-Language Models (VLMs) offer a more flexible alternative through text-image alignment, enabling zero-shot action recognition. However, using raw video data poses privacy risks due to sensitive visual content. Privacy-Preserving Action Recognition (PPAR) aims to anonymize videos while preserving action-relevant semantics. Existing learning-based PPAR approaches often require both action and privacy annotations, as well as retraining recognition models on anonymized data, limiting their flexibility and compatibility with powerful pretrained VLMs. We propose LaF-Privacy, a novel label-free privacy-preserving framework for zero-shot action recognition. Our method is trained without any manual annotations, using two complementary objectives: aligning high-level action-relevant features between raw and anonymized videos while suppressing their low-level appearance similarity. We adopt a video transformer encoder for spatio-temporal learning and introduce an Action-Aware Masking Module (AAMM) that discards action-irrelevant regions, further enhancing privacy. LaF-Privacy enables direct use of pretrained VLMs for zero-shot inference on anonymized videos. Experiments on VP-UCF101 and VP-HMDB51 demonstrate that our approach achieves state-of-the-art trade-offs between privacy protection and zero-shot recognition performance.
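The two complementary label-free objectives described in the abstract can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: the feature vectors, the cosine-similarity losses, and the weighting factor `lam` are all assumptions standing in for the unspecified high-level (action) and low-level (appearance) feature extractors and loss design.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def laf_privacy_loss(raw_hi, anon_hi, raw_lo, anon_lo, lam=1.0):
    """Hypothetical combined objective for the anonymizer.

    raw_hi / anon_hi: high-level action-relevant features of the raw
        and anonymized video (to be kept aligned).
    raw_lo / anon_lo: low-level appearance features (whose similarity
        is to be suppressed).
    """
    action_preserve = 1.0 - cosine(raw_hi, anon_hi)          # pull together
    privacy_suppress = max(0.0, cosine(raw_lo, anon_lo))     # push apart
    return action_preserve + lam * privacy_suppress

rng = np.random.default_rng(0)
f = rng.normal(size=64)
# Anonymizer leaks appearance: low loss on action term, high on privacy term.
loss_leaky = laf_privacy_loss(f, f, f, f)
# Ideal anonymizer: action features preserved, appearance decorrelated.
loss_ideal = laf_privacy_loss(f, f, f, -f)
```

Under this toy formulation, `loss_ideal` is near zero while `loss_leaky` is near one, reflecting the intended trade-off: the anonymized video stays useful for zero-shot recognition while shedding appearance cues.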
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16777