Keywords: zero-shot anomaly detection, industrial anomaly detection
TL;DR: This paper proposes TokenCLIP, a token-wise adaptation framework that enables dynamic visual-textual alignment for zero-shot anomaly detection.
Abstract: Adapting CLIP to detect anomalies on unseen objects in a zero-shot manner has shown strong potential. However, existing methods typically rely on a single textual space to align with visual semantics across diverse objects and domains. This indiscriminate alignment hinders the model from accurately capturing varied anomaly semantics. We propose TokenCLIP, a token-wise adaptation framework that enables dynamic alignment for fine-grained anomaly learning. Rather than mapping all visual tokens to a single, token-agnostic textual space, TokenCLIP aligns each token with a customized textual subspace that represents its visual characteristics. Explicitly assigning a unique learnable textual space to each token is computationally intractable and prone to insufficient optimization. We instead expand the token-agnostic textual space into a set of orthogonal subspaces, and then dynamically assign each token to a combination of subspaces guided by semantic affinity, which jointly supports customized and efficient token-wise adaptation. To this end, we formulate dynamic alignment as an optimal transport (OT) problem, where all visual tokens in an image are transported to textual subspaces under a cost matrix defined by cross-modal similarity. The marginal constraint and minimal-cost objective of OT ensure sufficient optimization across subspaces and encourage them to focus on different semantics. Solving the problem yields a transport plan that adaptively assigns each token to semantically relevant subspaces. Top-k masking is further applied to sparsify the plan and specialize subspaces for distinct visual regions. Extensive experiments demonstrate the superiority of TokenCLIP.
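The following is a minimal sketch of the token-to-subspace alignment described in the abstract, not the authors' implementation: entropic OT with uniform marginals (log-domain Sinkhorn) transports visual tokens to textual subspaces under a cost of one minus cosine similarity, and top-k masking then sparsifies the plan. All names, shapes, and hyperparameters (`eps`, `n_iters`, `k`, the 196x512 / 8x512 feature sizes) are illustrative assumptions.

```python
import math
import torch
import torch.nn.functional as F

def sinkhorn(cost, n_iters=50, eps=0.05):
    """Entropic OT with uniform marginals, solved by log-domain Sinkhorn iterations.
    cost: (N, M) cost matrix (visual tokens x textual subspaces).
    Returns a transport plan of shape (N, M)."""
    N, M = cost.shape
    log_mu = torch.full((N,), -math.log(N))   # uniform marginal over tokens
    log_nu = torch.full((M,), -math.log(M))   # uniform marginal over subspaces
    log_K = -cost / eps                       # Gibbs kernel in log space
    u = torch.zeros(N)
    v = torch.zeros(M)
    for _ in range(n_iters):
        u = log_mu - torch.logsumexp(log_K + v[None, :], dim=1)
        v = log_nu - torch.logsumexp(log_K + u[:, None], dim=0)
    return torch.exp(log_K + u[:, None] + v[None, :])

def token_wise_alignment(vis_tokens, text_subspaces, k=2):
    """
    vis_tokens:     (N, D) patch-level visual embeddings (e.g., from CLIP's image encoder)
    text_subspaces: (M, D) learnable textual subspace embeddings
    Returns (N,) per-token alignment scores using a sparsified transport plan.
    """
    v = F.normalize(vis_tokens, dim=-1)
    t = F.normalize(text_subspaces, dim=-1)
    sim = v @ t.T                             # (N, M) cross-modal similarity
    plan = sinkhorn(1.0 - sim)                # cost = 1 - similarity
    # Top-k masking: keep each token's k most relevant subspaces, then renormalize rows
    topk = plan.topk(k, dim=1).indices
    mask = torch.zeros_like(plan).scatter_(1, topk, 1.0)
    plan = plan * mask
    plan = plan / plan.sum(dim=1, keepdim=True).clamp_min(1e-8)
    # Each token is scored against its customized combination of subspaces
    return (plan * sim).sum(dim=1)

# Usage with random stand-ins for CLIP features (shapes are illustrative)
scores = token_wise_alignment(torch.randn(196, 512), torch.randn(8, 512), k=2)
```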
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 5815