Kernel Mean Embeddings of $\texttt{[CLS]}$ Tokens in ViTs

Published: 13 Nov 2025 · Last Modified: 24 Nov 2025
Venue: TAG-DS 2025 Poster
License: CC BY 4.0
Track: Full Paper (8 pages)
Keywords: Vision transformers, reproducing kernel Hilbert space, maximum mean discrepancy, transformer explainability
TL;DR: A proof-of-concept framework to probe the geometry of $\texttt{[CLS]}$ tokens in ViTs using kernel mean embeddings and MMD.
Abstract: We study the geometry of Vision Transformer (\texttt{ViT}) \texttt{[CLS]} representations across layers through the lens of reproducing kernel Hilbert spaces (RKHS). For each layer and class, we estimate class-conditional kernel mean embeddings (KMEs) and measure separability with maximum mean discrepancy (MMD), tuning the kernel and a per-layer PCA projection via a bootstrap-based signal-to-noise (SNR) objective. We further propose a layer-wise confidence signal by evaluating class mean embeddings along a query’s \texttt{[CLS]} trajectory. On ImageNet-1k subsets, this exploratory, proof-of-concept analysis indicates that the RKHS framework can capture meaningful geometric and semantic signals in \texttt{[CLS]} representations across ViT layers. We make no state-of-the-art claims; our contribution is a unified framework and a practical recipe for probing \texttt{[CLS]} geometry.
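
To make the quantities in the abstract concrete, here is a minimal NumPy sketch (not the authors' code; the RBF kernel, the `gamma` bandwidth, and all function names are illustrative assumptions) of the unbiased MMD$^2$ estimator between two classes' \texttt{[CLS]} features at one layer, and of the confidence signal obtained by evaluating an empirical class mean embedding at a query token. The per-layer PCA projection and the bootstrap SNR tuning described in the abstract are omitted.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian RBF kernel matrix: k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clamp tiny negatives

def mmd2_unbiased(X, Y, gamma=1.0):
    """Unbiased estimate of MMD^2 between samples X (m, d) and Y (n, d)."""
    m, n = len(X), len(Y)
    Kxx = rbf_kernel(X, X, gamma)
    Kyy = rbf_kernel(Y, Y, gamma)
    Kxy = rbf_kernel(X, Y, gamma)
    # Drop diagonal terms for the unbiased within-sample averages.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

def kme_score(query, X_class, gamma=1.0):
    """Evaluate the empirical class mean embedding at a query [CLS] token:
    <mu_c, k(query, .)> = (1/n) * sum_i k(query, x_i)."""
    return rbf_kernel(query[None, :], X_class, gamma).mean()

# Toy usage: two Gaussian "classes" standing in for [CLS] features at one layer.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(100, 16))   # class A features
Y = rng.normal(0.5, 1.0, size=(100, 16))   # class B features
q = rng.normal(0.5, 1.0, size=16)          # a query [CLS] token

print(mmd2_unbiased(X, Y, gamma=0.1))               # class separability at this layer
print(kme_score(q, X, 0.1), kme_score(q, Y, 0.1))   # per-class confidence for the query
```

Repeating the two `print` calls with the \texttt{[CLS]} features of each layer would yield the layer-wise separability curve and confidence trajectory the abstract describes; in the paper, `gamma` and the projection dimension are instead tuned per layer via the bootstrap SNR objective.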
Supplementary Material: zip
Submission Number: 8