On the Connection between the \texttt{CLS} Token and the Virtual Node: Are They Two Sides of the Same Coin?

14 Oct 2025 (modified: 21 Nov 2025) · Withdrawn by Authors · CC BY 4.0
Abstract: Transformers have emerged as a promising tool for learning long-range dependencies in language tasks. Recently, it has been shown that the self-attention module in the Transformer operates on the complete graph of tokens obtained from the input sequence. Specifically, in language and vision tasks, a \texttt{CLS} token is prepended to the token sequence to extract a representation of the entire sequence. On the other hand, a Virtual Node (VN) is added to a graph to mitigate oversquashing, an information bottleneck typically observed in Graph Neural Networks (GNNs). In this work, we observe that the \texttt{CLS} token and the VN are structurally identical in their respective graphs: both are connected to every other token or node, and both aggregate features from all remaining tokens or nodes in a similar way. However, whereas the embedding of the \texttt{CLS} token is used to classify the input sequence, the embedding of the VN is not fully exploited; instead, standard pooling is employed to extract the graph-level representation for classification. To bridge this gap, we use VN representations to solve classification tasks on standard graph datasets and observe surprisingly competitive performance.
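The sketch below illustrates, in plain PyTorch, the readout idea the abstract describes: append a virtual node connected to every node of the graph and classify the graph from that node's final embedding instead of a pooled representation. This is a minimal toy example, not the authors' architecture; the class name `VirtualNodeReadout`, the mean-aggregation message-passing layers, and the learnable initial VN feature are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class VirtualNodeReadout(nn.Module):
    """Toy GNN that appends a virtual node (connected to every node) and uses
    the virtual node's final embedding as the graph-level representation,
    in place of standard mean/sum pooling. Hypothetical sketch, not the paper's model."""

    def __init__(self, in_dim, hid_dim, num_classes, num_layers=2):
        super().__init__()
        dims = [in_dim] + [hid_dim] * num_layers
        self.layers = nn.ModuleList(
            nn.Linear(dims[i], dims[i + 1]) for i in range(num_layers)
        )
        self.vn_init = nn.Parameter(torch.zeros(in_dim))  # learnable initial VN feature
        self.classifier = nn.Linear(hid_dim, num_classes)

    def forward(self, x, adj):
        # x: (N, in_dim) node features; adj: (N, N) adjacency of the original graph.
        n = x.size(0)
        # Append the virtual node and connect it to every other node.
        x = torch.cat([x, self.vn_init.unsqueeze(0)], dim=0)        # (N+1, in_dim)
        adj_vn = torch.zeros(n + 1, n + 1, device=adj.device)
        adj_vn[:n, :n] = adj
        adj_vn[n, :n] = 1.0                                          # VN -> all nodes
        adj_vn[:n, n] = 1.0                                          # all nodes -> VN
        adj_vn = adj_vn + torch.eye(n + 1, device=adj.device)        # self-loops
        deg = adj_vn.sum(dim=1, keepdim=True)
        for layer in self.layers:
            x = torch.relu(layer(adj_vn @ x / deg))                  # mean aggregation
        # Read out the virtual node's embedding (last row) for classification.
        return self.classifier(x[n])

# Usage on a random toy graph with 5 nodes and 8-dimensional features.
x = torch.randn(5, 8)
adj = (torch.rand(5, 5) > 0.5).float()
adj = ((adj + adj.t()) > 0).float()        # symmetrize
model = VirtualNodeReadout(in_dim=8, hid_dim=16, num_classes=3)
logits = model(x, adj)                     # shape (3,): class logits from the VN embedding
```

Structurally this mirrors how a \texttt{CLS} token is handled in a Transformer: one extra element attends to (here, aggregates over) all others, and its final state serves as the sequence- or graph-level representation.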
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: N/A
Assigned Action Editor: ~Vicenç_Gómez1
Submission Number: 6210