Keywords: mechanistic interpretability, LLMs, attention heads, in-context learning, concept invariance
TL;DR: LLMs contain abstract concept representations, captured by Concept Vectors, but these differ from the Function Vectors that drive ICL performance.
Abstract: Do large language models (LLMs) represent concepts abstractly, i.e., independently of input format? We revisit Function Vectors (FVs), compact representations of in-context learning (ICL) tasks that causally drive task performance. Across multiple LLMs, we show that FVs are not fully invariant: FVs extracted from different input formats (e.g., open-ended vs. multiple-choice) are nearly orthogonal, even when both target the same concept. We introduce Concept Vectors (CVs), which provide more stable concept representations. Like FVs, CVs are composed of attention head outputs; unlike FVs, however, the head selection is optimized via Representational Similarity Analysis (RSA) to encode concepts consistently across input formats. While these heads emerge in similar layers to FV-related heads, the two sets are largely distinct, suggesting different underlying mechanisms. Steering experiments reveal that FVs excel in-distribution, when extraction and application formats match (e.g., both open-ended in English), whereas CVs generalize better out-of-distribution, across both question types (open-ended vs. multiple-choice) and languages. Our results show that LLMs do contain abstract concept representations, but these differ from the representations that drive ICL performance.
Primary Area: interpretability and explainable AI
Submission Number: 20082
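The abstract describes CVs as built from attention head outputs, with heads selected via RSA so that concepts are encoded consistently across input formats. The snippet below is a minimal sketch of that selection idea only, under assumptions not taken from the paper: hypothetical cached head activations (acts_open, acts_mc), a simple Spearman-based RSA score per head, and a top-k averaging rule for forming concept vectors. The authors' actual extraction, scoring, and steering procedures may differ.

# Minimal sketch of RSA-based head selection for Concept Vectors (CVs).
# All names, shapes, and the aggregation rule are illustrative assumptions,
# not the paper's implementation.
import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n_concepts, n_heads, d_head = 20, 32, 64

# Placeholder activations: per-head outputs for each concept under two
# input formats (e.g., open-ended vs. multiple-choice). In practice these
# would be cached from forward passes over format-specific prompts.
acts_open = rng.normal(size=(n_heads, n_concepts, d_head))
acts_mc = rng.normal(size=(n_heads, n_concepts, d_head))

def rdm(head_acts):
    # Condensed representational dissimilarity matrix over concepts
    # (correlation distance between concept activation vectors).
    return pdist(head_acts, metric="correlation")

# Score each head by how similarly it arranges the concepts across the two
# formats: Spearman correlation between the two RDMs (a simple RSA score).
rsa_scores = np.array([
    spearmanr(rdm(acts_open[h]), rdm(acts_mc[h]))[0]
    for h in range(n_heads)
])

# Keep the k most format-invariant heads and average their outputs into one
# vector per concept (an assumed aggregation rule for illustration).
k = 8
top_heads = np.argsort(rsa_scores)[-k:]
concept_vectors = acts_open[top_heads].mean(axis=0)  # (n_concepts, d_head)

print("format-invariant heads:", sorted(top_heads.tolist()))
print("concept vector matrix shape:", concept_vectors.shape)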