Keywords: Information Decoding, Vector Symbolic Architectures, Hyperdimensional Computing, LLMs, Probing, Residual Stream, Interpretability, Neurosymbolic, Neural representations, Concept extraction
TL;DR: This work combines symbolic representations and neural probing to introduce the Hyperdimensional Probe, a new paradigm for decoding the LLM vector space into human-interpretable features, consistently extracting meaningful concepts across models and inputs.
Abstract: Despite their capabilities, Large Language Models (LLMs) remain opaque, and our understanding of their internal representations is limited.
Current interpretability methods either focus on input-oriented feature extraction, such as supervised probes and Sparse Autoencoders (SAEs), or on output distribution inspection, such as logit-oriented approaches.
A full understanding of LLM vector spaces, however, requires integrating both perspectives, something existing approaches struggle with due to constraints on latent feature definitions.
We introduce the *Hyperdimensional Probe*, a hybrid supervised probe that combines symbolic representations with neural probing.
Leveraging Vector Symbolic Architectures (VSAs) and hypervector algebra, it unifies prior methods: the top-down interpretability of supervised probes, the sparsity-driven proxy space of SAEs, and output-oriented logit investigation.
This allows deeper input-focused feature extraction while supporting output-oriented investigation.
Our experiments demonstrate that the method consistently extracts meaningful concepts across different LLMs, embedding sizes, and setups, uncovering concept-driven patterns in analogy-oriented inference and QA-focused text generation.
By supporting joint input–output analysis, this work advances semantic understanding of neural representations while unifying the complementary perspectives of prior methods.
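To make the hypervector algebra referenced in the abstract concrete, the following is a minimal sketch of standard VSA operations (binding, bundling, and unbinding against a codebook) of the kind such a probe could build on. The dimensionality, the role/filler names, and the bipolar encoding are illustrative assumptions for this sketch, not the paper's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 10_000  # illustrative hyperdimensional width

def random_hypervector(dim: int = DIM) -> np.ndarray:
    """Random bipolar (+1/-1) hypervector; quasi-orthogonal to others at high dimension."""
    return rng.choice([-1, 1], size=dim)

def bind(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Binding via elementwise multiplication (self-inverse for bipolar codes)."""
    return a * b

def bundle(*vecs: np.ndarray) -> np.ndarray:
    """Bundling (superposition) via elementwise majority, i.e. the sign of the sum."""
    s = np.sum(vecs, axis=0)
    return np.where(s >= 0, 1, -1)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical symbolic codebook of role and filler hypervectors.
roles = {name: random_hypervector() for name in ["subject", "relation", "object"]}
fillers = {name: random_hypervector() for name in ["Paris", "capital_of", "France"]}

# Compose a structured record as a bundle of role-filler bindings.
record = bundle(
    bind(roles["subject"], fillers["Paris"]),
    bind(roles["relation"], fillers["capital_of"]),
    bind(roles["object"], fillers["France"]),
)

# Decode a role by unbinding (multiplying again) and matching against the codebook.
decoded = bind(record, roles["object"])
best = max(fillers, key=lambda k: cosine(decoded, fillers[k]))
print(best)  # expected: "France" (highest cosine similarity)
```

In this kind of algebra, binding and unbinding are the same operation for bipolar codes, so querying a composed record reduces to one multiplication plus a nearest-neighbor lookup in the symbolic codebook.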
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 12823