Abstract: We describe an experiment that isolates conceptual semantics by averaging concept activations derived via Sparse Autoencoders. By translating texts between natural languages and averaging the resulting concept vectors, we can mechanistically interpret meaning from internal states more accurately. We apply the experiment to the domain of Ontology Alignment, which seeks to align concepts across different representations of a domain. Our results show that improvements occur when averaging the concept activations of English texts with those of their French and Chinese translations. The trend of improvement correlates with the reduction in shared symbolic representation from French to Chinese, indicating that the overall process isolates conceptual semantics by averaging out language-specific symbolic representations.
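The averaging step described in the abstract can be illustrated with a minimal sketch. The `sae_encode` function, its random weights, and the per-language hidden states below are hypothetical placeholders standing in for a trained Sparse Autoencoder and real model activations; this is not the authors' implementation.

```python
import numpy as np

def sae_encode(hidden_state: np.ndarray, W_enc: np.ndarray, b_enc: np.ndarray) -> np.ndarray:
    """Hypothetical SAE encoder: sparse concept activations via ReLU(W x + b)."""
    return np.maximum(0.0, W_enc @ hidden_state + b_enc)

def averaged_concepts(hidden_states_by_lang: dict[str, np.ndarray],
                      W_enc: np.ndarray, b_enc: np.ndarray) -> np.ndarray:
    """Average SAE concept activations over translations of the same text.

    The intuition is that language-specific symbolic features cancel out,
    leaving a vector closer to the shared conceptual semantics.
    """
    acts = [sae_encode(h, W_enc, b_enc) for h in hidden_states_by_lang.values()]
    return np.mean(acts, axis=0)

# Toy example with random weights and fake per-language hidden states.
rng = np.random.default_rng(0)
d_model, d_sae = 16, 64
W_enc = rng.normal(size=(d_sae, d_model))
b_enc = np.zeros(d_sae)
hidden = {lang: rng.normal(size=d_model) for lang in ("en", "fr", "zh")}
concept_vec = averaged_concepts(hidden, W_enc, b_enc)
print(concept_vec.shape)  # (64,)
```

In the ontology-alignment setting, such averaged vectors would presumably then be compared across ontologies, for example by cosine similarity, to match concepts.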
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Mechanistic Interpretability, Sparse Autoencoders, Large Language Models, Ontology Alignment, Conceptual Semantics
Contribution Types: Model analysis & interpretability
Languages Studied: English, French, Chinese
Submission Number: 6839