Abstract: Pre-trained language models (PLMs) have demonstrated a remarkable ability to encode factual knowledge. However, the mechanisms underlying how this knowledge is stored and retrieved remain poorly understood, with important implications for AI interpretability and safety. In this paper, we disentangle the multifaceted nature of knowledge: successfully completing a knowledge retrieval task (e.g., “The capital of France is __”) involves mastering underlying concepts (e.g., France, Paris), relationships between these concepts (e.g., capital of), and the structure of prompts, including the language of the query. We propose to disentangle these distinct aspects of knowledge and apply this typology to offer a critical view of neuron-level knowledge attribution techniques. For concreteness, we focus on Dai et al.'s (2022) Knowledge Neurons (KNs) across multiple PLMs (BERT, OPT, Llama, and Gemma), testing 10 natural languages and additional unnatural languages (e.g., Autoprompt).
Our key contributions are twofold: (i) we show that KNs come in different flavors, some indeed encoding entity-level concepts, others playing a much less transparent, more polysemantic role, and (ii) we address the problem of cross-linguistic knowledge sharing at the neuron level; more specifically, we uncover an unprecedented overlap in KNs across languages, in some cases spanning all 10 languages we tested, pointing to the existence of a partially unified, language-agnostic retrieval system. To do so, we introduce and release the Multi-ParaRel dataset, an extension of ParaRel, featuring prompts and paraphrases for cloze-style knowledge retrieval tasks in parallel across 10 languages.
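For readers unfamiliar with the cloze-style knowledge retrieval task described above, the snippet below is a minimal illustrative sketch (not the authors' evaluation code), assuming the HuggingFace transformers library and the bert-base-cased checkpoint, of how a masked PLM can be queried with a prompt such as “The capital of France is __”.

```python
# Minimal sketch: probing a masked PLM with a cloze-style knowledge prompt.
# Assumes the HuggingFace `transformers` package; model choice is illustrative.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-cased")

# The blank in the prompt is expressed as the model's mask token.
predictions = fill("The capital of France is [MASK].")

# Print the top predicted fillers and their probabilities
# (a well-trained model is expected to rank "Paris" highly).
for p in predictions[:5]:
    print(p["token_str"], round(p["score"], 3))
```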
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Jeffrey_Pennington1
Submission Number: 4203