Keywords: NLP, interpretability, multilingual, individual neurons, probing
Abstract: While many studies have shown that linguistic information is encoded in hidden word representations, few have studied individual neurons to reveal how, and in which neurons, that information is encoded. Among these, the common approach is to use an external probe to rank neurons according to their relevance to some linguistic attribute, and to evaluate the resulting ranking using the same probe that produced it. We show that this methodology confounds two distinct factors, probe quality and ranking quality, and we therefore separate them. We compare two recent ranking methods and a novel one we introduce, both by probing and by causal interventions, in which we modify the representations and observe the effect on the model's output. We show that encoded information and used information are not always the same, and that individual neurons can, to some extent, be used to control the model's output. Our method can be used to identify how certain information is encoded, and how to manipulate it for debugging purposes.
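To make the two methodologies mentioned above concrete, the following is a minimal, hypothetical sketch (not the paper's actual experimental setup): a linear probe is trained on synthetic "hidden representations" in which one neuron linearly encodes a binary attribute, neurons are ranked by the probe's absolute weights (one common ranking criterion), and a simple causal intervention zeroes the top-ranked neuron to measure its effect. All data, dimensions, and the ablation scheme here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "hidden representations": 500 samples, 10 neurons.
# Neuron 3 linearly encodes a binary attribute; the rest are noise.
n, d = 500, 10
labels = rng.integers(0, 2, n)
reps = rng.normal(size=(n, d))
reps[:, 3] += 2.0 * labels  # inject the attribute into neuron 3

# Linear probe: logistic regression fit by plain gradient descent.
w = np.zeros(d)
for _ in range(200):
    p = 1 / (1 + np.exp(-(reps @ w)))
    w -= 0.1 * reps.T @ (p - labels) / n

# Rank neurons by absolute probe weight (one simple ranking criterion).
ranking = np.argsort(-np.abs(w))

# Causal intervention: ablate (zero) the top-ranked neuron and compare
# probe accuracy before and after, as a crude proxy for "used" information.
acc = lambda X: (((X @ w) > 0) == labels).mean()
ablated = reps.copy()
ablated[:, ranking[0]] = 0.0
print("top neuron:", ranking[0])
print("accuracy before/after ablation:", acc(reps), acc(ablated))
```

In a real model, the intervention would modify the network's internal activations and measure the change in the model's own output rather than in an external probe, which is precisely where encoded and used information can come apart.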