Can We Statically Locate Knowledge in Large Language Models? Financial Domain and Toxicity Reduction Case Studies

Published: 21 Sept 2024, Last Modified: 06 Oct 2024
Venue: BlackboxNLP 2024
License: CC BY 4.0
Track: Full paper
Keywords: Interpretability, Transformer, Large Language Models, Embedding similarity
TL;DR: We search Transformer parameters directly, using embedding similarity to a user's semantic queries, to investigate the potential of static weight location in real-world scenarios.
Abstract: Current large language model (LLM) evaluations rely on benchmarks to assess model capabilities and their encoded knowledge. However, these evaluations cannot reveal where a model encodes its knowledge, and thus little is known about which weights contain specific information. We propose a method to statically (without forward or backward passes) locate topical knowledge in the weight space of an LLM, building on a prior insight that parameters can be decoded into interpretable tokens. If parameters can be mapped into the embedding space, it should be possible to directly search for knowledge via embedding similarity. We study the validity of this assumption across several LLMs for a variety of concepts in the financial domain and a toxicity detection setup. Our analysis yields an improved understanding of the promises and limitations of static knowledge location in real-world scenarios.
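The idea of searching weights by embedding similarity can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the parameter vectors have already been mapped into the same space as the token embeddings, and the names `locate`, `embeddings`, and `weight_rows`, along with the toy 3-dimensional vectors, are hypothetical and chosen only for illustration. No forward or backward pass is involved; the query is the mean of its token embeddings, and each weight vector is ranked by cosine similarity to it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def locate(query_tokens, embeddings, weight_rows, top_k=2):
    """Rank weight vectors by cosine similarity to the mean query embedding.

    Assumption: every vector in `weight_rows` already lives in the same
    space (and has the same dimension) as the token embeddings.
    """
    vecs = [embeddings[t] for t in query_tokens]
    dim = len(vecs[0])
    # Query vector: element-wise mean of the query token embeddings.
    query = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    scored = [(cosine(query, row), name) for name, row in weight_rows.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(name, score) for score, name in scored[:top_k]]


# Toy data (hypothetical): three token embeddings and three weight vectors.
embeddings = {
    "stock": [1.0, 0.0, 0.0],
    "bond": [0.8, 0.2, 0.0],
    "cat": [0.0, 0.0, 1.0],
}
weight_rows = {
    "layer0.row3": [0.9, 0.1, 0.0],
    "layer1.row7": [0.0, 0.1, 0.9],
    "layer2.row1": [0.5, 0.5, 0.0],
}

hits = locate(["stock", "bond"], embeddings, weight_rows)
for name, score in hits:
    print(f"{name}: {score:.3f}")
```

In this toy setup the finance-like query ranks `layer0.row3` first, since that vector points in the same direction as the averaged query embedding. Whether similarity of this kind reliably identifies the weights that encode a topic is the assumption the paper sets out to test.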
Submission Number: 31