paper,affiliation,non-academia,publication date (first revision),link,model family,model,param count,other models studied,mmlu_accuracy,arc_accuracy,hellaswag_accuracy,notes
"A. Arditi, O. Balcells, A. Syed, W. Gurnee, and N. Nanda. Refusal in llms is mediated by a single direction.",MATS,FALSE,2024-06-17,https://arxiv.org/abs/2406.11717,QWEN Chat ,QWEN Chat 72B,72000000000,"Qwen chat 1.8B, 7B, 14B, 72B
Gemma instruction-tuned 2B, 7B
Yi chat 6B, 34B
Llama-3 instruct 8B, 70B",77.48,68.26,86.47,
"W. Wu, Y. Wang, G. Xiao, H. Peng, and Y. Fu. Retrieval head mechanistically explains long-context factuality.","Peking University, University of Washington, MIT, UIUC, University of Edinburgh",FALSE,2024-04-24,https://arxiv.org/abs/2404.15574,Yi,Yi-34B,34000000000,"Llama2-7b, Llama2-13B, mistral-7B, mixtral-8x7B, Yi-6B, Yi-34B, Qwen1.5-14B",76.35,64.59,85.69,
"A. Madsen, S. Chandar, and S. Reddy. Are self-explanations from large language models faithful? ArXiv,
abs/2401.07927.",Mila – Quebec AI Institute 2 Polytechnique Montréal 3 McGill University 4 Canada CIFAR AI Chair 5 Facebook CIFAR AI Chair,TRUE,2024-01-15,https://arxiv.org/abs/2401.07927,Llama-2,Llama-2 70B,70000000000,"Llama2 (7B), Falcon (40B, 7B),
and Mistral (7B)",69.83,67.32,86.315,
"A. Variengien and E. Winsor. Look before you leap: A universal emergent decomposition of retrieval tasks in
language models.","École Normale Supérieure de Lyon,
École Polytechnique Fédérale de Lausanne, Conjecture",TRUE,2023-12-13,https://arxiv.org/abs/2312.10091,Llama-2,Llama-2 70B,70000000000,"GPT-2, Pythia, falcon and Llama-2 in all available sizes",69.83,67.32,86.315,
"A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, M. Mazeika, A.-K. Dombrowski,
S. Goel, N. Li, M. J. Byun, Z. Wang, A. Mallen, S. Basart, S. Koyejo, D. Song, M. Fredrikson, J. Z. Kolter,
and D. Hendrycks. Representation engineering: A top-down approach to ai transparency.","1Center for AI Safety
2Carnegie Mellon University
3UC Berkeley
4Stanford University
5EleutherAI
6University of Maryland
7Cornell University
8University of Pennsylvania
9University of Illinois Urbana-Champaign",TRUE,2023-10-2,https://arxiv.org/abs/2310.01405,Llama-2,Llama-2 70B,70000000000,,69.83,67.32,86.315,
"E. Todd, M. Li, A. S. Sharma, A. Mueller, B. C. Wallace, and D. Bau. LLMs represent contextual tasks as
compact function vectors. ",Northeastern,FALSE,2023-10-23,https://arxiv.org/abs/2310.15213,Llama-2,Llama-2 70B,70000000000,"Llama-2 13B and 7B, GPT-J (6B), GPT-NeoX (20B)",69.83,67.32,86.315,
"G. Monea, M. Peyrard, M. Josifoski, V. Chaudhary, J. Eisner, E. Kıcıman, H. Palangi, B. Patra, and R. West.
A glitch in the matrix? locating and detecting language model grounding with fakepedia.","EPFL, Microsoft, Univ. Grenoble Alpes",TRUE,2023-12-04,https://arxiv.org/abs/2312.02073,Llama-2,Llama-2 70B,70000000000,,69.83,67.32,86.315,
"G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis. Efficient streaming language models with attention sinks.
Arxiv.","MIT, Meta, CMU, NVIDIA",TRUE,2023-09-29,https://arxiv.org/abs/2309.17453,Llama-2,Llama-2 70B,70000000000,"Llama-2-[7,13,70]B, Falcon-[7,40]B, Pythia-[2.8,6.9,12]B, and MPT-[7,30]B",69.83,67.32,86.315,
"H. Chen, C. Vondrick, and C. Mao. Selfie: Self-interpretation of large language model embeddings.",Columbia,FALSE,2024-03-16,https://arxiv.org/abs/2403.10949,Llama-2,Llama-2 70B,70000000000,,69.83,67.32,86.315,
"M. Yuksekgonul, V. Chandrasekaran, E. Jones, S. Gunasekar, R. Naik, H. Palangi, E. Kamar, and B. Nushi.
Attention satisfies: A constraint-satisfaction lens on factual errors of language models. ","Stanford, UIUC, UC Berkeley, Microsoft",TRUE,2023-09-26,https://arxiv.org/abs/2309.15098,Llama-2,Llama-2 70B,70000000000,"Llama2-7B, Llama2-13B, Llama2-70B",69.83,67.32,86.315,
"N. Y. Siegel, O.-M. Camburu, N. Heess, and M. Perez-Ortiz. The probabilities also matter: A more faithful
metric for faithfulness of free-text explanations in large language models.","Google Deepmind, University College London",TRUE,2024-04-04,https://arxiv.org/abs/2404.03189,Llama-2,Llama-2 70B,70000000000,Llama2-7b Llama2-13b Llama2-70b,69.83,67.32,86.315,
"S. Chen, M. Xiong, J. Liu, Z. Wu, T. Xiao, S. Gao, and J. He. In-context sharpness as alerts: An inner representation perspective for hallucination mitigation.","Stanford, Penn State, City University of Hon Kong, National University of Singapore, Shanghai Jiao Tong University, Hong Kong University of Science and Technology",FALSE,2024-03-03,https://arxiv.org/abs/2403.01548,Llama-2,Llama-2 70B,70000000000,Llama-2 models (chat),69.83,67.32,86.315,
S. Marks and M. Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets.,"Northeastern, MIT",FALSE,2023-10-10,https://arxiv.org/abs/2310.06824,Llama-2,Llama-2 70B,70000000000,Used both Llama-2-13B and Llama-2-70B,69.83,67.32,86.315,
"T. Tang, W. Luo, H. Huang, D. Zhang, X. Wang, X. Zhao, F. Wei, and J.-R. Wen. Language-specific neurons:
The key to multilingual capabilities in large language models.","Gaoling School of Artificial Intelligence, Renmin University of China,  Microsoft Research Asia",TRUE,2024-02-26,https://arxiv.org/abs/2402.16438,Llama-2,Llama-2 70B,70000000000,"Llama-2 (70E10), Llama-2 (7E9), Llama-2 (1.3E10), BLOOM (7E9), OPT (6.7E9) Mistral (7E9), Phi-2 (2.7E9)",69.83,67.32,86.315,
W. Gurnee and M. Tegmark. Language models represent space and time.,MIT,FALSE,2023-10-03,https://arxiv.org/abs/2310.02207,Llama-2,Llama-2 70B,70000000000,Llama-2-70B,69.83,67.32,86.315,
"Q. Wang, T. Anikina, N. Feldhus, J. van Genabith, L. Hennig, and S. Möller. Llmcheckup: Conversational
examination of large language models via interpretability tools.","German Research Center for Artificial Intelligence,  Technische Universität Berlin, Saarland Informatics Campus",FALSE,2024-01-23,https://arxiv.org/abs/2401.12576,Llama-2,Stable Beluga 2 70B,70000000000,Stable Beluga 2-70B,68.79,71.08,86.37,
"A. M. Turner, L. Thiergart, D. Udell, G. Leech, U. Mini, and M. MacDiarmid. Activation addition: Steering
language models without optimization.","DeepMind, MIRI, University of Bristol, Arb Research, MATS, Anthropic",TRUE,2023-08-20,https://arxiv.org/abs/2308.10248,Llama-3,Llama-3 8B,8000000000,"OPT, GPT-2-xl 1.5b , and GPT-J",66.7,60.24,82.23,
"Y.-S. Chuang, Y. Xie, H. Luo, Y. Kim, J. R. Glass, and P. He. Dola: Decoding by contrasting layers improves
factuality in large language models. ","Massachusetts Institute of Technology, Microsoft",FALSE,2023-09-07,https://arxiv.org/abs/2309.03883,Llama,Llama 65B,65000000000,"Llama-7B, Llama-13B, Llama-33B, Llama-65B. 65B used for only one figure",66.15,0,0,
"S. Rajamanoharan, A. Conmy, L. Smith, T. Lieberum, V. Varma, J. Kramár, R. Shah, and N. Nanda. Improving
dictionary learning with gated sparse autoencoders. ",google deepmind,TRUE,2024-04-24,https://arxiv.org/abs/2404.16014,gemma,Gemma-7B,7000000000,also used Pythia-2.8B and GELU-1l (3.1M),66.03,61.09,82.47,
"K. Amara, R. Sevastjanova, and M. El-Assady. Syntaxshap: Syntax-aware explainability method for text
generation.",ETH Zurich,FALSE,2024-02-14,https://arxiv.org/abs/2402.09259,Mistral,Mistral 7B,7000000000,GPT2-medium,62.98,57.995,82.145,
"M. Avitan, R. Cotterell, Y. Goldberg, and S. Ravfogel. What changed? converting representational interventions
to natural language.","Bar-Ilan University, ETH Zurich, Allen Institute for AI",TRUE,2024-02-17,https://arxiv.org/abs/2402.11355,Mistral,Mistral 7B,7000000000,Mistral7b and GPT2,62.98,57.995,82.145,
"R. Hendel, M. Geva, and A. Globerson. In-context learning creates task vectors.","Tel Aviv, Google Deepmind",TRUE,2023-10-24,https://arxiv.org/abs/2310.15916,Llama,Llama 30B,30000000000,Llama 30B,57.8,57.8,82.8,
"N. Varshney, W. Yao, H. Zhang, J. Chen, and D. Yu. A stitch in time saves nine: Detecting and mitigating
hallucinations of llms by validating low-confidence generation.","Arizona State University, Tencent AI Lab",TRUE,2023-07-08,https://arxiv.org/abs/2307.03987,Llama-2,Vicuna-13b,13000000000,"GPT 3.5, Vicuna-13B",56.67,57.08,81.24,
"A. Ghandeharioun, A. Caciularu, A. Pearce, L. Dixon, and M. Geva. Patchscopes: A unifying framework for
inspecting hidden representations of language models.","Google Research, Tel Aviv University",TRUE,2024-01-11,https://arxiv.org/abs/2401.06102,Llama-2,Llama-2 13B,13000000000,"Vicuna 13b, GPT-J 6b and Pythia 12b",55.285,59.39,81.415,
"M. Sun, X. Chen, J. Z. Kolter, and Z. Liu. Massive activations in large language models.","Carnegie Mellon University, Meta AI Research, Bosch Center for AI",TRUE,2024-02-27,https://arxiv.org/abs/2402.17762,Llama-2,Llama-2 13B,13000000000,"Llama2-7B, Llama2-13B, Phi2, Mistral-8x7B, CLIP ViT-L, DINOv2 ViT-L MAE ViT-L",55.285,59.39,81.415,
"S. CH-Wang, B. V. Durme, J. Eisner, and C. Kedzie. Do androids know they’re only dreaming of electric
sheep?.",Microsoft,TRUE,2023-12-28,https://arxiv.org/abs/2312.17249v1,Llama-2,Llama-2 13B,13000000000,Llama2-13b,55.285,59.39,81.415,
"Y. Kwon, E. Wu, K. Wu, and J. Zou. Datainf: Efficiently estimating data influence in loRA-tuned LLMs
and diffusion models.","columbia, stanford",FALSE,2023-10-02,https://arxiv.org/abs/2310.00902,Llama-2,Llama-2 13B,13000000000,"RoBERTa-large, Llama-2-13B-chat, stable-diffusion-v-1.5",55.285,59.39,81.415,
"A. Langedijk, H. Mohebbi, G. Sarti, W. Zuidema, and J. Jumelet. Decoderlens: Layerwise interpretation of
encoder-decoder transformers.","University of Amsterdam, Tilburg University, University of Groningen",FALSE,2023-10-05,https://arxiv.org/abs/2310.03686,flan-t5,flan-t5-xxl,11300000000," custom small transformer, nllb-600m and whisper",55.1,0,0,
"Z. Wu, A. Arora, Z. Wang, A. Geiger, D. Jurafsky, C. D. Manning, and C. Potts. Reft: Representation finetuning
for language models.","Stanford, Pr(AI)^2R Group",FALSE,2024-04-04,https://arxiv.org/abs/2404.03592,Llama,Llama 13B,13000000000,"Llama-7B, Llama-13B, Llama-2-7B, Llama-3-8B, RoBERTa-base, RoBERTa-large",46.9,52.7,79.2,
A. Azaria and T. Mitchell. The internal state of an LLM knows when it’s lying.,"Ariel University, Carnegie Mellon University",FALSE,2023-04-26,https://arxiv.org/abs/2304.13734,Llama-2,Llama-2 7B,7000000000,OPT-6.7b,46.085,53.07,77.895,
"A. Gupta, D. Sajnani, and G. Anumanchipalli. A unified framework for model editing.",UC Berkeley,FALSE,2024-03-21,https://arxiv.org/abs/2403.14236,Llama-2,Llama-2 7B,7000000000,GPT2-xl 1.5b and GPT-J 6b,46.085,53.07,77.895,
"A. Lv, K. Zhang, Y. Chen, Y. Wang, L. Liu, J.-R. Wen, J. Xie, and R. Yan. Interpreting key mechanisms of
factual recall in transformer-based language models.","Gaoling School of Artificial Intelligence, Renmin University of China 2 XiaoMi AI Lab 3 Baichuan Inc",TRUE,2024-03-28,https://arxiv.org/abs/2403.19521,Llama-2,Llama-2 7B,7000000000,"all sizes of GPT-2, 1.3b OPT",46.085,53.07,77.895,
"J. Huang, Z. Wu, C. Potts, M. Geva, and A. Geiger. Ravel: Evaluating interpretability methods on disentangling
language model representations.","Stanford, Tel Aviv, PRAiR",FALSE,2024-02-27,https://arxiv.org/abs/2402.17700,Llama-2,Llama-2 7B,7000000000,,46.085,53.07,77.895,
"J.-C. Gu, H. Xu, J.-Y. Ma, P. Lu, Z.-H. Ling, K. wei Chang, and N. Peng. Model editing can hurt general
abilities of large language models. ","UCLA, University of Science and Technology of China",FALSE,2024-01-09,https://arxiv.org/abs/2401.04700,Llama-2,Llama-2 7B,7000000000,GPT2-xl,46.085,53.07,77.895,
"K. Park, Y. J. Choe, and V. Veitch. The linear representation hypothesis and the geometry of large language
models.",University of Chicago,FALSE,2023-11-07,https://arxiv.org/abs/2311.03658,Llama-2,Llama-2 7B,7000000000,,46.085,53.07,77.895,
"N. Prakash, T. R. Shaham, T. Haklay, Y. Belinkov, and D. Bau. Fine-tuning enhances existing mechanisms: A
case study on entity tracking. ","Northeastern, MIT, Technion",FALSE,2024-02-22,https://arxiv.org/abs/2402.14811,Llama-2,Llama-2 7B,7000000000,"Llama-7b, Vicuna 7B, Goat-7B, Float-7B",46.085,53.07,77.895,
"P. Sharma, J. T. Ash, and D. Misra. The truth is in there: Improving reasoning with layer-selective rank
reduction. ","MIT, Microsoft Research NYC",TRUE,2023-12-21,https://arxiv.org/abs/2312.13558,Llama-2,Llama-2 7B,7000000000,"roBERTa, GPTj, Llama2",46.085,53.07,77.895,
"R. Achtibat, S. M. V. Hatefi, M. Dreyer, A. Jain, T. Wiegand, S. Lapuschkin, and W. Samek. AttnLRP:
Attention-aware layer-wise relevance propagation for transformers.","Fraunhofer Heinrich-Hertz-Institute, Technische Universita ̈t Berlin, BIFOLD – Berlin Institute for the Foundations of Learning and Data",FALSE,2024-02-08,https://arxiv.org/abs/2402.05602,Llama-2,Llama-2 7B,7000000000,"Llama-2-7b, mistral 7b",46.085,53.07,77.895,
"R. Gould, E. Ong, G. Ogden, and A. Conmy. Successor heads: Recurring, interpretable attention heads in the
wild. ","University of Cambridge, Independent",FALSE,2023-12-14,https://arxiv.org/abs/2312.09230,Llama-2,Llama-2 7B,7000000000,"Pythia 1.4b, GPT2xl, Llama2 7b",46.085,53.07,77.895,
"S. Katz, Y. Belinkov, M. Geva, and L. Wolf. Backward lens: Projecting language model gradients into the
vocabulary space.","Technion, Tel Aviv University",FALSE,2024-02-20,https://arxiv.org/abs/2402.12865,Llama-2,Llama-2 7B,7000000000,"GPT2-Small, GPT2-Medium, GPT2-XL",46.085,53.07,77.895,
"S. Singh, S. Ravfogel, J. Herzig, R. Aharoni, R. Cotterell, and P. Kumaraguru. Mimic: Minimally modified
counterfactuals in the representation space.","IIIT Hyderabad, Bar-Ilan University, Google Research, ETH Zurich",TRUE,2024-02-15,https://arxiv.org/abs/2402.09631,Llama-2,Llama-2 7B,7000000000,,46.085,53.07,77.895,
"Y. Jiang, G. Rajendran, P. Ravikumar, B. Aragam, and V. Veitch. On the origins of linear representations in
large language models","university of chicago, carnegie mellon",FALSE,2024-03-06,https://arxiv.org/abs/2403.03867,Llama-2,Llama-2 7B,7000000000,,46.085,53.07,77.895,
"A. Stolfo, Y. Belinkov, and M. Sachan. A mechanistic interpretation of arithmetic reasoning in language
models using causal mediation analysis.","ETH Zürich, Technion – IIT, Israel",FALSE,2023-05-24,https://arxiv.org/abs/2305.15054,Llama,Llama 7B,7000000000,GPT-J 6b and Pythia 2.8b,35.1,47.6,76.1,
"K. Li, O. Patel, F. Viégas, H. Pfister, and M. Wattenberg. Inference-time intervention: Eliciting truthful answers
from a language model. ",Harvard University,FALSE,2023-06-06,https://arxiv.org/abs/2306.03341,Llama,Llama 7B,7000000000,,35.1,47.6,76.1,
"Z. Yu and S. Ananiadou. Locating factual knowledge in large language models: Exploring the residual stream
and analyzing subvalues in vocabulary space.",University of Manchester,FALSE,2023-12-19,https://arxiv.org/abs/2312.12141,Llama,Llama 7B,7000000000,"GPT2-L, Llama-7B",35.1,47.6,76.1,
"C. Chen, K. Liu, Z. Chen, Y. Gu, Y. Wu, M. Tao, Z. Fu, and J. Ye. INSIDE: LLMs’ internal states retain the
power of hallucination detection. ","Alibaba Cloud, Zhejiang University",FALSE,2024-02-06,https://arxiv.org/abs/2402.03744,Llama,Llama 7B,13000000000,Llama 7b,35.1,47.6,76.1,
"J. Merullo, C. Eickhoff, and E. Pavlick. A mechanism for solving relational tasks in transformer language
models.",Brown University,FALSE,2023-05-25,https://arxiv.org/abs/2305.16130,bloom,BLOOM (176B),176000000000,"GPT-2 (all variants), GPT-J",34.975,50.64,74.805,
"E. Voita, J. Ferrando, and C. Nalmpantis. Neurons in large language models: Dead, n-gram, positional.","Meta AI, Universitat Politècnica de Catalunya*",TRUE,2023-09-09,https://arxiv.org/abs/2309.04827,OPT,OPT 66B,66000000000,all OPT sizes,31.495,46.33,74.875,
"G. Paulo, T. Marshall, and N. Belrose. Does transformer interpretability transfer to rnns?",EleutherAI,TRUE,2024-04-09,https://arxiv.org/abs/2404.05971,RWKV v5,RWKV v5,7000000000,,31,0,0,
"B. Chughtai, A. Cooney, and N. Nanda. Summing up the facts: Additive mechanisms behind factual recall in
llms.",Independent,FALSE,2024-02-11,https://arxiv.org/abs/2402.07321,GPT-2,GPT-2 XL,1500000000,"Pythia 2.8b, GPT-J",29.47,30.29,51.13,
B. Millidge and E. Winsor. Basic facts about language model internals. ,Conjecture,TRUE,2023-01-04,https://www.alignmentforum.org/posts/PDLfpRwSynu73mxGw/basic-facts-about-language-model-internals-1,GPT-2,GPT-2 XL,1500000000,,29.47,30.29,51.13,
"F. Zhang and N. Nanda. Towards best practices of activation patching in language models: Metrics and methods.
",UC Berkeley,FALSE,2023-09-27,https://arxiv.org/abs/2309.16042,GPT-2,GPT-2 XL,1500000000,GPT-small,29.47,30.29,51.13,
"G. Kobayashi, T. Kuribayashi, S. Yokoi, and K. Inui. Transformer language models handle word frequency
in prediction head.","Tohoku University, MBZUAI, RIKEN",TRUE,2023-05-23,https://arxiv.org/abs/2305.18294,GPT-2,GPT-2 XL,1500000000,"GPT-2-small to -large, BERT-base and -large",29.47,30.29,51.13,
"J. Ferrando, G. I. Gállego, I. Tsiamas, and M. R. Costa-jussà. Explaining how transformers use context to
build predictions. ","Universitat Politècnica de Catalunya, Meta AI",TRUE,2023-05-21,https://arxiv.org/abs/2305.12535,GPT-2,GPT-2 XL,1500000000,"GPT-2 XL (1.5B) model (Radford et al., 2019), as in (Yin and Neubig, 2022), as well as other autoregressive Transformer language mod- els, such as GPT-2 Small (124M), and GPT-2 Large models (774M), OPT 125M (Zhang et al., 2022b), and BLOOM’s 560M and 1.1B variants",29.47,30.29,51.13,
"J. Huang, A. Geiger, K. D’Oosterlinck, Z. Wu, and C. Potts. Rigorously assessing natural language explanations
of neurons.","Stanford University, Pr(AI)²R Group, Ghent University",FALSE,2023-09-19,https://arxiv.org/abs/2309.10312,GPT-2,GPT-2 XL,1500000000,,29.47,30.29,51.13,
"J. Vig, S. Gehrmann, Y. Belinkov, S. Qian, D. Nevo, Y. Singer, and S. Shieber. Investigating gender bias in
language models using causal mediation analysis.","Salesforce Research, Harvard University, Tel Aviv University",TRUE,2020-12-6,https://dl.acm.org/doi/10.5555/3495724.3496763,GPT-2,GPT-2 XL,1500000000,"GPT2 small, medium, large, extra-large",29.47,30.29,51.13,
"K. Meng, D. Bau, A. Andonian, and Y. Belinkov. Locating and editing factual associations in GPT.","Northeastern, MIT, Technion",FALSE,2022-02-10,https://arxiv.org/abs/2202.05262,GPT-2,GPT-2 XL,1500000000,GPT-2 XL (1.5B parameters),29.47,30.29,51.13,
"N. Goldowsky-Dill, C. MacLeod, L. Sato, and A. Arora. Localizing model behavior with path patching.",Redwood Research,TRUE,2023-04-12,https://arxiv.org/abs/2304.05969,GPT-2,GPT-2 XL,1500000000,"GPT-2 small, GPT2-XL",29.47,30.29,51.13,
"Q. Yu, J. Merullo, and E. Pavlick. Characterizing mechanisms for factual recall in language models.",Brown University,FALSE,2023-10-24,https://arxiv.org/abs/2310.15910,GPT-2,GPT-2 XL,1500000000,"Pythia, GPT2",29.47,30.29,51.13,
S. Heimersheim and A. Turner. Residual stream norms grow exponentially over the forward pass.,Independent,FALSE,2023-05-06,https://www.alignmentforum.org/posts/8mizBCm3dyc432nK8/residual-stream-norms-grow-exponentially-over-the-forward,GPT-2,GPT-2 XL,1500000000,"GPT2-small, GPT2-xl",29.47,30.29,51.13,
S. Rajamanoharan. Progress update 1 from the gdm mech interp team. improving ghost grads.,Google Deepmind,TRUE,2024-04-19,https://www.alignmentforum.org/posts/C5KAZQib3bzzpeyrg/progress-update-1-from-the-gdm-mech-interp-team-full-update,GPT-2,GPT-2 XL,1500000000,,29.47,30.29,51.13,
"W. Gurnee, T. Horsley, Z. C. Guo, T. R. Kheirkhah, Q. Sun, W. Hathaway, N. Nanda, and D. BERTsimas.
Universal neurons in GPT2 language models.","MIT, University of Cambridge",FALSE,2024-01-22,https://arxiv.org/abs/2401.12181,GPT-2,GPT-2 XL,1500000000,,29.47,30.29,51.13,
"Z. Wu, K. D’Oosterlinck, A. Geiger, A. Zur, and C. Potts. Causal proxy models for concept-based model explanations.",1Stanford University 2Ghent University,FALSE,2022-09-28,https://arxiv.org/abs/2209.14279,GPT-2,GPT-2 XL,1500000000,"BERT-base-uncased, RoBERTa-base, GPT-2",29.47,30.29,51.13,
"K. Meng, A. S. Sharma, A. J. Andonian, Y. Belinkov, and D. Bau. Mass-editing memory in a transformer.","MIT, Northeastern, Technion",FALSE,2022-10-13,https://arxiv.org/abs/2210.07229,GPT-NeoX,GPT-NeoX,20000000000,GPT-J (6B) and GPT-NeoX (20B),29.3,45.56,70.925,
"Y. Yao, P. Wang, B. Tian, S. Cheng, Z. Li, S. Deng, H. Chen, and N. Zhang. Editing large language models:
Problems, methods, and opportunities.","zhejiang university, national university of singapore",FALSE,2023-05-22,https://arxiv.org/abs/2305.13172,GPT-NeoX,GPT-NeoX,20000000000,"T5-XL, GPT-J, OPT-13B, GPT-NEOX-20B",29.3,45.56,70.925,
"A. Modarressi, M. Fayyaz, E. Aghazadeh, Y. Yaghoobzadeh, and M. T. Pilehvar. DecompX: Explaining transformers decisions by propagating token decomposition.","1 Center for Information and Language Processing, LMU Munich, Germany
2 Munich Center for Machine Learning (MCML), Germany 3 University of Tehran, Iran
4 Tehran Institute for Advanced Studies, Khatam University, Iran",FALSE,2023-06-05,https://arxiv.org/abs/2306.02873,RoBERTa,RoBERTa-base,125000000, BERT-base-uncased 110m,27.9,0,0,
"G. Kobayashi, T. Kuribayashi, S. Yokoi, and K. Inui. Incorporating Residual and Normalization Layers into
Analysis of Masked Language Models.","Tohoku University,  Langsmith Inc.,  RIKEN",TRUE,2021-09-15,https://arxiv.org/abs/2109.07152,RoBERTa,RoBERTa-base,125000000,BERT-base 110M,27.9,0,0,
"G. Puccetti, A. Rogers, A. Drozd, and F. Dell’Orletta. Outlier dimensions that disrupt transformers are driven by
frequency. ","Scuola Normale Superiore, Istituto di Linguistica Computazionale “Antonio Zampolli”,  University of Copenhagen, RIKEN",FALSE,2022-05-23,https://arxiv.org/abs/2205.11380,RoBERTa,RoBERTa-base,125000000,BERT-base 110M,27.9,0,0,
"H. Mohebbi, W. Zuidema, G. Chrupała, and A. Alishahi. Quantifying context mixing in transformers.","Tilburg University, University of Amsterdam",FALSE,2023-01-30,https://arxiv.org/abs/2301.12971,RoBERTa,RoBERTa-base,125000000,"BERT-base, electra-base (110M)",27.9,0,0,
"X. Wang, K. Wen, Z. Zhang, L. Hou, Z. Liu, and J. Li. Finding skill neurons in pre-trained transformer-based
language models. ",Tsinghua University,FALSE,2022-11-14,https://arxiv.org/abs/2211.07349,RoBERTa,RoBERTa-base,125000000,,27.9,0,0,
"Z. Luo, A. Kulmizev, and X. Mao. Positional artefacts propagate through masked language model embeddings.
","Uppsala University, Fuxi AI Lab",TRUE,2020-11-09,https://arxiv.org/abs/2011.04393,RoBERTa,RoBERTa-base,125000000,"BERT-base, RoBERTa-base",27.9,0,0,
"H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey. Sparse autoencoders find highly interpretable
features in language models.","EleutherAI, MATS, Bristol AI Safety Centre, Apollo Research",TRUE,2023-09-15,https://arxiv.org/abs/2309.08600,Pythia,Pythia 410M,410000000,Pythia 110M,27.25,26.19,40.85,
"W. Rudman, C. Chen, and C. Eickhoff. Outlier dimensions encode task specific knowledge.",Brown University1 University of Tübingen,FALSE,2023-10-26,https://arxiv.org/abs/2310.17715,Pythia,Pythia 410M,410000000,"GPT2, BERT, ALBERT, DistilBERT, Pythia-70M, Pythia-160M, Pythia-410M",27.25,26.19,40.85,
"A. Gupta, A. Rao, and G. K. Anumanchipalli. Model editing at scale leads to gradual and catastrophic forgetting. ",UC Berkeley,FALSE,2024-01-15,https://arxiv.org/abs/2401.07453,GPT-J,GPT-J 6B,6000000000,GPT2-xl 1.5b,27.04,41.38,67.54,
"B. Deiseroth, M. Deb, S. Weinbach, M. Brack, P. Schramowski, and K. Kersting. Atman: Understanding transformer predictions through memory efficient attention manipulation.","1Aleph Alpha 2Technical University Darmstadt
3Hessian Center for Artificial Intelligence (hessian.AI)
4German Center for Artificial Intelligence (DFKI) 5LAION",TRUE,2023-01-19,https://arxiv.org/abs/2301.08110,GPT-J,GPT-J 6B,6000000000,,27.04,41.38,67.54,
"E. Hernandez, A. S. Sharma, T. Haklay, K. Meng, M. Wattenberg, J. Andreas, Y. Belinkov, and D. Bau. Linearity of relation decoding in transformer language models. ","MIT, Northeastern, Technion, Harvard",FALSE,2023-08-17,https://arxiv.org/abs/2308.09124,GPT-J,GPT-J 6B,6000000000,"GPT-2-XL, Llama-13B",27.04,41.38,67.54,
"M. Geva, J. Bastings, K. Filippova, and A. Globerson. Dissecting recall of factual associations in autoregressive language models.","Google Deepmind, Tel Aviv University, Google, Research",TRUE,2023-04-28,https://arxiv.org/abs/2304.14767,GPT-J,GPT-J 6B,6000000000,GPT-2 and GPTJ,27.04,41.38,67.54,
"P. Hase, M. Bansal, B. Kim, and A. Ghandeharioun. Does localization inform editing? surprising differences in
causality-based localization vs. knowledge editing in language models.","Google Research, UNC Chapel Hill",TRUE,2023-01-10,https://arxiv.org/abs/2301.04213,GPT-J,GPT-J 6B,6000000000,GPT-J,27.04,41.38,67.54,
"Z. Li, N. Zhang, Y. Yao, M. Wang, X. Chen, and H. Chen. Unveiling the pitfalls of knowledge editing for
large language models.","zhejiang university, Tencent",TRUE,2023-10-03,https://arxiv.org/abs/2310.02129,GPT-J,GPT-J 6B,6000000000,"GPT2-XL, GPT-J",27.04,41.38,67.54,
"Q. Liu, Y. Chai, S. Wang, Y. Sun, K. Wang, and H. Wu. On training data influence of GPT models. ","Sun Yat-sen University, Baidu Inc., University of Copenhagen",TRUE,2024-04-11,https://arxiv.org/abs/2404.07840,Pythia,Pythia 2.8B,2800000000,Pythia (range of sizes),26.78,36.26,60.66,
"J. Kramár, T. Lieberum, R. Shah, and N. Nanda. Atp*: An efficient and scalable method for localizing llm
behaviour to components.",Google Deepmind,TRUE,2024-03-01,https://arxiv.org/abs/2403.00745,Pythia,Pythia 12B,12000000000,,26.76,38.195,68.82,
"J. Qi, R. Fernández, and A. Bisazza. Cross-lingual consistency of factual knowledge in multilingual language
models.","University of Groningen, University of Amsterdam",FALSE,2023-10-16,https://arxiv.org/abs/2310.10378,bloom,BLOOM (3B),3000000000,"Previous work on multilingual knowl- edge probing (Jiang et al., 2020; Kassner et al., 2021) focused on encoder-only PLMs, such as mBERT (Devlin et al., 2019) or XLM-RoBERTa (Liu et al., 2019). However, since decoder-only PLMs have become mainstream in the current NLP era, our experiments also include the decoder-only BLOOM series (560m, 1.1b, 1.7b, 3b parameters) (Scao et al., 2022) and the encoder-decoder mT5- large (1.2b) (Xue et al., 2021), in addition to the encoder-only XLM-RoBERTa-large (354m).",26.59,35.75,54.37,
"A. Arora, D. Jurafsky, and C. Potts. Causalgym: Benchmarking causal interpretability methods on linguistic
tasks.",Stanford University,FALSE,2024-02-19,https://arxiv.org/abs/2402.12560,Pythia,Pythia 6.9B,6900000000,14M-6.9B,26.48,41.3,67.05,
"W. Gurnee, N. Nanda, M. Pauly, K. Harvey, D. Troitskii, and D. BERTsimas. Finding neurons in a haystack:
Case studies with sparse probing.","MIT, Harvard, Northeastern, Independent",FALSE,2023-05-02,https://arxiv.org/abs/2305.01610,Pythia,Pythia 6.9B,6900000000,,26.48,41.3,67.05,
"Y. Tian, Y. Wang, Z. Zhang, B. Chen, and S. S. Du. JoMA: Demystifying multilayer transformers via joint
dynamics of MLP and attention.","meta, university of washington, carnegie mellon, ut austin",TRUE,2023-10-01,https://arxiv.org/abs/2310.00535,Pythia,Pythia 6.9B,6900000000,"OPT-2.7B, Pythia-70M, Pythia-1.4B",26.48,41.3,67.05,
K. Yin and G. Neubig. Interpreting language models with contrastive explanations. ,"UC Berkeley, CMU",FALSE,2022-02-21,https://arxiv.org/abs/2202.10419,GPT-Neo,GPT-Neo 2.7B,2700000000,GPT-2 (1.5B parameters) and GPT-Neo (2.7 B),26.45,33.36,56.24,
"N. Belrose, Z. Furman, L. Smith, D. Halawi, I. Ostrovsky, L. McKinney, S. Biderman, and J. Steinhardt.
Eliciting latent predictions from transformers with the tuned lens.","EleutherAI, UC Berkeley, FAR AI, University of Toronto, Boston University",TRUE,2023-03-14,https://arxiv.org/abs/2303.08112,GPT-Neo,GPT-Neo 2.7B,2700000000,,26.45,33.36,56.24,
B. Millidge and S. Black. The singular value decompositions of transformer weight matrices are highly interpretable. ,Conjecture,FALSE,2022-11-28,https://www.alignmentforum.org/posts/mkbGjzxD8d8XqKHzA/the-singular-value-decompositions-of-transformer-weight,GPT-2,GPT-2 Large,774000000,,26.07,25.77,45.62,
"C. Guerner, A. Svete, T. Liu, A. Warstadt, and R. Cotterell. A geometric notion of causal probing.",ETH Zurich,FALSE,2023-07-23,https://arxiv.org/abs/2307.15054,GPT-2,GPT-2 Large,774000000,,26.07,25.77,45.62,
"C. Neo, S. B. Cohen, and F. Barez. Interpreting context look-ups in transformers: Investigating attention-mlp
interactions.","Apart Research, University of Edinburgh, University of Oxford, Nanyang Technological University",TRUE,2024-02-23,https://arxiv.org/abs/2402.15055,GPT-2,GPT-2 Large,774000000,,26.07,25.77,45.62,
"M. Geva, A. Caciularu, G. Dar, P. Roit, S. Sadde, M. Shlain, B. Tamir, and Y. Goldberg. LM-debugger: An interactive tool for inspection and intervention in transformer-based language models. ","AI2, BariIlan University, Tel Aviv University, The Hebrew University of Jerusalem",TRUE,2022-04-26,https://arxiv.org/abs/2204.12130,GPT-2,GPT-2 Large,774000000,GPT2 (Medium and Large),26.07,25.77,45.62,
"O. Kovaleva, S. Kulshreshtha, A. Rogers, and A. Rumshisky. BERT busters: Outlier dimensions that disrupt
transformers.","U-Mass Lowell, University of Copenhagen",FALSE,2021-05-14,https://arxiv.org/abs/2105.06990,GPT-2,GPT-2 Large,774000000,"ELECTRA, XLNet, BART, GPT2",26.07,25.77,45.62,
"X. Suau, L. Zappella, and N. Apostoloff. Finding experts in transformer models.",Apple,TRUE,2020-05-15,https://arxiv.org/abs/2005.07647,GPT-2,GPT-2 Large,774000000,"RoBERTa-L, DistilBERT, XLM, BERT-B, GPT2-S, BERT-L, GPT2-M",26.07,25.77,45.62,
"X. Suau, L. Zappella, and N. Apostoloff. Self-conditioning pre-trained language models.",Apple,TRUE,2021-09-30,https://arxiv.org/abs/2110.02802,GPT-2,GPT-2 Large,774000000,,26.07,25.77,45.62,
"B.-D. Oh and W. Schuler. Token-wise decomposition of autoregressive language model hidden states for
analyzing model predictions. ",Ohio State University,FALSE,2023-05-17,https://arxiv.org/abs/2305.10614,OPT,OPT 125M,125000000,,26.02,22.87,31.47,
J. Ferrando and E. Voita. Information flow routes: Automatically interpreting language models at scale.,"Universitat Politècnica de Catalunya, Meta",TRUE,2024-02-27,https://arxiv.org/abs/2403.00824,OPT,OPT 125M,125000000,GPT2-small,26.02,22.87,31.47,
"N. Stoehr, M. Gordon, C. Zhang, and O. Lewis. Localizing paragraph memorization in language models.","ETH Zurich, Google",TRUE,2024-03-28,https://arxiv.org/abs/2403.19851,GPT-Neo,GPT-Neo 125M,125000000,,25.97,22.95,30.26,
B. Wright and L. Sharkey. Addressing feature suppression in saes.,MATS,FALSE,2024-02-16,https://www.alignmentforum.org/posts/3JuSjTZyMzaSeTxKk/addressing-feature-suppression-in-saes,Pythia,Pythia 70M,70000000,,25.9,21.59,27.29,
"L. Quirke, L. Heindrich, W. Gurnee, and N. Nanda. Training dynamics of contextual n-grams in language
models.","Independent, MIT",FALSE,2023-11-01,https://arxiv.org/abs/2311.00863,Pythia,Pythia 70M,70000000,Pythia 70M,25.9,21.59,27.29,
"S. Marks, C. Rager, E. J. Michaud, Y. Belinkov, D. Bau, and A. Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models.",Northeastern University,FALSE,2024-03-28,https://arxiv.org/abs/2403.19647,Pythia,Pythia 70M,70000000,,25.9,21.59,27.29,
"A. Conmy, A. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso. Towards automated circuit
discovery for mechanistic interpretability. ","independent, university college london, university of cambridge",FALSE,2023-04-28,https://arxiv.org/abs/2304.14997,GPT-2,GPT-2 Small,124000000,,25.83,22.01,31.53,
"A. Syed, C. Rager, and A. Conmy. Attribution patching outperforms automated circuit discovery.","University of Maryland, College Park, Independent, Independent",FALSE,2023-10-16,https://arxiv.org/abs/2310.10348,GPT-2,GPT-2 Small,124000000,,25.83,22.01,31.53,
"B. Hoover, H. Strobelt, and S. Gehrmann. exBERT: A Visual Analysis Tool to Explore Learned Representations
in Transformer Models.","IBM Research, Harvard SEAS",TRUE,2019-10-11,https://arxiv.org/abs/1910.05276,GPT-2,GPT-2 Small,124000000,BERT,25.83,22.01,31.53,
"C. Kissane, R. Krzyzanowski, A. Conmy, and N. Nanda. Attention saes scale to GPT-2 small. ",MATS,FALSE,2024-02-03,https://www.alignmentforum.org/posts/FSTRedtjuHa4Gfdbr/attention-saes-scale-to-GPT-2-small,GPT-2,GPT-2 Small,124000000,,25.83,22.01,31.53,
"C. McDougall, A. Conmy, C. Rushing, T. McGrath, and N. Nanda. Copy suppression: Comprehensively
understanding an attention head.","University of Texas at Austin, Google DeepMind, Independent",TRUE,2023-10-06,https://arxiv.org/abs/2310.04625,GPT-2,GPT-2 Small,124000000,,25.83,22.01,31.53,
"C. Tigges, O. J. Hollinsworth, A. Geiger, and N. Nanda. Linear representations of sentiment in large language
models.","EleutherAI Institute, SERI MATS, Stanford University, Pr(AI)²R Group, Independent",TRUE,2023-10-23,https://arxiv.org/abs/2310.15154,GPT-2,GPT-2 Small,124000000,Pythia 85m to 2.8b,25.83,22.01,31.53,
"I. Tenney, P. Xia, B. Chen, A. Wang, A. Poliak, R. T. McCoy, N. Kim, B. V. Durme, S. Bowman, D. Das,
and E. Pavlick. What do you learn from context? probing for sentence structure in contextualized word
representations. ","Google AI Language, Johns Hopkins University, Swarthmore College, New York University, Brown University",TRUE,2019-05-15,https://arxiv.org/abs/1905.06316,GPT-2,GPT-2 Small,124000000,"CoVe, ELMo, GPT, and BERT",25.83,22.01,31.53,
J. Bloom and J. Lin. Understanding SAE features with the logit lens. ,MATS,FALSE,2024-03-10,https://www.alignmentforum.org/posts/qykrYY6rXXM7EEs8Q/understanding-sae-features-with-the-logit-lens,GPT-2,GPT-2 Small,124000000,,25.83,22.01,31.53,
J. Bloom. Open source sparse autoencoders for all residual stream layers of GPT2 small. ,MATS,FALSE,2024-02-02,https://www.alignmentforum.org/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream,GPT-2,GPT-2 Small,124000000,,25.83,22.01,31.53,
J. Vig. A multiscale visualization of attention in the transformer model. ,Palo Alto Research Center,FALSE,2019-06-12,https://arxiv.org/abs/1906.05714,GPT-2,GPT-2 Small,124000000,BERT-base,25.83,22.01,31.53,
"K. Ethayarajh. How contextual are contextualized word representations? Comparing the geometry of BERT,
ELMo, and GPT-2 embeddings. ",Stanford,FALSE,2019-09-02,https://arxiv.org/abs/1909.00512,GPT-2,GPT-2 Small,124000000,"BERT, ELMo, and GPT-2",25.83,22.01,31.53,
"K. R. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt. Interpretability in the wild: a circuit
for indirect object identification in GPT-2 small. ","Redwood Research, UC Berkeley",TRUE,2022-11-01,https://arxiv.org/abs/2211.00593,GPT-2,GPT-2 Small,124000000,GPT-2 Small,25.83,22.01,31.53,
"M. Geva, A. Caciularu, K. Wang, and Y. Goldberg. Transformer feed-forward layers build predictions by
promoting concepts in the vocabulary space. ","AI2, Bar-Ilan University, Independent",TRUE,2022-03-28,https://arxiv.org/abs/2203.14680,GPT-2,GPT-2 Small,124000000,WikiLM and GPT2-small,25.83,22.01,31.53,
"M. Hanna, O. Liu, and A. Variengien. How does GPT-2 compute greater-than?: Interpreting mathematical
abilities in a pre-trained language model.","University of Amsterdam, USC, Redwood Research",TRUE,2023-04-30,https://arxiv.org/abs/2305.00586,GPT-2,GPT-2 Small,124000000,GPT2-small,25.83,22.01,31.53,
"M. Hanna, S. Pezzelle, and Y. Belinkov. Have faith in faithfulness: Going beyond circuit overlap when finding
model mechanisms.","Technion, University of Amsterdam",FALSE,2024-03-26,https://arxiv.org/abs/2403.17806,GPT-2,GPT-2 Small,124000000,,25.83,22.01,31.53,
"M. Sakarvadia, A. Khan, A. Ajith, D. Grzenda, N. Hudson, A. Bauer, K. Chard, and I. Foster. Attention lens:
A tool for mechanistically interpreting the attention head information retrieval mechanism.","U Chicago, Argonne National Laboratory",FALSE,2023-10-25,https://arxiv.org/abs/2310.16270,GPT-2,GPT-2 Small,124000000,GPT2-small,25.83,22.01,31.53,
N. Nanda. Attribution patching: Activation patching at industrial scale.,Independent,FALSE,2023-02-04,https://www.neelnanda.io/mechanistic-interpretability/attribution-patching,GPT-2,GPT-2 Small,124000000,,25.83,22.01,31.53,
nostalgebraist. Interpreting GPT: the logit lens.,Independent,FALSE,2020-08-30,https://www.alignmentforum.org/posts/AcKRB8wDpdaN6v6ru/interpreting-GPT-the-logit-lens,GPT-2,GPT-2 Small,124000000,GPT2,25.83,22.01,31.53,
"R. Krzyzanowski, C. Kissane, A. Conmy, and N. Nanda. We inspected every head in GPT-2 small using saes
so you don’t have to.",MATS,FALSE,2024-03-06,https://www.alignmentforum.org/posts/xmegeW5mqiBsvoaim/we-inspected-every-head-in-GPT-2-small-using-saes-so-you-don,GPT-2,GPT-2 Small,124000000,,25.83,22.01,31.53,
R. Molina. Traveling words: A geometric interpretation of transformers.,Independent,FALSE,2023-09-13,https://arxiv.org/abs/2309.07315,GPT-2,GPT-2 Small,124000000,GPT2-small,25.83,22.01,31.53,
W. Gurnee. Sae reconstruction errors are (empirically) pathological. ,MIT,FALSE,2024-03-29,https://www.alignmentforum.org/posts/rZPiuFxESMxCDHe4B/sae-reconstruction-errors-are-empirically-pathological,GPT-2,GPT-2 Small,124000000,,25.83,22.01,31.53,
"W. Timkey and M. van Schijndel. All bark and no bite: Rogue dimensions in transformer language models
obscure representational quality. ",Cornell University,FALSE,2021-09-09,https://arxiv.org/abs/2109.04404,GPT-2,GPT-2 Small,124000000,"BERT, RoBERTa, GPT-2, XLNet (GPT2, xlnet-base-cased, BERT-base-cased, roBERTa-base)",25.83,22.01,31.53,
"Z. Wu, A. Geiger, J. Huang, A. Arora, T. Icard, C. Potts, and N. D. Goodman. A reply to makelov et al. (2023)’s
""interpretability illusion"" arguments.","Stanford, Pr(AI)^2R Group",FALSE,2024-01-23,https://arxiv.org/abs/2401.12631,GPT-2,GPT-2 Small,124000000,,25.83,22.01,31.53,
C. Rushing and N. Nanda. Explorations of self-repair in language models.,University of Texas at Austin,FALSE,2024-02-23,https://arxiv.org/abs/2402.15390,Pythia,Pythia 160M,160000000,,24.95,22.78,30.34,
"X. L. Li, A. Holtzman, D. Fried, P. Liang, J. Eisner, T. Hashimoto, L. Zettlemoyer, and M. Lewis. Contrastive
decoding: Open-ended text generation as optimization.","Stanford, University of Washington, CMU, Facebook AI Research, Johns Hopkins",TRUE,2022-10-27,https://arxiv.org/abs/2210.15097,OPT,OPT 13B,13000000000,"OPT-13B, OPT-125M",24.9,39.93,71.2,
Z. Zhao and B. Shan. Reagent: A model-agnostic feature attribution method for generative language models.,University of Sheffield,FALSE,2024-02-01,https://arxiv.org/abs/2402.00794,OPT,OPT 6.7B,6700000000,"GPT-{354M,1.5B,6B}, OPT-{350M,1.3B,6.7B}",24.57,39.16,59.48,
"A. Geiger, H. Lu, T. Icard, and C. Potts. Causal abstractions of neural networks. ",Stanford University,FALSE,2021-06-06,https://arxiv.org/abs/2106.02997,BERT,BERT-base,110000000,BLISTM,0,0,40.5,
"A. Geiger, K. Richardson, and C. Potts. Neural natural language inference models partially embed theories
of lexical entailment and negation.","Stanford University, Allen Institute for AI",TRUE,2020-04-30,https://arxiv.org/abs/2004.14623,BERT,BERT-base,110000000,"BiLSTM, ESIM",0,0,40.5,
"A. Geiger, Z. Wu, C. Potts, T. Icard, and N. D. Goodman. Finding alignments between interpretable causal
variables and distributed neural representations.","Stanford University, Pr(AI)²R Group",FALSE,2023-03-05,https://arxiv.org/abs/2303.02536,BERT,BERT-base,110000000,,0,0,40.5,
"A. Geiger, Z. Wu, H. Lu, J. Rozner, E. Kreiss, T. Icard, N. D. Goodman, and C. Potts. Inducing causal structure
for interpretable neural networks.",Stanford University,FALSE,2021-12-01,https://arxiv.org/abs/2112.00826,BERT,BERT-base,110000000,,0,0,40.5,
"D. Dai, L. Dong, Y. Hao, Z. Sui, B. Chang, and F. Wei. Knowledge neurons in pretrained transformers. ","Peking University, Microsoft Research",TRUE,2021-04-18,https://arxiv.org/abs/2104.08696,BERT,BERT-base,110000000,,0,0,40.5,
"G. Brunner, Y. Liu, D. Pascual, O. Richter, M. Ciaramita, and R. Wattenhofer. On identifiability in transformers.
","ETH Zurich, Google Research",TRUE,2019-08-12,https://arxiv.org/abs/1908.04211,BERT,BERT-base,110000000,,0,0,40.5,
"G. Kobayashi, T. Kuribayashi, S. Yokoi, and K. Inui. Attention is not only a weight: Analyzing transformers
with vector norms.","Tohoku University, Langsmith Inc., RIKEN",TRUE,2020-04-21,https://arxiv.org/abs/2004.10102,BERT,BERT-base,110000000,,0,0,40.5,
"H. Chefer, S. Gur, and L. Wolf. Transformer interpretability beyond attention visualization.",Tel Aviv and Facebook AI Research,TRUE,2020-10-17,https://arxiv.org/abs/2012.09838,BERT,BERT-base,110000000,,0,0,40.5,
"J. Bastings, S. EBERT, P. Zablotskaia, A. Sandholm, and K. Filippova. “will you find these shortcuts?” a
protocol for evaluating the faithfulness of input salience methods for text classification. ",Google Research,TRUE,2021-11-14,https://arxiv.org/abs/2111.07367,BERT,BERT-base,110000000,,0,0,40.5,
"J. DeYoung, S. Jain, N. F. Rajani, E. Lehman, C. Xiong, R. Socher, and B. C. Wallace. ERASER: A benchmark
to evaluate rationalized NLP models.","Northeastern University, Salesforce Research",TRUE,2019-11-08,https://arxiv.org/abs/1911.03429v1,BERT,BERT-base,110000000,,0,0,40.5,
J. Enguehard. Sequential integrated gradients: a simple but effective method for explaining language models. ,"Skippr, Babylon Health",TRUE,2023-05-25,https://arxiv.org/abs/2305.15853,BERT,BERT-base,110000000,"BERT, DistilBERT and RoBERTa",0,0,40.5,
"J. Ferrando, G. I. Gállego, and M. R. Costa-jussà. Measuring the mixing of contextual information in the
transformer.",Universitat Politècnica de Catalunya,FALSE,2022-03-08,https://arxiv.org/abs/2203.04212,BERT,BERT-base,110000000,"BERT, DistilBERT and RoBERTa",0,0,40.5,
"K. Clark, U. Khandelwal, O. Levy, and C. D. Manning. What does BERT look at? an analysis of BERT’s
attention.","Stanford, Facebook AI Research",TRUE,2019-06-11,https://arxiv.org/abs/1906.04341,BERT,BERT-base,110000000,,0,0,40.5,
"N. Belrose, D. Schneider-Joseph, S. Ravfogel, R. Cotterell, E. Raff, and S. Biderman. LEACE: Perfect linear
concept erasure in closed form.","EleutherAI, Bar-Ilan University, Booz-Allen Hamilton, ETH Zurich",TRUE,2023-06-06,https://arxiv.org/abs/2306.03819,BERT,BERT-base,110000000,,0,0,40.5,
"O. Kovaleva, A. Romanov, A. Rogers, and A. Rumshisky. Revealing the dark secrets of BERT. ",U-Mass Lowell,FALSE,2019-08-21,https://arxiv.org/abs/1908.08593,BERT,BERT-base,110000000,BERT,0,0,40.5,
"P. Atanasova, J. G. Simonsen, C. Lioma, and I. Augenstein. A diagnostic study of explainability techniques
for text classification.",University of Copenhagen,FALSE,2020-09-25,https://arxiv.org/abs/2009.13295,BERT,BERT-base,110000000,"CNN, BERT, Transformer, LSTM",0,0,40.5,
"P. M. Htut, J. Phang, S. Bordia, and S. R. Bowman. Do attention heads in BERT track syntactic dependencies?",NYU,FALSE,2019-11-27,https://arxiv.org/abs/1911.12246,BERT,BERT-base,110000000,"BERT, Roborta",0,0,40.5,
"P. Michel, O. Levy, and G. Neubig. Are sixteen heads really better than one? ","CMU, Facebook AI Research",TRUE,2019-05-25,https://arxiv.org/abs/1905.10650,BERT,BERT-base,110000000,BERT,0,0,40.5,
"P. Pezeshkpour, S. Jain, S. Singh, and B. Wallace. Combining feature and instance attribution to detect artifacts.
","UC Irvine, Northeastern University",FALSE,2021-07-01,https://arxiv.org/abs/2107.00323,BERT,BERT-base,110000000,BERT,0,0,40.5,
"S. Ravfogel, M. Twiton, Y. Goldberg, and R. D. Cotterell. Linear adversarial concept erasure. ","Bar Ilan University, Allen Institue for Artificial Intelligence, ETH Zurich, Independent",TRUE,2022-01-28,https://arxiv.org/abs/2201.12091,BERT,BERT-base,110000000,Didnt specify whether they used the big or small version of BERT,0,0,40.5,
"T. Mickus, D. Paperno, and M. Constant. How to dissect a Muppet: The structure of transformer embedding
spaces.","University of Helsinki, Utrecht University, Université de Lorraine",FALSE,2022-06-07,https://arxiv.org/abs/2206.03529,BERT,BERT-base,110000000,,0,0,40.5,
"X. Han, B. C. Wallace, and Y. Tsvetkov. Explaining black box predictions and unveiling data artifacts
through influence functions.","CMU, Northeastern University",FALSE,2020-05-14,https://arxiv.org/abs/2005.06676,BERT,BERT-base,110000000,,0,0,40.5,
"Y. Bondarenko, M. Nagel, and T. Blankevoort. Quantizable transformers: Removing outliers by helping attention heads do nothing. ",qualcomm,TRUE,2023-06-22,https://arxiv.org/abs/2306.12929,BERT,BERT-base,110000000,,0,0,40.5,
"Y. Elazar, S. Ravfogel, A. Jacovi, and Y. Goldberg. Amnesic probing: Behavioral explanation with amnesic
counterfactuals. ",Bar-Ilan University,FALSE,2020-06-01,https://arxiv.org/abs/2006.00995,BERT,BERT-base,110000000,,0,0,40.5,
"N. De Cao, W. Aziz, and I. Titov. Editing factual knowledge in language models.","University of Amsterdam, University of Edinburgh",FALSE,2021-04-16,https://arxiv.org/abs/2104.08164,BART,BART base,139000000,BERT (fintuned),0,0,0,
"W. Merrill, V. Ramanujan, Y. Goldberg, R. Schwartz, and N. A. Smith. Effects of parameter norm growth
during transformer training: Inductive bias from gradient descent. ","Allen Institute for AI, New York University, University of Washington, Bar-Ilan University, Hebrew University of Jerusalem",TRUE,2020-10-19,https://arxiv.org/abs/2010.09697,t5,t5-base,220000000,,0,0,0,
"A. Modarressi, M. Fayyaz, Y. Yaghoobzadeh, and M. T. Pilehvar. GlobEnc: Quantifying global token attribution by incorporating the whole encoder layer in transformers. ","Iran University of Science and Technology, Iran 2 University of Tehran, Iran
3 Tehran Institute for Advanced Studies, Khatam University, Iran",FALSE,2022-05-06,https://arxiv.org/abs/2205.03286,BERT,BERT-large,340000000," BERT-base-uncased 110m, electra 335m",0,0,47.3,
"A. Rogers, O. Kovaleva, and A. Rumshisky. A Primer in BERTology: What We Know About How BERT
Works. ","University of Copenhagen, University of Massachusetts Lowell",FALSE,2020-02-27,https://arxiv.org/abs/2002.12327,BERT,BERT-large,340000000,,0,0,47.3,
"A. Y. Din, T. Karidi, L. Choshen, and M. Geva. Jump to conclusions: Short-cutting transformers with linear
transformations.","Hebrew University of Jerusalem, Tel Aviv University",FALSE,2023-03-16,https://arxiv.org/abs/2303.09435,BERT,BERT-large,340000000,"GPT-2, BERT-base-uncased",0,0,47.3,
"G. Kobayashi, T. Kuribayashi, S. Yokoi, and K. Inui. Analyzing feed-forward blocks in transformers through
the lens of attention map. ","Tohoku University, MBZUAI, RIKEN",TRUE,2023-02-01,https://arxiv.org/abs/2302.00456,BERT,BERT-large,340000000,"many other smaller models, incl. 5 other variants of BERT and GPT-2-small",0,0,47.3,
"I. Tenney, D. Das, and E. Pavlick. BERT rediscovers the classical NLP pipeline.","Google Research, Brown University",TRUE,2019-05-15,https://arxiv.org/abs/1905.05950,BERT,BERT-large,340000000,,0,0,47.3,
J. Hewitt and C. D. Manning. A structural probe for finding syntax in word representations.,Stanford,FALSE,2019-06-01,https://aclanthology.org/N19-1419,BERT,BERT-large,340000000,"BERT-base (110M), ELMo-base (130M)",0,0,47.3,
"N. De Cao, M. S. Schlichtkrull, W. Aziz, and I. Titov. How do decisions emerge across layers in neural models?
interpretation with differentiable masking. ","University of Amsterdam, University of Edinburgh",FALSE,2020-04-30,https://arxiv.org/abs/2004.14992,BERT,BERT-large,340000000,"BERT-base, BERT-large",0,0,47.3,
"N. F. Liu, M. Gardner, Y. Belinkov, M. E. Peters, and N. A. Smith. Linguistic knowledge and transferability of contextual representations. ","University of Washington, Allen Institute for AI, Harvard, MIT CSAIL",FALSE,2019-03-21,https://arxiv.org/abs/1903.08855,BERT,BERT-large,340000000,"BERT (based, large) ELMO, GPT2",0,0,47.3,
"T. Bolukbasi, A. Pearce, A. Yuan, A. Coenen, E. Reif, F. Viégas, and M. Wattenberg. An interpretability illusion
for BERT.",Google Research,TRUE,2021-04-21,https://arxiv.org/abs/2104.07143,BERT,BERT-large,340000000,,0,0,47.3,
"T. McCoy, E. Pavlick, and T. Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural
language inference.","Johns Hopkins, Brown University",FALSE,2019-02-04,https://arxiv.org/abs/1902.01007,BERT,BERT-large,340000000,"BERT, ESIM, SPINN, and DA. BERT is the only transformer. ESIM is an RNN, SPINN is a TreeRNN, DA is a bag-of-words.",0,0,47.3,
"T. Pimentel, J. Valvoda, R. H. Maudslay, R. Zmigrod, A. Williams, and R. Cotterell. Information-theoretic
probing for linguistic structure.","University of Cambridge, Facebook AI Research, ETH Zürich",TRUE,2020-04-07,https://arxiv.org/abs/2004.03061,BERT,BERT-large,340000000,"fastText, one-hot encoding vectors",0,0,47.3,
"V. Lal, A. Ma, E. Aflalo, P. Howard, A. Simoes, D. Korat, O. Pereg, G. Singer, and M. Wasserblat. InterpreT: An
interactive visualization tool for interpreting transformers.",Intel Labs,TRUE,2021-04-19,https://aclanthology.org/2021.eacl-demos.17,BERT,BERT-large,340000000,,0,0,47.3,
"Y. Lin, Y. C. Tan, and R. Frank. Open sesame: Getting inside BERT’s linguistic knowledge. ",yale,FALSE,2019-06-04,https://arxiv.org/abs/1906.01698,BERT,BERT-large,340000000,,0,0,47.3,
S. Sanyal and X. Ren. Discretized integrated gradients for explaining language models. ,USC,FALSE,2021-08-31,https://arxiv.org/abs/2108.13654,RoBERTa,RoBERTa-large,354000000,BERT (3.4E8) DistilBERT (6.6E7),0,0,81.7,
"S. Yang, S. Huang, W. Zou, J. Zhang, X. Dai, and J. Chen. Local interpretation of transformer based on linear
decomposition.",Nanjing University,FALSE,2023-07-09,https://aclanthology.org/2023.acl-long.572,RoBERTa,RoBERTa-large,354000000,,0,0,81.7,
"A. Haviv, I. Cohen, J. Gidron, R. Schuster, Y. Goldberg, and M. Geva. Understanding transformer memorization
recall through idioms.","Tel Aviv University, Wild Moose, Bar-Ilan University, Allen Institute for AI",TRUE,2022-10-07,https://arxiv.org/abs/2210.03588,GPT-2,GPT-2 Medium,355000000,"ROBERTA-BASE 125m, T5-BASE 223m, ELECTRA-BASE-GENERATOR 110m",0,0,0,
"G. Dar, M. Geva, A. Gupta, and J. Berant. Analyzing transformers in embedding space.","Tel Aviv University, AI2",TRUE,2022-09-06,https://arxiv.org/abs/2209.02535,GPT-2,GPT-2 Medium,355000000,,0,0,0,
"J. Merullo, C. Eickhoff, and E. Pavlick. Circuit component reuse across tasks in transformer language models.","Brown University, University of Tu ̈bingen",FALSE,2023-10-12,https://arxiv.org/abs/2310.08744,GPT-2,GPT-2 Medium,355000000,,0,0,0,
"M. A. Lepori, T. Serre, and E. Pavlick. Uncovering intermediate variables in transformers using circuit probing.",Brown University,FALSE,2023-11-07,https://arxiv.org/abs/2311.04354,GPT-2,GPT-2 Medium,355000000,GPT2-Small and GPT2-Medium,0,0,0,
S. Katz and Y. Belinkov. VISIT: Visualizing and interpreting the semantic information flow of transformers.,Technion,FALSE,2023-05-22,https://arxiv.org/abs/2305.13417,GPT-2,GPT-2 Medium,355000000,,0,0,0,
"J. Ferrando, G. I. Gállego, B. Alastruey, C. Escolano, and M. R. Costa-jussà. Towards opening the black
box of neural machine translation: Source and target interpretations of the transformer.","Universitat Politècnica de Catalunya, Meta AI",TRUE,2022-05-23,https://arxiv.org/abs/2205.11631,M2M,M2M100,418000000,,0,0,0,
C. Fierro and A. Søgaard. Factual consistency of multilingual pretrained language models. ,University of Copenhagen,FALSE,2022-03-22,https://arxiv.org/abs/2203.11552,XLM-RoBERTa,XLM-RoBERTa-large,550000000,,0,0,0,
"N. Durrani, F. Dalvi, and H. Sajjad. Discovering salient neurons in deep nlp models. ","Hamad Bin Khalifa University, Dalhousie University",FALSE,2022-06-27,https://arxiv.org/abs/2206.13288,XLM-RoBERTa,XLM-RoBERTa-large,550000000,"BERT, ROBERTA and XLNET",0,0,0,
"E. Akyurek, T. Bolukbasi, F. Liu, B. Xiong, I. Tenney, J. Andreas, and K. Guu. Towards tracing knowledge
in language models back to the training data. ","Google Research, MIT CSAIL",TRUE,2022-05-23,https://arxiv.org/abs/2205.11482,mt5,mt5,580000000,,0,0,0,
"G. Sarti, G. Chrupała, M. Nissim, and A. Bisazza. Quantifying the plausibility of context reliance in neural
machine translation. ","University of Groningen, Tilburg University",FALSE,2023-10-02,https://arxiv.org/abs/2310.01188,MBART,MBART,680000000,,0,0,0,
"E. Mitchell, C. Lin, A. Bosselut, C. D. Manning, and C. Finn. Memory-based model editing at scale. ","Stanford, EPFL",FALSE,2022-06-13,https://arxiv.org/abs/2206.06520,t5,t5-large,770000000,"BERT-base (110M), BlenderBot-90M",0,0,38.9,
"R. Nogueira, Z. Jiang, and J. Lin. Investigating the limitations of transformers with simple arithmetic tasks,
2021.",University of Waterloo,FALSE,2021-02-25,https://arxiv.org/abs/2102.13019,t5,t5-large,770000000,T5 models (except 11B),0,0,38.9,
"M. Costa-jussà, E. Smith, C. Ropers, D. Licht, J. Maillard, J. Ferrando, and C. Escolano. Toxicity in
multilingual machine translation at scale.","FAIR (META), Universitat Politècnica de Catalunya",TRUE,2022-10-06,https://arxiv.org/abs/2210.03070,NLLB,NLLB-200,3300000000,NLLB-200 3.3B,0,0,0,
"Z. Wu, A. Geiger, T. Icard, C. Potts, and N. Goodman. Interpretability at scale: Identifying causal mechanisms
in alpaca.",Stanford,FALSE,2023-05-15,https://arxiv.org/abs/2305.08809,Llama,Alpaca,7000000000,,0,0,0,
"C. Burns, H. Ye, D. Klein, and J. Steinhardt. Discovering latent knowledge in language models without supervision. In The Eleventh International Conference on Learning Representations.","UC Berkeley, Peking University, UC Berkeley, UC Berkeley",FALSE,2022-12-07,https://arxiv.org/abs/2212.03827,t5,t5-xxl,11000000000,"UnifiedQA, T0, GPT-J, RoBERTa, DeBERTa",0,0,0,
"E. Mitchell, C. Lin, A. Bosselut, C. Finn, and C. D. Manning. Fast model editing at scale.",Stanford,FALSE,2021-10-21,https://arxiv.org/abs/2110.11309,t5,t5-xxl,11000000000,"T5-XL (2.8B), GPT-J (6B), GPT-Neo (2.7B), BERT-base (110M), BART-base (139M), distilGPT-2 (82M)",0,0,0,
"T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer. GPT3.int8(): 8-bit matrix multiplication for transformers at scale.","University of Washington, Facebook AI Research, Hugging Face, ENS Paris-Saclay",TRUE,2022-08-15,https://arxiv.org/abs/2208.07339,OPT,OPT 175B,175000000000,,0,43.94,0,