From Hidden to Recognized: Direct Decoding of Named Entities from Sparse Autoencoder Features in Large Language Models
Abstract: Large Language Models (LLMs) are increasingly used for Named Entity Recognition (NER) and synthetic data generation, yet their label annotation processes remain largely black boxes. This lack of transparency hinders reliability and control in LLM-based annotation pipelines. To address this, we investigate whether Sparse Autoencoders (SAEs) can extract interpretable features from LLM activations and decode named entities directly from them. Evaluating on general-domain and biomedical NER datasets, we show that SAEs effectively capture entity-relevant features, outperforming standard probing classifiers in the biomedical domain. Our findings suggest that SAEs offer a promising step toward more transparent and controllable LLM-based annotation and synthetic data generation pipelines.
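To make the abstract's setup concrete, the sketch below illustrates the general pattern of decoding labels from SAE features: a sparse autoencoder is trained on hidden activations with a reconstruction plus L1 sparsity objective, and a linear probe is then fit on the resulting features. This is a minimal, hypothetical illustration, not the paper's implementation; all sizes, hyperparameters, and the random placeholder data are assumptions.

```python
# Illustrative sketch only: train a sparse autoencoder on (placeholder) LLM
# activations, then fit a linear probe on its features to predict entity labels.
import torch
import torch.nn as nn

torch.manual_seed(0)

D_MODEL, D_SAE, N_CLASSES = 256, 1024, 5        # hypothetical dimensions
acts = torch.randn(2000, D_MODEL)               # stand-in for token activations
labels = torch.randint(0, N_CLASSES, (2000,))   # stand-in for entity labels

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_sae):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))     # sparse feature activations
        return self.decoder(feats), feats

sae = SparseAutoencoder(D_MODEL, D_SAE)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                                 # sparsity strength (assumed)

for _ in range(200):
    recon, feats = sae(acts)
    # reconstruction loss + L1 penalty encourages sparse, interpretable features
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Linear probe on frozen SAE features, decoding entity labels directly.
with torch.no_grad():
    _, feats = sae(acts)
probe = nn.Linear(D_SAE, N_CLASSES)
popt = torch.optim.Adam(probe.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()
for _ in range(200):
    loss = ce(probe(feats), labels)
    popt.zero_grad(); loss.backward(); popt.step()

print("probe accuracy:", (probe(feats).argmax(-1) == labels).float().mean().item())
```

In practice the same probe would also be fit on the raw activations, so that SAE-feature decoding can be compared against a standard probing-classifier baseline, as described in the abstract.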
Paper Type: Short
Research Area: Information Extraction
Research Area Keywords: named entity recognition, open information extraction, sparse autoencoder, mechanistic interpretability
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English
Submission Number: 2170