From Hidden to Recognized: Direct Decoding of Named Entities from Sparse Autoencoder Features in Large Language Models
Abstract: Large Language Models (LLMs) are increasingly used for Named Entity Recognition (NER) and synthetic data generation, yet their label annotation processes remain largely black boxes. This lack of transparency hinders reliability and control in LLM-based annotation pipelines. To address this, we investigate whether Sparse Autoencoders (SAEs) can extract interpretable features from LLM activations and decode named entities directly from them. Evaluating on general-domain and biomedical NER datasets, we show that SAEs effectively capture entity-relevant features, outperforming standard probing classifiers in the biomedical domain. Our findings suggest that SAEs offer a promising step toward more transparent and controllable LLM-based annotation and synthetic data generation pipelines.
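To make the abstract's setup concrete, the sketch below illustrates the general pattern of decoding labels from SAE features: a sparse autoencoder is trained on hidden activations with a reconstruction plus L1 sparsity objective, and a linear probe is then fit on the resulting features. This is a minimal, hypothetical illustration, not the paper's implementation; all sizes, hyperparameters, and the random placeholder data are assumptions.

```python
# Illustrative sketch only: train a sparse autoencoder on (placeholder) LLM
# activations, then fit a linear probe on its features to predict entity labels.
import torch
import torch.nn as nn

torch.manual_seed(0)

D_MODEL, D_SAE, N_CLASSES = 256, 1024, 5        # hypothetical dimensions
acts = torch.randn(2000, D_MODEL)               # stand-in for token activations
labels = torch.randint(0, N_CLASSES, (2000,))   # stand-in for entity labels

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_sae):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))     # sparse feature activations
        return self.decoder(feats), feats

sae = SparseAutoencoder(D_MODEL, D_SAE)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                                 # sparsity strength (assumed)

for _ in range(200):
    recon, feats = sae(acts)
    # reconstruction loss + L1 penalty encourages sparse, interpretable features
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Linear probe on frozen SAE features, decoding entity labels directly.
with torch.no_grad():
    _, feats = sae(acts)
probe = nn.Linear(D_SAE, N_CLASSES)
popt = torch.optim.Adam(probe.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()
for _ in range(200):
    loss = ce(probe(feats), labels)
    popt.zero_grad(); loss.backward(); popt.step()

print("probe accuracy:", (probe(feats).argmax(-1) == labels).float().mean().item())
```

In practice the same probe would also be fit on the raw activations, so that SAE-feature decoding can be compared against a standard probing-classifier baseline, as described in the abstract.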
Paper Type: Short
Research Area: Information Extraction
Research Area Keywords: named entity recognition, open information extraction, sparse autoencoder, mechanistic interpretability
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English
Submission Number: 2170