Keywords: Large Language Model, Sparse Autoencoder, Steering, Persona
Abstract: Persona-conditioned generation is a core capability of large language models, yet persona consistency degrades under increasing task complexity. Existing approaches treat persona as a surface-level behavioral constraint imposed through prompting or fine-tuning, offering limited interpretability and control. We instead advance a representational account of persona alignment, modeling persona as a latent and distributed structure within internal model representations. Through layer-wise Sparse Autoencoders and causal latent interventions, we identify persona-relevant features across model depth and show that persona signals become increasingly discriminative in deeper layers. We demonstrate that latent steering enables stable and continuous control of persona intensity at inference time without degrading semantic content or general language competence. These results establish latent representation access as a principled alternative to output-level optimization for controllable generation.
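The latent steering described in the abstract can be illustrated with a minimal sketch: encode a residual-stream activation into a sparse code with an SAE, then add a scaled multiple of one feature's decoder direction to control intensity continuously. All weights, dimensions, and the `steer` helper below are hypothetical, for illustration only; they are not the paper's actual model or SAE.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # hypothetical residual-stream and dictionary sizes

# Hypothetical "pretrained" SAE weights (random here, purely illustrative).
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_encode(x):
    # ReLU gives a sparse, non-negative code over the SAE dictionary.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(f):
    # Reconstruct the activation from the sparse code.
    return f @ W_dec + b_dec

def steer(x, feature_idx, alpha):
    # Move the activation along one feature's decoder direction;
    # alpha sets a continuous "persona intensity" at inference time.
    return x + alpha * W_dec[feature_idx]

x = rng.normal(size=d_model)              # a residual-stream activation
x_steered = steer(x, feature_idx=3, alpha=2.0)
```

In this scheme, steering is a pure addition in activation space, so the rest of the forward pass is untouched; the claim that semantics are preserved rests on the chosen feature being persona-specific rather than load-bearing for general competence.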
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: interpretability, feature attribution, probing, model editing, robustness
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 8784