Keywords: Large Language Model, Sparse Autoencoder, Steering, Persona
Abstract: Persona-conditioned generation is a core capability of large language models, yet persona consistency degrades under increasing task complexity. Existing approaches treat persona as a surface-level behavioral constraint imposed through prompting or fine-tuning, offering limited interpretability and control. We instead advance a representational account of persona alignment, modeling persona as a latent and distributed structure within internal model representations. Through layer-wise Sparse Autoencoders and causal latent interventions, we identify persona-relevant features across model depth and show that persona signals become increasingly discriminative in deeper layers. We demonstrate that latent steering enables stable and continuous control of persona intensity at inference time without degrading semantic content or general language competence. These results establish latent representation access as a principled alternative to output-level optimization for controllable generation.
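The latent steering described in the abstract can be illustrated with a minimal sketch: encode a residual-stream activation into a sparse code with an SAE, then add a scaled multiple of one feature's decoder direction to control intensity continuously. All weights, dimensions, and the `steer` helper below are hypothetical, for illustration only; they are not the paper's actual model or SAE.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # hypothetical residual-stream and dictionary sizes

# Hypothetical "pretrained" SAE weights (random here, purely illustrative).
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_encode(x):
    # ReLU gives a sparse, non-negative code over the SAE dictionary.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(f):
    # Reconstruct the activation from the sparse code.
    return f @ W_dec + b_dec

def steer(x, feature_idx, alpha):
    # Move the activation along one feature's decoder direction;
    # alpha sets a continuous "persona intensity" at inference time.
    return x + alpha * W_dec[feature_idx]

x = rng.normal(size=d_model)              # a residual-stream activation
x_steered = steer(x, feature_idx=3, alpha=2.0)
```

In this scheme, steering is a pure addition in activation space, so the rest of the forward pass is untouched; the claim that semantics are preserved rests on the chosen feature being persona-specific rather than load-bearing for general competence.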
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: interpretability, feature attribution, probing, model editing, robustness
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 8784