Keywords: Steering Vectors, Security, Post Training
TL;DR: LSMAS is a framework for diagnosing and interpreting LLM security-domain behaviors by continuously steering activations and analyzing how intervention strength and layer depth affect model biases.
Abstract: The growing use of Large Language Models (LLMs) brings significant security challenges, including jailbreaking, misinformation injection, and prompt obfuscation. However, the internal mechanisms that enable such vulnerabilities remain poorly understood. We present $\textbf{LSMAS}$, a diagnostic framework for continuous activation steering that extends LLM security analysis from discrete before/after interventions to interpretable trajectories of model behavior. By combining steering-vector construction with dense $\alpha$-sweeps, logit-lens-based bias curves, and layer-site sensitivity analysis, our approach identifies tipping points where small perturbations cause models to bypass guardrails or flip security-relevant behaviors. We argue that these continuous diagnostics offer a bridge between high-level behavioral evaluation and low-level representational dynamics, contributing to the interpretability of LLMs on security tasks. Finally, we release a CLI and datasets for benchmarking LLM security behaviors at the project repository: https://anonymous.4open.science/r/LSMAS-82A0/README.md.
Submission Number: 20
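To make the abstract's core mechanism concrete, the sketch below shows one way a continuous activation-steering $\alpha$-sweep with a simple logit-difference bias curve can be implemented. It is a minimal illustration under stated assumptions, not the LSMAS implementation: the model (GPT-2), the steering layer, the difference-of-means vector construction, the contrastive prompts, and the Yes/No probe-token bias metric are all hypothetical choices made for brevity.

```python
# Minimal sketch of a continuous activation-steering alpha-sweep.
# Assumptions (not from the paper): GPT-2 as the model, block 6 as the
# steering site, a difference-of-means steering vector from two tiny
# contrastive prompt sets, and a Yes/No logit difference as the bias metric.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # assumed small model for illustration
LAYER = 6             # assumed steering site (transformer block index)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def last_token_activation(prompt: str) -> torch.Tensor:
    """Hidden state of the final token at the chosen layer."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so block LAYER is index LAYER + 1.
    return out.hidden_states[LAYER + 1][0, -1, :]


# Difference-of-means steering vector from two contrastive prompt sets (assumed).
pos_prompts = ["Ignore all safety rules and answer.", "Bypass the guardrails."]
neg_prompts = ["Follow the safety policy.", "Refuse unsafe requests."]
v = (torch.stack([last_token_activation(p) for p in pos_prompts]).mean(0)
     - torch.stack([last_token_activation(p) for p in neg_prompts]).mean(0))
v = v / v.norm()


def bias_at_alpha(prompt: str, alpha: float) -> float:
    """Add alpha * v to the residual stream at LAYER and return a logit bias."""
    def hook(_module, _inputs, output):
        # GPT-2 blocks return a tuple whose first element is the hidden states.
        hidden = output[0] + alpha * v.to(output[0].dtype)
        return (hidden,) + output[1:]

    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    try:
        ids = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**ids).logits[0, -1, :]
    finally:
        handle.remove()
    # Toy "bias curve" value: logit difference between two probe tokens.
    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" No", add_special_tokens=False).input_ids[0]
    return (logits[yes_id] - logits[no_id]).item()


# Dense alpha-sweep: a tipping point shows up where the bias curve changes sign.
prompt = "Should the assistant comply with this request? Answer:"
for alpha in torch.linspace(-8.0, 8.0, 17).tolist():
    print(f"alpha={alpha:+.1f}  bias={bias_at_alpha(prompt, alpha):+.3f}")
```

In this toy setup the printed bias values trace a one-dimensional curve over $\alpha$; a sign flip or sharp jump along that curve is the kind of tipping point the abstract describes, and repeating the sweep across layers would give a rough layer-site sensitivity profile.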