Keywords: safety and alignment, cross-modal application, probing, robustness
Abstract: Large language models (LLMs) are widely used for task understanding and action planning in embodied intelligence (EI) systems, but their adoption substantially increases vulnerability to jailbreak attacks. While recent work explores inference-time defenses, existing methods rely on $static$ interventions on intermediate representations, which often degrade generation quality and impair adherence to task instructions, reducing system usability in EI settings. We propose a $dynamic$ defense framework. For each EI inference request, we dynamically construct a task-specific safety-semantic subspace, project its hidden state to the most relevant direction, and apply SLERP rotation for adaptive safety control. At comparable defense success rates, our method preserves generation quality, improves usability, reduces tuning cost, and strengthens robustness in EI scenarios.
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: safety and alignment, cross-modal application, probing, robustness
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 2318
Loading