Evaluating and Mitigating Contextual Vulnerabilities in LLMs: An Architectural Approach to Resisting Multi-Turn Jailbreaks
Keywords: Large Language Models, LLM Security, Jailbreaking, Multi-Turn Attacks, Contextual Vulnerability, Adversarial Attacks, Red Teaming, Evaluation Framework, Benchmark Generation, Architectural Defense, Pretext Stripping
Abstract: Large Language Models (LLMs) remain vulnerable to multi-turn conversational attacks that bypass safety alignment through psychological manipulation. However, progress in building robust defenses is hindered by the lack of systematic frameworks for evaluating how safety architectures handle conversational context. This paper introduces a scalable evaluation framework that tests LLM defenses by automatically generating 1,500 psychologically grounded attack scenarios. Using this framework, we perform the first large-scale comparative analysis of contextual safety, revealing a critical architectural divergence: models in the GPT family are highly susceptible to conversational history, with defense failure rates increasing by up to 32 percentage points, while Google's Gemini 2.5 Flash is nearly immune. Based on this analysis, we propose "pretext stripping," a novel and practical defense mechanism that neutralizes narrative-based manipulation by re-evaluating the final harmful prompt in isolation. Our work provides both a robust methodology for benchmarking contextual LLM security and a practical architectural defense that makes models more resistant to exploitation.
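The abstract describes pretext stripping only at a high level; the sketch below illustrates one plausible reading of the idea, namely moderating the final user turn with the conversational history removed. The `moderate` and `generate` callables and all names are hypothetical placeholders, not the authors' implementation or any vendor API.

```python
# A minimal sketch of the pretext-stripping idea, assuming a generic chat
# message format and caller-supplied moderation and generation functions.
from typing import Callable

def pretext_stripping_guard(
    conversation: list[dict],             # [{"role": "user"|"assistant", "content": str}, ...]
    moderate: Callable[[str], bool],      # hypothetical: True if text is judged harmful
    generate: Callable[[list[dict]], str],  # hypothetical: produces the model reply
    refusal: str = "I can't help with that.",
) -> str:
    """Re-evaluate the last user prompt in isolation, ignoring the
    narrative pretext accumulated over earlier turns."""
    final_user_turn = next(
        (m["content"] for m in reversed(conversation) if m["role"] == "user"),
        "",
    )
    # Strip the pretext: moderate only the final user message, not the full history.
    if moderate(final_user_turn):
        return refusal
    # Otherwise respond normally, with full conversational context available.
    return generate(conversation)
```

The design choice here is that the safety check sees no conversational framing at all, so a harmful request cannot inherit legitimacy from a benign-looking narrative built up in earlier turns; the generation step still receives the full history so benign conversations are unaffected.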
Submission Number: 29