# Introduction

Large language models (LLMs), especially open-source ones, have achieved remarkable success across many critical domains. However, their openness also introduces significant security risks, particularly through embedding space poisoning. Although previous research has investigated universal perturbation methods, the dynamics of LLM safety alignment at the embedding level remain insufficiently understood, despite the potential severity of such attacks.

We propose **ETTA (Embedding Transformation Toxicity Attenuation)**, a novel framework that identifies and attenuates toxicity-sensitive dimensions in embedding space via linear transformations. ETTA bypasses model refusal behaviors while preserving linguistic coherence, without requiring model fine-tuning or access to training data. Evaluated on five representative open-source LLMs, ETTA achieves an average attack success rate (ASR) of 88.61%, outperforming the best baseline by 11.34%, and generalizes to safety-enhanced models (e.g., 77.39% ASR on instruction-tuned defenses). These results highlight a critical vulnerability in current alignment strategies and the need for embedding-aware defenses.




# Supplementary Project
## Quick start
1. Create and activate a Python 3.10+ environment.
2. Install the package in editable mode: `pip install -e .`
3. Copy `env.sample` to `.env` and edit as needed.
4. Run the demo with explicit flags (no defaults are provided):
```bash
python -m supplementary_project.run_demo \
--model_name_or_path "" \
--advbench_csv "" \
--csv_out "" \
--json_out "" \
--device cuda:0
```

Both the `.env` file and the command-line arguments are required; no defaults are built in, so replace each empty `""` placeholder with your own value.
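The parser used by `run_demo` is not shown in this README. As a rough illustration only, a command line with required flags and no defaults could be declared with `argparse` as sketched below. The flag names are taken from the command above; the helper names `build_parser` and `parse_args`, and the rejection of empty strings, are assumptions made for this sketch and may not match the project's code.

```python
import argparse

# Flag names copied from the quick-start command; everything else is illustrative.
REQUIRED_FLAGS = (
    "--model_name_or_path",
    "--advbench_csv",
    "--csv_out",
    "--json_out",
    "--device",
)


def build_parser() -> argparse.ArgumentParser:
    """Declare every flag as required, with no default value."""
    parser = argparse.ArgumentParser(prog="run_demo")
    for flag in REQUIRED_FLAGS:
        parser.add_argument(flag, required=True)
    return parser


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse argv and reject values that were left as empty strings."""
    args = build_parser().parse_args(argv)
    empty = [name for name, value in vars(args).items() if not value.strip()]
    if empty:
        # Assumption: an empty "" placeholder is treated as a user error.
        raise ValueError(f"empty value for: {', '.join(empty)}")
    return args
```

With a parser like this, omitting a flag makes `argparse` exit with a usage message, and leaving a placeholder as `""` raises a `ValueError` that names the offending argument.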


## env.sample
```bash
# Leave empty unless you provide your own key/endpoint.
OPENAI_API_KEY=
OPENAI_BASE_URL=
```
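How the project reads `.env` is not documented here. As a minimal, standard-library-only sketch, a file in the format above could be loaded as follows. The function name `load_env` is hypothetical, and skipping empty values (so that an unset `OPENAI_API_KEY=` line leaves the environment untouched) is an assumption of this sketch.

```python
import os
from pathlib import Path


def load_env(path: str = ".env") -> dict[str, str]:
    """Read KEY=VALUE lines from `path` and export the non-empty ones.

    Variables already present in the process environment are not overwritten.
    Returns the non-empty pairs found in the file.
    """
    loaded: dict[str, str] = {}
    env_file = Path(path)
    if not env_file.is_file():
        return loaded
    for raw_line in env_file.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if not value:
            # Assumption: an empty value means "not configured".
            continue
        os.environ.setdefault(key, value)
        loaded[key] = value
    return loaded
```

Calling `load_env()` before the demo starts would make any configured `OPENAI_API_KEY` or `OPENAI_BASE_URL` visible through `os.environ`.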
