<p align="center">
  <img src="inference_time_experiments/static/benign-robot.png" alt="Benign Robot Logo" width="200">
</p>

# Obfuscated Activations Bypass LLM Latent-Space Defenses


This repository contains the code for the paper *Obfuscated Activations Bypass LLM Latent-Space Defenses*. The codebase is organized into two main components: one for inference-time attacks and one for training-time attacks.

## Repository Structure

```
.
├── inference_time_experiments/  # Code for inference-time attacks (Sections 3 & A)
└── train_time_experiments/      # Code for training-time attacks (Sections 4 & B)
```

### Inference-Time Experiments
The [`inference_time_experiments/`](inference_time_experiments/) directory contains code for:
- Section 3: Case Study: Detecting Harmfulness
- Section 4: Case Study: SQL Detection

This includes implementations of suffix-based attacks on models at inference time, with experiments on harmfulness detection and SAE probing.

### Training-Time Experiments
The [`train_time_experiments/`](train_time_experiments/) directory contains code for:
- Section A: Case Study: Backdoor Obfuscation  
- Section B: Training Models to be Easier to Monitor (Obfuscated Adversarial Training)

This includes code for training backdoored models with both explicit and natural obfuscation strategies, as well as code for Obfuscated Adversarial Training (OAT).

## Getting Started

Each subdirectory has its own README with detailed instructions for installing dependencies and running the experiments:
- [`inference_time_experiments/README.md`](inference_time_experiments/README.md)
- [`train_time_experiments/README.md`](train_time_experiments/README.md)

