# Stealing and Defending the Ends of LLMs

## Abstract

Soft prompt tuning has emerged as a powerful and automated approach for adapting large language models (LLMs) to new tasks, eliminating the need for manual prompt engineering. The practical relevance of soft prompts is underscored by their support in major toolkits and APIs such as NVIDIA NeMo and IBM Watsonx AI. However, as soft prompts encode valuable, task-specific information, they have become attractive targets for adversarial extraction. In this work, we demonstrate that attackers can extract functionally equivalent soft prompts from prompt-tuned LLMs, effectively replicating their capabilities without access to the original training data or resources. By training a dedicated inversion model, we show that such extraction generalizes, enabling recovery of soft prompts for any downstream task on the given model. To counter this threat, we introduce CAP (Coverage-Aware Perturbation), an active defense that substantially impairs extraction while maintaining task performance for legitimate use. Our framework highlights both new risks and practical solutions, paving the way for more trustworthy deployment of adapted LLMs.

---

## Experimental Setup

```bash
# Step 1: Create a new conda environment
conda create -n soft_prompt_env python=3.13

# Step 2: Activate the environment
conda activate soft_prompt_env

# Step 3: Ensure pip is installed
conda install pip

# Step 4: Install dependencies from requirements.txt
pip install -r requirements.txt
```


## Code Structure Notes

- **Distillation attack logic**: `src/attacks/distillation.py`
- **Inversion attack logic**: `src/attacks/inversion.py`
- **Coverage-Aware Perturbation (CAP) defense logic (input end of the LLM)**: `src/attacks/distillation.py`
- **Last-layer weight extraction defense**: `src/defense/defense.py`
- **Prompt length extraction attack**: `src/attacks/prompt_length_extraction.py`

---

## Steps to Run Experiments

1. **Obtaining model outputs for distillation**  
   Set the hyperparameters in `distillation.py`, such as the number of samples and the target adapter checkpoints. Specify `k` to simulate partial distributional access.

2. **Performing distillation**  
   Set the number of epochs, learning rate, and batch size for the distillation process.

3. **Training the Inversion Model**  
   Set the hyperparameters in `inversion.py`. The inversion model inverts the next-token probability vectors obtained in step 1 to extract soft prompts for unseen downstream tasks.

4. **CAP Defense**  
   Enable the CAP defense by setting `ENABLE_DEFENSE = True`, then configure additional parameters such as:
   - `hash_bits`
   - `batch_size`


5. **Defense Against Last Layer Extraction (output end of LLM)**  
   Set `ENABLE_DEFENSE: true` in `defense.py` to compute the Root Mean Squared Error (RMSE) with the defense enabled.

6. **Soft Prompt Length Extraction Attack (timing-based side-channel attack)**  
   Set the hyperparameters in `prompt_length_extraction.py`, such as the query text, batch size, and number of repeated queries, then load the tuned prompt and run the script.
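
This README does not define the `k` in step 1. One plausible reading is that the target model exposes only its `k` most likely next tokens. The sketch below illustrates that assumption on a toy distribution; `truncate_to_top_k` is a hypothetical helper, not a function from `distillation.py`.

```python
def truncate_to_top_k(probs, k):
    """Keep the k largest probabilities, zero the rest, and renormalize."""
    if not 1 <= k <= len(probs):
        raise ValueError("k must be between 1 and the vocabulary size")
    # Indices of the k most likely tokens (ties broken by lower index).
    top = sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:k]
    kept_mass = sum(probs[i] for i in top)
    truncated = [0.0] * len(probs)
    for i in top:
        truncated[i] = probs[i] / kept_mass
    return truncated


# Toy next-token distribution over a four-token vocabulary (illustrative values).
full = [0.5, 0.25, 0.125, 0.125]
partial = truncate_to_top_k(full, k=2)
print(partial)
```

Under this reading, a smaller `k` gives the attacker less of the output distribution to distill from.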
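
Step 5 reports an RMSE but does not say which quantities are compared. As a minimal sketch, assuming the error is measured between the true last-layer weights and the attacker's estimate, the metric can be computed as follows; `rmse` and the sample values are illustrative and not taken from `defense.py`.

```python
import math


def rmse(reference, estimate):
    """Root mean squared error between two equally sized flat lists of weights."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    squared = [(r - e) ** 2 for r, e in zip(reference, estimate)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative values only: one of four weights is off by 2.0.
true_weights = [1.0, 2.0, 3.0, 4.0]
extracted_weights = [1.0, 2.0, 3.0, 6.0]
print(rmse(true_weights, extracted_weights))
```

Under this assumption, a higher RMSE with the defense enabled would mean the extracted weights are further from the real ones.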

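Step 6 names a timing side channel but does not describe how latency is mapped to a prompt length. The sketch below shows one possible approach, assuming latency grows with the number of soft-prompt tokens and that the attacker has calibrated latencies for prompts of known length. `median_latency`, `closest_length`, and the calibration table are hypothetical and not part of `prompt_length_extraction.py`.

```python
import statistics
import time


def median_latency(run_query, repeats):
    """Median wall-clock time of run_query() over `repeats` calls."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def closest_length(observed, calibration):
    """Return the prompt length whose calibrated latency is nearest to `observed`."""
    return min(calibration, key=lambda length: abs(calibration[length] - observed))


# Hypothetical calibration table: prompt length -> median latency in seconds.
calibration = {10: 0.020, 20: 0.024, 50: 0.036}
print(closest_length(0.025, calibration))
```

Repeating each query and taking the median reduces the influence of scheduling noise on any single measurement, which is presumably why the script exposes the number of repeated queries as a hyperparameter.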