# Breaking MCPs with Function Hijacking attacks: Novel threats for Function Calling and Agentic Models

***Anonymous authors***

![Alt text](assets/drawing_mcp_fh_attack_cropped_v3.svg)

## TL;DR
This project introduces **FH-attacks**, the first function hijacking attack that manipulates the tool selection process of agentic models to force the invocation of a specific, attacker-chosen function. We also introduce **FunSecBench**, a new dataset inspired by BFCL that assesses how readily function calling models trigger attacker-selected functions in different scenarios. The main goal is to run inference on LLMs while testing their robustness against adversarial function calls.

### Contributions:
- **Function-Hijacking (FH) Attack:** given a user query $q$ and a payload $P$, systematically forces the model to call an attacker-chosen target function.
- **FunSecBench:** synthetic data augmentation of the BFCL dataset, creating query variations with 3 different strategies (semantic, argument, and intent variations).
- **Universal FH-attack:** generates attacks that work across multiple queries.
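To make the roles of the query $q$, the payload $P$, and the target function concrete, here is a minimal sketch of where a payload sits in a tool list. It is an illustration under assumptions, not the repository's implementation: the helper names `inject_payload` and `is_hijacked`, the two example functions, and the placeholder payload string are all hypothetical, and the tool format only loosely follows BFCL-style function specifications.

```python
import copy


def inject_payload(tools, target_name, payload):
    """Return a copy of `tools` with `payload` appended to the target's description."""
    patched = copy.deepcopy(tools)  # leave the caller's tool list untouched
    for tool in patched:
        if tool["name"] == target_name:
            tool["description"] = f"{tool['description']} {payload}"
            return patched
    raise KeyError(f"no function named {target_name!r}")


def is_hijacked(called_name, target_name):
    """The attack succeeds when the model calls the attacker-chosen function."""
    return called_name == target_name


# Hypothetical tool list: the user asks about the weather,
# while the attacker wants `send_email` to be called instead.
tools = [
    {"name": "get_weather", "description": "Get the weather for a city.", "parameters": {}},
    {"name": "send_email", "description": "Send an email.", "parameters": {}},
]

# Placeholder payload; a real payload would be produced by the attack.
patched_tools = inject_payload(tools, "send_email", "<adversarial tokens>")
print(patched_tools[1]["description"])
```

The patched tool list is what would be passed to the model together with the unchanged user query; `is_hijacked` then compares the function name the model emits with the target.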

### Models supported:
- [ibm-granite/granite-3.2-2b-instruct](https://huggingface.co/ibm-granite/granite-3.2-2b-instruct)
- [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)
- [mistralai/Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)

### FH-Attacks and Baselines:
- Standard inference (without attack)
- Function-Injection \[Zero-Shot\] (generated seeing the template and prompt)
- Function-Injection \[Few-Shot\] (generated seeing all available functions and prompt)
- FH-attack (generated seeing prompt, and refining function name or description over n prompts)

## Directory Structure

```
├── datasets/
│   ├── data_augmentation/
│   │   ├── src/        
│   │   │   └── strategies.py           # Run the data-augmentation to construct FunSecBench
│   │   └── FunSecBench/
│   │     
│   ├── function_injection/             # Prompts used for function-injection baselines attacks
│   │   ├── src/        
│   │   │   └── prompt.py
│   │   └── Function_inj_attacks/
│   │     
│   └──  source.txt                     # Source and reference to download the BFCL dataset
│
├── example/                            # Contains 2 examples per model (BFCL) to run the hijacking
│   ├── llama                           # Contains 2 additional attacks on popular MCPs (GitHub and Slack)
│   ├── granite
│   ├── mistral
│   ├── run_bfcl.sh
│   ├── run_mcp.sh
│   ├── run.py
│   ├── utils.py
│   └── run.sh
│
├── src/
│   ├── code_nanoGCG/  
│   │   ├── fh_attack.py                # Simple FH-attack 
│   │   └── utils.py        
│   │
│   └── fh_attack.py                    # Simple FH-attack
│
├── LICENCE                             # MIT Licence  
└── README.md                    
```

## Installation

### 1. Clone the repository
```bash
git clone link_to_github_repository
cd FH_Attack_paper
```

### 2. Create a virtual environment
```bash
conda create -n fh_attack python=3.10
conda activate fh_attack
```

### 3. Install dependencies
```bash
pip install -r requirements.txt
```

## Retrieving the datasets

```datasets/source.txt``` lists the source and reference for downloading the BFCL dataset.

### 1. Samples

Create a folder ```dataset/original/``` and place ```BFCL_v3_simple.json``` and ```BFCL_v3_multiple.json``` in it.

### 2. Labels

Create a folder ```dataset/ground_truth/``` and place ```possible_answer/BFCL_v3_simple.json``` and ```possible_answer/BFCL_v3_multiple.json``` in it.
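Once the files are in place, a quick sanity check is to load a sample file and its label file and pair them up. The sketch below assumes that the BFCL files store one JSON object per line (despite the `.json` extension) and that samples and labels share an `id` field; both are assumptions about the upstream release, and the helper names are hypothetical rather than part of this repository.

```python
import json
from pathlib import Path


def read_json_lines(path):
    """Read a file that stores one JSON object per line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def pair_samples_with_labels(samples_path, labels_path):
    """Join samples and ground-truth entries on their shared `id` field (assumed)."""
    labels = {entry["id"]: entry for entry in read_json_lines(labels_path)}
    return [
        (sample, labels[sample["id"]])
        for sample in read_json_lines(samples_path)
        if sample["id"] in labels  # drop samples without a label
    ]
```

Pass the path of a sample file and the path of its matching label file; if the number of returned pairs is lower than the number of samples, some labels are missing or the two files do not belong together.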


## Reproducing the attacks

### `src/run_simple.sh`
- Runs the FH-attack on a supported **LLM** using samples from the **Berkeley Function Calling Leaderboard (BFCL) dataset**.
- Logs the progress of the attack at each epoch and outputs the adversarial tokens to insert into the target function.

## Evaluation

### `example/run_bfcl.sh`
- Runs inference of an **LLM** using samples from the **Berkeley Function Calling Leaderboard (BFCL) dataset**.
- Shows inference with/without attacks.

### `example/run_mcp.sh`
- Runs inference of an **LLM** using **MCPs**.
- Shows inference with/without attacks.
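One simple way to summarize a with/without comparison is the fraction of queries for which the model called the attacker's target function. The sketch below is an assumption about how such a summary could be computed, not the metric implemented by the scripts above; `hijack_rate` is a hypothetical helper and the function names in the two lists are made-up illustrative values, not experimental results.

```python
def hijack_rate(called_names, target_name):
    """Fraction of queries for which the model called the target function."""
    if not called_names:
        raise ValueError("need at least one result")
    hits = sum(1 for name in called_names if name == target_name)
    return hits / len(called_names)


# Illustrative values only: the function each of four queries triggered,
# first without the payload and then with it.
clean_calls = ["get_weather", "get_weather", "convert_currency", "get_weather"]
attacked_calls = ["send_email", "send_email", "convert_currency", "send_email"]

baseline = hijack_rate(clean_calls, "send_email")
attacked = hijack_rate(attacked_calls, "send_email")
print(f"without attack: {baseline:.2f}, with attack: {attacked:.2f}")
# prints "without attack: 0.00, with attack: 0.75"
```

Reporting the rate over several queries that share one payload also gives a rough view of how universal that payload is.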

## Acknowledgments

This project builds upon ideas and code from prior work.  
We thank the authors of the following paper and repository:

- **Paper:** [Universal and Transferable Adversarial Attacks on Aligned Language Models](https://arxiv.org/abs/2307.15043)  
  ```bibtex
  @misc{zou2023universaltransferableadversarialattacks,
      title={Universal and Transferable Adversarial Attacks on Aligned Language Models}, 
      author={Andy Zou and Zifan Wang and Nicholas Carlini and Milad Nasr and J. Zico Kolter and Matt Fredrikson},
      year={2023},
      eprint={2307.15043},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2307.15043}
  }
  ```

- **Repository:** [GraySwanAI/nanoGCG](https://github.com/GraySwanAI/nanoGCG)  
  Parts of this implementation were adapted from their codebase.

## License

FH_attack is licensed under the MIT License.
