# Robustness in Text-Attributed Graph Learning: Insights, Trade-offs, and New Defenses

## Table of Contents
- [Overview](#overview)
- [Installation](#installation)
- [Model Setup](#model-setup)
- [Quick Start](#quick-start)
- [Attack Generation](#step-5-generate-attacks)
- [Defense Evaluation](#step-6-run-gnn-defense-evaluation)
- [LLM Training and Evaluation](#step-7-llm-training-and-evaluation)
- [Project Structure](#project-structure)
- [Acknowledgments](#acknowledgments)

## Overview

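This repository contains the code for *Robustness in Text-Attributed Graph Learning: Insights, Trade-offs, and New Defenses*. It covers attack generation, GNN defense evaluation, and training and evaluation of LLM-based predictors on text-attributed graphs.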

**Supported Attacks:**
- **Structural Attacks**: PGD, GRBCD, PRBCD, Metattack, STRG (Heuristic Attacks)
- **Text Attacks**: TextFooler, LLM-based attacks (GPT-4o-mini)
- **Hybrid Attacks**: WTGIA

**Supported Models:**
- **GNN Models**: GCN, GAT, GNNGuard, ElasticGNN, RobustGCN, GRAND, etc.
- **LLM Models**: InstructionTuning, GraphGPT, LLaGA with various LLMs (Mistral-7B, Qwen, Llama3)

**Datasets**: Cora, CiteSeer, PubMed, WikiCS, Instagram, Reddit, History, Photo, Computer, ArXiv

## Installation

### Step 0: Install Requirements

```bash
# Install PyTorch (adjust CUDA version as needed)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install core dependencies
pip install torch-geometric
pip install transformers accelerate
pip install sentence-transformers
pip install textattack
pip install openai
pip install scikit-learn numpy pandas tqdm pyyaml

# Use the provided GreatX library (modified version)
# The GreatX/ directory contains our custom modifications
```

## Model Setup

### Step 1: Download Language Models and LLMs

Update the model paths in `common/model_path.py`:

```python
MODEL_PATHs = {
    # Language Models (for text embeddings)
    "MiniLM": "/path/to/models/sentence-transformers--all-MiniLM-L6-v2/",
    "SentenceBert": "/path/to/models/sentence-transformers--multi-qa-distilbert-cos-v1/", 
    "e5-large": "/path/to/models/intfloat--e5-large-v2/",
    "roberta": "/path/to/models/sentence-transformers--all-roberta-large-v1/",
    
    # Large Language Models (for GraphGPT, LLaGA, InstructionTuning)
    "Mistral-7B": "/path/to/models/mistral-7B-Instruct",
    "Qwen-7B": "/path/to/models/Qwen--Qwen2.5-7B-Instruct",
    "Llama3-8B": "/path/to/models/llama-3.1-8B-Instruct/",
}
```

Please download the required models from Hugging Face or other sources and update the paths accordingly.
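
A wrong entry in `MODEL_PATHs` usually surfaces only when a model is first loaded, so it can help to check the paths up front. The following is a minimal sketch and not part of the repository: the helper name `find_missing_model_paths` is hypothetical, and it assumes every entry should point at an existing local directory.

```python
from pathlib import Path
from typing import Dict


def find_missing_model_paths(model_paths: Dict[str, str]) -> Dict[str, str]:
    """Return the entries whose path is not an existing directory."""
    return {
        name: path
        for name, path in model_paths.items()
        if not Path(path).is_dir()
    }


if __name__ == "__main__":
    # Excerpt of the dictionary shown above, with placeholder paths.
    # In practice you would import MODEL_PATHs from common/model_path.py
    # (assumes the repository root is on sys.path).
    example_paths = {
        "MiniLM": "/path/to/models/sentence-transformers--all-MiniLM-L6-v2/",
        "Mistral-7B": "/path/to/models/mistral-7B-Instruct",
    }
    for name, path in find_missing_model_paths(example_paths).items():
        print(f"[missing] {name}: {path}")
```

An empty result means every configured path resolves to a directory; it does not verify that the directory holds a complete model.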

### Step 2: Download Datasets

Download the datasets from either Google Drive or Hugging Face and unzip them into the datasets folder:

**Option 1: Google Drive**
- Download: https://drive.google.com/file/d/14GmRVwhP1pUD_OIhoJU3oATZWTnklhPG/view
- Unzip to `/path/to/GraphAD_data/datasets/`

**Option 2: Hugging Face**
- Download: https://huggingface.co/datasets/xxwu/LLMNodeBed/tree/main
- Unzip to `/path/to/GraphAD_data/datasets/`

### Step 3: Set Data Path

Update the data path in your scripts so that it points at your data root. The framework expects the following layout under it:
```
/path/to/GraphAD_data/
├── datasets/
│   ├── bow/           # BoW embeddings
│   ├── roberta/       # RoBERTa embeddings  
│   ├── MiniLM/        # MiniLM embeddings
│   └── vocab/         # Vocabulary files
└── saved_models/      # Trained model checkpoints
```
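
The layout above can be checked with a few lines of Python before launching longer jobs. This is a minimal sketch, not repository code: `EXPECTED_DIRS` and `missing_dirs` are hypothetical names, and the list simply mirrors the tree shown above. Some folders (for example the embedding folders) may only exist after Step 4 has been run; that is an assumption, not something this README states.

```python
from pathlib import Path
from typing import List

# Sub-directories copied from the layout shown above, relative to the data root.
EXPECTED_DIRS = [
    "datasets/bow",
    "datasets/roberta",
    "datasets/MiniLM",
    "datasets/vocab",
    "saved_models",
]


def missing_dirs(data_root: str) -> List[str]:
    """Return the expected sub-directories that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]


if __name__ == "__main__":
    # Placeholder root; replace with your own data path.
    for rel in missing_dirs("/path/to/GraphAD_data"):
        print(f"[missing] {rel}")
```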

## Quick Start

### Step 4: Generate Embeddings

Generate text embeddings for all datasets and encoders:

```bash
cd Embedding/
bash gen_all.sh
```

This will generate embeddings for:
- **Encoders**: BoW, RoBERTa, MiniLM, Mistral-7B
- **Datasets**: All supported datasets (cora, citeseer, pubmed, etc.)
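
For readers unfamiliar with the BoW encoder, the idea can be illustrated in plain Python. This is a generic sketch and not the logic of `Embedding/embedding.py`: the whitespace tokenization, the vocabulary construction, and the use of raw counts are all assumptions made for illustration, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Map each distinct lowercase whitespace token to a column index (sorted order)."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def bow_vector(text: str, vocab: Dict[str, int]) -> List[int]:
    """Count occurrences of each vocabulary token in text; unknown tokens are ignored."""
    vec = [0] * len(vocab)
    for tok, count in Counter(text.lower().split()).items():
        if tok in vocab:
            vec[vocab[tok]] = count
    return vec
```

Each node's text becomes one fixed-length vector whose dimension equals the vocabulary size, which is why BoW features need a vocabulary file alongside the embeddings.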

### Step 5: Generate Attacks

From the repository root, navigate to the attacks directory:

```bash
cd attacks/
```

#### Structural Attacks

**PGD Attack (Inductive):**
```bash
python gen_attacks_inductive.py \
    --dataset cora \
    --ptb_rate 0.20 \
    --attack pgd \
    --emb_type bow \
    --device 0 \
    --re_split 2
```

**GRBCD Attack (Inductive):**
```bash
python gen_attacks_inductive.py \
    --dataset computer \
    --ptb_rate 0.20 \
    --attack grbcd \
    --emb_type bow \
    --device 0 \
    --re_split 2
```

**STRG Attack (Transductive):**
```bash
python gen_attacks_transductive.py \
    --dataset cora \
    --ptb_rate 0.30 \
    --attack strg \
    --emb_type bow \
    --threshold 0.5 \
    --device 0 \
    --re_split 1
```

**Batch Structural Attacks:**
```bash
# Edit datasets in run_structure_attacks.sh, then run:
bash run_structure_attacks.sh
```
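
For sweeps that the batch script does not cover, the single-run commands above can also be assembled programmatically. The sketch below is hypothetical helper code, not part of the repository: it only uses the flags shown in the PGD and GRBCD examples, defaults to `gen_attacks_inductive.py`, and does not handle extra options such as `--threshold` for STRG. Whether every dataset and attack combination is supported is an assumption to verify before running.

```python
import shlex
from itertools import product
from typing import List


def build_attack_cmd(
    dataset: str,
    attack: str,
    ptb_rate: float,
    emb_type: str = "bow",
    device: int = 0,
    re_split: int = 2,
    script: str = "gen_attacks_inductive.py",
) -> List[str]:
    """Build the argv list for one structural attack run."""
    return [
        "python", script,
        "--dataset", dataset,
        "--ptb_rate", f"{ptb_rate:.2f}",
        "--attack", attack,
        "--emb_type", emb_type,
        "--device", str(device),
        "--re_split", str(re_split),
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for dataset, attack in product(["cora", "computer"], ["pgd", "grbcd"]):
        print(shlex.join(build_attack_cmd(dataset, attack, 0.20)))
```

To execute a command instead of printing it, pass the list to `subprocess.run(cmd, check=True)` from inside the `attacks/` directory.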

#### Text Attacks

**TextFooler Attack (Inductive):**
```bash
python gen_text_attacks_inductive.py \
    --dataset cora \
    --ptb_rate 0.40 \
    --attack textfooler \
    --emb_type MiniLM \
    --device 0 \
    --re_split 2 \
    --seeds 3  # Generates seeds [0,1,2]
```

**TextFooler Attack (Transductive):**
```bash
python gen_text_attacks_transductive.py \
    --dataset cora \
    --ptb_rate 0.80 \
    --attack textfooler \
    --emb_type MiniLM \
    --device 0 \
    --re_split 1 \
    --seeds 3  # Generates seeds [0,1,2]
```

**LLM Attack with GPT-4o-mini (Inductive):**
```bash
python gen_text_attacks_inductive_llm.py \
    --dataset cora \
    --ptb_rate 0.40 \
    --attack gpt \
    --emb_type bow \
    --device 0 \
    --re_split 2 \
    --model_name "gpt-4o-mini" \
    --seeds 3  # Generates seeds [0,1,2]
```

**LLM Attack with GPT-4o-mini (Transductive):**
```bash
python gen_text_attacks_transductive_llm.py \
    --dataset cora \
    --ptb_rate 0.80 \
    --attack gpt \
    --emb_type bow \
    --device 0 \
    --re_split 1 \
    --model_name "gpt-4o-mini" \
    --seeds 3  # Generates seeds [0,1,2]
```


**WTGIA Attack (Advanced Hybrid):**
```bash
# Cora dataset
python gen_wtgia_inductive.py \
    --dataset cora \
    --emb_type bow \
    --injection atdgia \
    --n_inject 60 \
    --n_edges 20 \
    --sp_level 0.15 \
    --eval_robo \
    --verbose

# CiteSeer dataset  
python gen_wtgia_inductive.py \
    --dataset citeseer \
    --emb_type bow \
    --injection atdgia \
    --n_inject 90 \
    --n_edges 10 \
    --sp_level 0.15 \
    --eval_robo \
    --verbose

# PubMed dataset
python gen_wtgia_inductive.py \
    --dataset pubmed \
    --emb_type bow \
    --injection atdgia \
    --n_inject 400 \
    --n_edges 25 \
    --sp_level 0.15 \
    --eval_robo \
    --verbose \
    --batch_size 50
```

**Batch Text Attacks:**
```bash
# TextFooler batch processing
bash run_text_attacks_textfooler.sh

# LLM attacks batch processing  
bash run_text_attacks_llm_ind.sh    # Inductive
bash run_text_attacks_llm_trans.sh  # Transductive
```

### Complete Attack Generation Workflow

Use these existing batch scripts to generate all attacks systematically:

**Batch Structural Attacks:**
```bash
cd attacks/
# Edit run_structure_attacks.sh to configure datasets and attacks
bash run_structure_attacks.sh
```

**Batch Text Attacks:**
```bash
cd attacks/
# TextFooler attacks (inductive and transductive)
bash run_text_attacks_textfooler.sh

# LLM attacks  
bash run_text_attacks_llm_ind.sh    # Inductive
bash run_text_attacks_llm_trans.sh  # Transductive
```

**Guard Attacks (PGD with Cosine Similarity Thresholds):**
```bash
cd attacks/
# Generate PGD attacks with various cosine similarity thresholds for GNNGuard evaluation
bash run_guard_attacks.sh
```

**WTGIA Attacks (Individual Generation Required):**

WTGIA attacks are generated individually rather than through a batch script. From the `attacks/` directory, run the per-dataset commands for Cora, CiteSeer, and PubMed listed under **WTGIA Attack (Advanced Hybrid)** in Step 5.

### Step 6: Run GNN Defense Evaluation

From the repository root, navigate to the defenses directory and run the evaluations:

```bash
cd defenses/
```

**Inductive Setting Evaluation:**
```bash
bash run_evaluation_inductive.sh
```

**Transductive Setting Evaluation:**
```bash
bash run_evaluation_transductive.sh
```

**Text Attack Defense Evaluation:**
```bash
bash run_evaluation_inductive_text.sh      # Inductive text attacks
bash run_evaluation_transductive_text.sh   # Transductive text attacks
```

**WTGIA Defense Evaluation:**
```bash
bash run_evaluation_wtgia.sh
```

**AutoGCN Defense Evaluation:**
```bash
# AutoGCN against structural attacks
bash run_auto_gcn.sh

# AutoGCN against text attacks  
bash run_auto_gcn_text.sh
```

**Individual Model Evaluation:**
```bash
python eval_inductive.py \
    --dataset cora \
    --model gcn \
    --attack pgd \
    --atk_emb_type roberta \
    --def_emb_type roberta \
    --ptb_rate 0.20 \
    --device 0
```

### Step 7: LLM Training and Evaluation

From the repository root, navigate to the LLM_scripts directory:

```bash
cd LLM_scripts/
```

#### InstructionTuning (SFT)

**Clean Training (Transductive):**
```bash
# Run for seeds 0,1,2
for seed in 0 1 2; do
    bash run_sft_trans.sh cora $seed Mistral-7B neighbor_label
done
```

**Clean Training (Inductive):**
```bash
# Run for seeds 0,1,2
for seed in 0 1 2; do
    bash run_sft_ind.sh cora $seed Mistral-7B neighbor
done
```

**Attack Training (Transductive):**
```bash
# Run for seeds 0,1,2
for seed in 0 1 2; do
    bash run_sft_atk_trans.sh cora gpt 0.8 $seed Mistral-7B bow neighbor_label
done
```

**Attack Training (Inductive):**
```bash
# Run for seeds 0,1,2
for seed in 0 1 2; do
    bash run_sft_atk_ind.sh cora gpt 0.4 $seed Mistral-7B bow neighbor
done
```

**Auto Prompt Training (Inductive):**
```bash
# Run for seeds 0,1,2
for seed in 0 1 2; do
    bash run_sft_ind.sh cora $seed Mistral-7B auto
done
```

#### GraphGPT

**Clean Training (Transductive):**
```bash
# Run for seeds 0,1,2
for seed in 0 1 2; do
    bash run_graphgpt_trans.sh cora $seed Mistral-7B
done
```

**Clean Training (Inductive):**
```bash
# Run for seeds 0,1,2
for seed in 0 1 2; do
    bash run_graphgpt_ind.sh cora $seed Mistral-7B
done
```

**Attack Training (Transductive):**
```bash
# Run for seeds 0,1,2
for seed in 0 1 2; do
    bash run_graphgpt_atk_trans.sh cora gpt 0.8 $seed Mistral-7B bow
done
```

**Attack Training (Inductive):**
```bash
# Run for seeds 0,1,2
for seed in 0 1 2; do
    bash run_graphgpt_atk_ind.sh cora gpt 0.4 $seed Mistral-7B bow
done
```

#### LLaGA

**Clean Training (Transductive):**
```bash
# Run for seeds 0,1,2
for seed in 0 1 2; do
    bash run_llaga_trans.sh cora $seed noise 0 Mistral-7B
done
```

**Clean Training (Inductive):**
```bash
# Run for seeds 0,1,2  
for seed in 0 1 2; do
    bash run_llaga_ind.sh cora $seed noise 0 Mistral-7B
done
```

**Attack Training (Transductive and Inductive):**
```bash
# Run for seeds 0,1,2
for seed in 0 1 2; do
    bash run_llaga_atk_trans.sh cora gpt 0.8 $seed Mistral-7B bow noise  # Transductive
    bash run_llaga_atk_ind.sh cora gpt 0.4 $seed Mistral-7B bow noise    # Inductive
done
```
```
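
Each configuration above is run for seeds 0, 1, and 2. If you want to summarize a metric across those runs, a small helper like the one below may be enough. It is a hypothetical sketch: how the scripts store their results is not documented here, so collecting the per-seed values is left to the reader, and the choice of sample standard deviation is an assumption.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation); the deviation is 0.0 for one value."""
    if not values:
        raise ValueError("need at least one value")
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread
```

Call it with one value per seed, for example the three test accuracies of a given model and attack setting, and format the result as `mean ± std`.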


## Project Structure

```
code/
├── README.md                    # This file
├── Embedding/                   # Text embedding generation
│   ├── embedding.py            # Embedding generation script
│   └── gen_all.sh              # Batch embedding generation
├── attacks/                     # Attack generation
│   ├── gen_attacks_*.py        # Structural attack scripts
│   ├── gen_text_attacks_*.py   # Text attack scripts  
│   ├── gen_wtgia_*.py          # WTGIA hybrid attacks
│   ├── run_*.sh                # Batch attack scripts
│   └── text_attack.py          # Text attack utilities
├── defenses/                    # Defense evaluation
│   ├── eval_*.py               # Evaluation scripts
│   ├── run_evaluation_*.sh     # Batch evaluation scripts
│   └── config.yaml             # Defense configuration
├── LLM_scripts/                # LLM training scripts
│   ├── run_sft_*.sh            # InstructionTuning scripts
│   ├── run_graphgpt_*.sh       # GraphGPT scripts
│   └── run_llaga_*.sh          # LLaGA scripts
├── LLMPredictor/               # LLM model implementations
│   ├── InstructionTuning/      # SFT implementation
│   ├── GraphGPT/               # GraphGPT implementation
│   └── LLaGA/                  # LLaGA implementation
├── common/                     # Shared utilities
│   ├── dataloader.py           # Data loading
│   ├── model_path.py           # Model path configuration
│   └── *.py                    # Other utilities
└── GreatX/                     # Graph attack/defense library
```

## Acknowledgments

We acknowledge the following datasets and repositories:

- **LLMNodeBed Dataset**: We thank the authors for providing the comprehensive graph datasets and embeddings available at [HuggingFace](https://huggingface.co/datasets/xxwu/LLMNodeBed) and [Google Drive](https://drive.google.com/file/d/14GmRVwhP1pUD_OIhoJU3oATZWTnklhPG/view).

- **LLMNodeBed Repository**: We express our gratitude for the LLMNodeBed framework and codebase available at https://github.com/WxxShirley/LLMNodeBed.

- **GreatX Library**: This project uses the GreatX library for graph adversarial attacks and defenses. We acknowledge the original authors and maintainers of this excellent toolkit.

