# iLoRA Fine-tuning on LLMs

This repository fine-tunes LLMs with iLoRA (Bayesian Low-Rank Adaptation with Latent Interaction Graphs) on two tasks: **Task1** (machine reading comprehension on Molweni with Llama 3.1-8B) and **Task2** (disease classification from microbiome data with Qwen3-8B).

---

## Setup

```bash
# Create environment
python -m venv .venv && source .venv/bin/activate
# Or: conda create -n ilora python=3.10 && conda activate ilora

# Install PyTorch (choose your CUDA version)
pip install --upgrade pip
pip install torch==2.5.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install torch-geometric extensions (the CUDA tag must match the torch build above)
pip install torch-geometric pyg-lib torch-scatter torch-sparse torch-cluster torch-spline-conv \
  -f https://data.pyg.org/whl/torch-2.5.0+cu121.html

# Install remaining dependencies
pip install -r requirements.txt
```

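**CUDA selection**: Adjust the torch installation URL to your CUDA version, and use the same tag (`cu118`, `cu121`, or `cpu`) in the `data.pyg.org` URL for the torch-geometric extensions: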
- CUDA 11.8: `--index-url https://download.pytorch.org/whl/cu118`
- CUDA 12.1: `--index-url https://download.pytorch.org/whl/cu121`
- CPU only: `--index-url https://download.pytorch.org/whl/cpu`
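
To confirm that the installed torch build and the torch-geometric wheel index agree, a small check along the following lines can help. This is a sketch, not part of the repository: `pyg_wheel_url` and `describe_install` are hypothetical helpers, and the sketch assumes the PyG index page is named after the torch major.minor release with patch `0` (as in the `torch-2.5.0+…` URL used in the install commands).

```python
import re


def pyg_wheel_url(torch_version: str) -> str:
    """Build the PyG wheel index URL for a version string such as '2.5.1+cu121'.

    Assumption: the index page uses the torch major.minor release with patch 0.
    """
    match = re.fullmatch(r"(\d+)\.(\d+)\.\d+(?:\+([\w.]+))?", torch_version)
    if match is None:
        raise ValueError(f"unrecognized torch version: {torch_version!r}")
    major, minor, build = match.groups()
    build = build or "cpu"  # assumption: no local tag means a CPU-only build
    return f"https://data.pyg.org/whl/torch-{major}.{minor}.0+{build}.html"


def describe_install() -> str:
    """Summarize the local torch install and the wheel index that matches it."""
    import torch  # imported lazily so pyg_wheel_url works without torch

    return (
        f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}\n"
        f"PyG wheels: {pyg_wheel_url(torch.__version__)}"
    )
```

Calling `print(describe_install())` inside the activated environment shows which `-f` URL corresponds to the torch build that was actually installed.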

---

## Task 1: Molweni MRC

1. **Download data**:
   ```bash
   git clone https://github.com/HIT-SCIR/Molweni.git Molweni-main
   ```

2. **Update `Task1/config.yaml`**:
   ```yaml
   model_name: "path/to/Meta-Llama-3.1-8B-Instruct"
   original_dataset_path: "../Molweni-main/MRC(withDiscourse)"
   save_dataset_path: "../Molweni_LLM_dataset"
   ```

3. **Create dataset**:
   ```bash
   python Task1/datasetCreateMolweni.py
   ```

4. **Train**:
   ```bash
   python Task1/main_ilora.py
   ```
   Checkpoints: `Task1/checkpoints/best_model_iLoRA`
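
Before creating the dataset, it can be useful to check that the paths in `Task1/config.yaml` resolve to something on disk. The sketch below is illustrative only: `load_config` and `missing_paths` are hypothetical helpers, PyYAML is assumed to be installed, and the directory that relative paths are resolved against (`base_dir`) is an assumption that should be adjusted to how the scripts actually resolve them.

```python
from pathlib import Path


def load_config(config_path):
    """Read a YAML config file into a dict (requires PyYAML)."""
    import yaml  # imported lazily so missing_paths works without PyYAML

    with open(config_path, "r", encoding="utf-8") as handle:
        return yaml.safe_load(handle)


def missing_paths(config, keys, base_dir="."):
    """Return the config keys that are unset or whose paths do not exist.

    Relative paths are resolved against base_dir; absolute paths are used as-is.
    """
    base = Path(base_dir)
    missing = []
    for key in keys:
        value = config.get(key)
        if value is None or not (base / value).exists():
            missing.append(key)
    return missing
```

For example, `missing_paths(load_config("Task1/config.yaml"), ["original_dataset_path"], base_dir="Task1")` returns an empty list when the Molweni directory is found. `model_name` is left out of the check here because it may be a model identifier rather than a local path.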

---

## Task 2: IBD Classification

1. **Data** (included): `dataset/raw_data_uc_cd/` contains 8 cohorts.

   **Data sources** (BioProject IDs):
   | Cohort | BioProject | Link |
   |--------|-----------|------|
   | Ananthakrishnan_2017 | PRJNA384246 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA384246 |
   | Franzosa_2019B/N | PRJNA400072 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA400072 |
   | Khachatryan_2023 | PRJNA893901 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA893901 |
   | Kumbhari_2024 | PRJNA993675 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA993675 |
   | Lee_2021 | PRJNA685168 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA685168 |
   | Lloyd-Price_2019 | PRJNA398089 | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA398089 |
   | Ning_2023 | PRJCA017408 | https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA017408 |

   Only one dataset is kept here, for quick testing.

2. **Update `configs/yes_no_ilora_uc_cd.yaml`**:
   ```yaml
   model_name: "path/to/Qwen3-8B"
   save_dataset_path: "dataset/processed_data/IBD_UC_CD_yes_no"
   ```

3. **Create dataset**:
   ```bash
   python Task2/src/datasetCreate/yes_no_datasetCreate_uc_cd.py \
     --config_path configs/yes_no_datasetCreate_uc_cd.yaml
   ```

4. **Train**:
   ```bash
   python Task2/src/ilora/main_ilora.py \
     --config_path configs/yes_no_ilora_uc_cd.yaml
   ```
   Checkpoints: `outputs/yes_no_ilora_uc_cd/checkpoints/`

---

## Configuration

**Task1** (`Task1/config.yaml`): Model path, dataset paths.

**Task2** (`configs/yes_no_ilora_uc_cd.yaml`):
```yaml
lora:
  r: 16                      # LoRA rank
  lora_alpha: 32
  target_modules: [q_proj, v_proj, k_proj, o_proj]

training:
  batch_size: 2
  num_train_epochs: 6
  learning_rate: 0.0002
  ilora_loss_weight_laplace: 1e-3
  ilora_loss_weight_binomial: 1e-3
  use_ilora: true
```
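
One detail worth checking when reading this config: if it is loaded with PyYAML, values written as `1e-3` (no decimal point) come back as strings rather than floats, because PyYAML's default resolver only recognizes floats that contain a dot. The sketch below shows one way to normalize such values; it is not taken from the repository, `normalize_training` and `lora_scaling` are hypothetical helpers, and the `lora_alpha / r` factor assumes the standard LoRA scaling.

```python
FLOAT_KEYS = (
    "learning_rate",
    "ilora_loss_weight_laplace",
    "ilora_loss_weight_binomial",
)


def normalize_training(training, float_keys=FLOAT_KEYS):
    """Return a copy of the training section with the listed keys cast to float."""
    normalized = dict(training)
    for key in float_keys:
        if key in normalized:
            normalized[key] = float(normalized[key])
    return normalized


def lora_scaling(lora):
    """Scaling applied to the low-rank update under standard LoRA: lora_alpha / r."""
    return lora["lora_alpha"] / lora["r"]


# Values as PyYAML would typically return them for the config shown above.
training = normalize_training(
    {
        "learning_rate": 0.0002,
        "ilora_loss_weight_laplace": "1e-3",
        "ilora_loss_weight_binomial": "1e-3",
    }
)
scaling = lora_scaling({"r": 16, "lora_alpha": 32})
print(training["ilora_loss_weight_laplace"], scaling)
```

Writing the weights as `1.0e-3` or `0.001` in the YAML file avoids the issue without any conversion step.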

---

## Troubleshooting

| Issue | Solution |
|-------|----------|
| `ModuleNotFoundError` | Run from repo root: `python Task1/main_ilora.py` |
| `CUDA out of memory` | Reduce `batch_size` or `max_seq_length` in config |
| Model not found | Ensure path in config is correct (absolute or repo-relative) |
| Data path errors | Verify that the data directories exist with `ls Molweni-main/MRC\(withDiscourse\)/` or `ls dataset/raw_data_uc_cd/` |

---

## Project Structure

```
├── Task1/               # Molweni MRC
├── Task2/src/          # IBD classification
├── configs/            # Task2 configs
├── dataset/            # Raw & processed data
├── run/ & utils/       # Training framework
└── requirements.txt
```
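
Since the commands in this README are meant to be run from the repository root, a quick check that the current directory has the layout above can rule out path and import errors early. The snippet is a sketch with a hypothetical helper name (`missing_entries`); the expected entries are taken from the tree shown here.

```python
from pathlib import Path

# Top-level entries listed in the project structure above.
EXPECTED_ENTRIES = ("Task1", "Task2", "configs", "dataset", "requirements.txt")


def missing_entries(root="."):
    """Return the expected top-level entries that are absent under root."""
    root = Path(root)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
```

An empty result from `missing_entries()` suggests the working directory is the repository root; any names it returns point to a wrong directory or an incomplete checkout.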

---

## Dependencies

See `requirements.txt`. Key packages: torch, transformers, peft, datasets, accelerate, scikit-learn.

---

## Attribution

Includes code from **Wang-ML-Lab/bayesian-peft** (MIT License). See `THIRD_PARTY_NOTICES.txt`.
