# Self-Instruct Pipeline

Complete pipeline for generating training data using self-instruction on Hugging Face model cards.

## Prerequisites

1. **Virtual environment** with all dependencies installed
2. **`.env` file** configured in the project root

## Required Environment Variables

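The `.env` file must define the following variables. The notebooks read them through `load_dotenv`; the shell commands in Steps 3 and 5 also expand them (for example `$SELF_INSTRUCT_ROOT_DATA`), so they must be exported in the shell that runs those commands.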

```bash
WANDB_API_KEY="your_wandb_key"
PARENT_ROOT=/path/to/project/root
DATA_PATH=/path/to/project/root/data/
TASKS_FILE=/path/to/project/root/self-instruct/data/tasks
SELF_INSTRUCT_ROOT_DATA=/path/to/project/root/self-instruct/data

# File names for each step
FILE_NAME_MODEL_CARDS_STEP_1=raw_model_cards.jsonl
FILE_NAME_LEGACY_MODEL_CARDS_STEP_2=raw_legacy_model_cards.jsonl
FILE_NAME_MODEL_CARDS_GENERATED_STEP_3=cleaned_model_cards.jsonl
FILE_NAME_LEGACY_MODEL_CARDS_GENERATED_STEP_3=cleaned_legacy_model_cards.jsonl
FILE_NAME_CLEANED_MODEL_CARDS_STEP_4=validated_model_cards.jsonl
FILE_NAME_LEGACY_CLEANED_MODEL_CARDS_STEP_4=validated_legacy_model_cards.jsonl
FILE_NAME_SELF_INSTRUCTED_MODELS_STEP_5=model_cards_with_queries.jsonl
FILE_NAME_SELF_INSTRUCTED_LEGACY_MODELS_STEP_5=legacy_model_cards_with_queries.jsonl
FILE_NAME_SELF_INSTRUCTED_MODELS_STEP_6=final_model_cards_with_queries.jsonl
FILE_NAME_SELF_INSTRUCTED_LEGACY_MODELS_STEP_6=final_legacy_model_cards_with_queries.jsonl
```
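
Before running any step, it can help to confirm that every variable is defined. The following is a minimal sketch and not part of the pipeline: `missing_variables` is a hypothetical helper, and the list simply mirrors the names shown above.

```python
import os

# Variable names copied from the .env listing above.
REQUIRED_VARIABLES = [
    "WANDB_API_KEY",
    "PARENT_ROOT",
    "DATA_PATH",
    "TASKS_FILE",
    "SELF_INSTRUCT_ROOT_DATA",
    "FILE_NAME_MODEL_CARDS_STEP_1",
    "FILE_NAME_LEGACY_MODEL_CARDS_STEP_2",
    "FILE_NAME_MODEL_CARDS_GENERATED_STEP_3",
    "FILE_NAME_LEGACY_MODEL_CARDS_GENERATED_STEP_3",
    "FILE_NAME_CLEANED_MODEL_CARDS_STEP_4",
    "FILE_NAME_LEGACY_CLEANED_MODEL_CARDS_STEP_4",
    "FILE_NAME_SELF_INSTRUCTED_MODELS_STEP_5",
    "FILE_NAME_SELF_INSTRUCTED_LEGACY_MODELS_STEP_5",
    "FILE_NAME_SELF_INSTRUCTED_MODELS_STEP_6",
    "FILE_NAME_SELF_INSTRUCTED_LEGACY_MODELS_STEP_6",
]


def missing_variables(environ=None):
    """Return the required variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]
```

Calling `missing_variables()` after `load_dotenv(...)` returns an empty list when everything is set.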

## Pipeline Overview

### Step 0: Setup

```bash
cd self-instruct/code
mkdir -p logs
```

### Step 1: Download Model Cards (Notebook)

**File**: `step_1_download_model_hub.ipynb`

**Description**: Downloads recent model cards from the Hugging Face Hub using the API.

**Execution**:
1. Open the notebook in Jupyter or VS Code
2. Ensure `.env` is loaded in the first cell:
   ```python
   from dotenv import load_dotenv
   load_dotenv("/path/to/project/.env")
   ```
3. Run all cells

**Output**: `$SELF_INSTRUCT_ROOT_DATA/$FILE_NAME_MODEL_CARDS_STEP_1` (raw_model_cards.jsonl)

**Notes**: 
- Filters models by specific tasks
- Removes auto-generated model cards
- Saves in JSONL format
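
The three notes above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: the record fields `pipeline_tag` and `card`, the helper names, and the marker string used to spot auto-generated cards are all assumptions.

```python
import json

# Assumption: text that the Hugging Face Trainer places at the top of cards it
# generates automatically. The notebook may detect such cards differently.
AUTO_GENERATED_MARKER = "This model card has been generated automatically"


def keep_card(record, allowed_tasks):
    """Return True if the record has an allowed task and a hand-written card."""
    if record.get("pipeline_tag") not in allowed_tasks:
        return False
    return AUTO_GENERATED_MARKER not in record.get("card", "")


def write_jsonl(records, path):
    """Write one JSON object per line, without a trailing newline."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(json.dumps(r, ensure_ascii=False) for r in records))
```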

---

### Step 2: Download Legacy Model Cards (Notebook)

**File**: `step_2_download_model_hub_legacy_models.ipynb`

**Description**: Downloads model cards for older models (created before 2025) from the Hugging Face Hub.

**Execution**:
1. Open the notebook
2. Ensure `.env` is loaded in the first cell:
   ```python
   from dotenv import load_dotenv
   load_dotenv("/path/to/project/.env")
   ```
3. Run all cells

**Output**: `$SELF_INSTRUCT_ROOT_DATA/$FILE_NAME_LEGACY_MODEL_CARDS_STEP_2` (raw_legacy_model_cards.jsonl)

**Notes**:
- Filters legacy models with creation date < 2025
- Removes duplicate model cards
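
A minimal sketch of the two filters, assuming each record stores its creation date as an ISO 8601 string under `created_at` and its text under `card` (both field names are hypothetical, as are the helper names). Duplicates are detected here by exact card text, which may differ from the notebook's criterion.

```python
from datetime import datetime, timezone

# Models created strictly before this instant count as legacy.
CUTOFF = datetime(2025, 1, 1, tzinfo=timezone.utc)


def is_legacy(record):
    """Return True if the model was created before 2025 (UTC)."""
    created = datetime.fromisoformat(record["created_at"])
    if created.tzinfo is None:
        created = created.replace(tzinfo=timezone.utc)  # treat naive values as UTC
    return created < CUTOFF


def drop_duplicate_cards(records):
    """Keep the first record for each distinct card text."""
    seen = set()
    unique = []
    for record in records:
        key = record["card"].strip()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```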

---

### Step 3: Clean Model Cards (Python Script)

**File**: `step_3_clean_model_cards.py`

**Description**: Uses an LLM (Qwen) served through vLLM to clean and normalize model cards by removing tables, code blocks, and markdown artifacts.

**Execution**:

For **recent models**:
```bash
python step_3_clean_model_cards.py \
    --model_cards_path $SELF_INSTRUCT_ROOT_DATA/$FILE_NAME_MODEL_CARDS_STEP_1 \
    --output_path $SELF_INSTRUCT_ROOT_DATA/$FILE_NAME_MODEL_CARDS_GENERATED_STEP_3 \
    --model_path /path/to/llm/model
```

For **legacy models**:
```bash
python step_3_clean_model_cards.py \
    --model_cards_path $SELF_INSTRUCT_ROOT_DATA/$FILE_NAME_LEGACY_MODEL_CARDS_STEP_2 \
    --output_path $SELF_INSTRUCT_ROOT_DATA/$FILE_NAME_LEGACY_MODEL_CARDS_GENERATED_STEP_3 \
    --model_path /path/to/llm/model
```

**Required Parameters**:
- `--model_cards_path`: Path to input JSONL file with raw model cards
- `--output_path`: Path to save cleaned model cards
- `--model_path`: Path to the LLM model directory
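
For orientation, the three required parameters could be declared with `argparse` as sketched below. This is an assumption about the command-line interface, not the script's actual code; `build_parser` is a hypothetical name, and the real script may define further options.

```python
import argparse


def build_parser():
    """Build a parser for the three required parameters listed above."""
    parser = argparse.ArgumentParser(description="Clean raw model cards with an LLM.")
    parser.add_argument("--model_cards_path", required=True,
                        help="Input JSONL file with raw model cards")
    parser.add_argument("--output_path", required=True,
                        help="Where to save the cleaned model cards")
    parser.add_argument("--model_path", required=True,
                        help="Directory containing the LLM")
    return parser
```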

**Output**: 
- Recent models: `$SELF_INSTRUCT_ROOT_DATA/$FILE_NAME_MODEL_CARDS_GENERATED_STEP_3` (cleaned_model_cards.jsonl)
- Legacy models: `$SELF_INSTRUCT_ROOT_DATA/$FILE_NAME_LEGACY_MODEL_CARDS_GENERATED_STEP_3` (cleaned_legacy_model_cards.jsonl)

---

### Step 4: Manual Cleaning (Notebook)

**File**: `step_4_manual_cleaning_model_cards.ipynb`

**Description**: Manual cleaning and validation of processed model cards. Removes problematic entries.

**Configuration**:
In the first cell, **manually change** the environment variable name based on which dataset you want to process:

**For recent models**:
```python
path_model_cards_generated = os.path.join(
    os.getenv("SELF_INSTRUCT_ROOT_DATA"),
    os.getenv("FILE_NAME_MODEL_CARDS_GENERATED_STEP_3"),  # Input: cleaned_model_cards.jsonl
)
```

**For legacy models**:
```python
path_model_cards_generated = os.path.join(
    os.getenv("SELF_INSTRUCT_ROOT_DATA"),
    os.getenv("FILE_NAME_LEGACY_MODEL_CARDS_GENERATED_STEP_3"),  # Input: cleaned_legacy_model_cards.jsonl
)
```

Also update the output file in the last cell:

**For recent models**:
```python
cleaned_file = os.path.join(
    os.getenv("SELF_INSTRUCT_ROOT_DATA"),
    os.getenv("FILE_NAME_CLEANED_MODEL_CARDS_STEP_4"),  # Output: validated_model_cards.jsonl
)
```

**For legacy models**:
```python
cleaned_file = os.path.join(
    os.getenv("SELF_INSTRUCT_ROOT_DATA"),
    os.getenv("FILE_NAME_LEGACY_CLEANED_MODEL_CARDS_STEP_4"),  # Output: validated_legacy_model_cards.jsonl
)
```

**Execution**:
1. Open the notebook
2. **Important**: Manually edit the environment variable names in the first and last cells
3. Run all cells
4. Manually review outputs and remove incomplete or malformed model cards
5. Repeat with the other variable names if you have both recent and legacy models

**Output**: 
- `$SELF_INSTRUCT_ROOT_DATA/$FILE_NAME_CLEANED_MODEL_CARDS_STEP_4` (validated_model_cards.jsonl)
- `$SELF_INSTRUCT_ROOT_DATA/$FILE_NAME_LEGACY_CLEANED_MODEL_CARDS_STEP_4` (validated_legacy_model_cards.jsonl)
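
To narrow down which cards need a manual look, a few simple heuristics can flag likely problems. The sketch below is only a suggestion: `review_flags`, the 200-character threshold, and the checks themselves are assumptions and are not taken from the notebook.

```python
FENCE = "`" * 3  # markdown code-fence delimiter


def review_flags(card_text, min_chars=200):
    """Return reasons why a cleaned card deserves manual review."""
    flags = []
    text = card_text.strip()
    if len(text) < min_chars:
        flags.append("too short")  # possibly incomplete
    if text.count(FENCE) % 2 == 1:
        flags.append("unclosed code fence")  # possibly malformed
    if "|---" in text or "| ---" in text:
        flags.append("leftover table")  # Step 3 should have removed tables
    return flags
```

An empty list means none of the heuristics fired; it does not replace the manual review.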

---

### Step 5: Self-Instruct Generation (Python Script)

**File**: `step_5_self_instruct.py`

**Description**: Generates 20 diverse user queries for each model using self-instruction with vLLM.

**Execution**:

For **recent models**:
```bash
python step_5_self_instruct.py \
    --model_cards_path $SELF_INSTRUCT_ROOT_DATA/$FILE_NAME_CLEANED_MODEL_CARDS_STEP_4 \
    --output_path $SELF_INSTRUCT_ROOT_DATA/$FILE_NAME_SELF_INSTRUCTED_MODELS_STEP_5 \
    --model_path /path/to/llm/model
```

For **legacy models**:
```bash
python step_5_self_instruct.py \
    --model_cards_path $SELF_INSTRUCT_ROOT_DATA/$FILE_NAME_LEGACY_CLEANED_MODEL_CARDS_STEP_4 \
    --output_path $SELF_INSTRUCT_ROOT_DATA/$FILE_NAME_SELF_INSTRUCTED_LEGACY_MODELS_STEP_5 \
    --model_path /path/to/llm/model
```

**Required Parameters**:
- `--model_cards_path`: Path to input JSONL file with validated model cards
- `--output_path`: Path to save model cards with generated queries
- `--model_path`: Path to the LLM model directory

**Output**:
- Recent models: `$SELF_INSTRUCT_ROOT_DATA/$FILE_NAME_SELF_INSTRUCTED_MODELS_STEP_5` (model_cards_with_queries.jsonl)
- Legacy models: `$SELF_INSTRUCT_ROOT_DATA/$FILE_NAME_SELF_INSTRUCTED_LEGACY_MODELS_STEP_5` (legacy_model_cards_with_queries.jsonl)
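
Since each model is expected to receive 20 queries, a quick count can reveal records where generation fell short. This is a minimal sketch that assumes the queries are stored as a list under a `queries` field; that field name and the helper name are hypothetical.

```python
import json

EXPECTED_QUERIES = 20  # number of queries per model stated in the description


def records_with_wrong_query_count(lines, field="queries"):
    """Yield (line_number, count) for records without exactly 20 queries."""
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        count = len(record.get(field, []))
        if count != EXPECTED_QUERIES:
            yield line_number, count
```

The function accepts any iterable of lines, so an open file handle can be passed directly.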

---

### Step 6: Manual Cleaning Prompts (Notebook)

**File**: `step_6_manual_cleaning_prompts_self_instruct.ipynb`

**Description**: Manual cleaning of generated queries, removing duplicates and problematic queries.

**Configuration**:
In the first cell, **manually change** the environment variable name based on which dataset you want to process:

**For recent models**:
```python
file = os.path.join(
    os.getenv("SELF_INSTRUCT_ROOT_DATA"),
    os.getenv("FILE_NAME_SELF_INSTRUCTED_MODELS_STEP_5"),  # Input: model_cards_with_queries.jsonl
)
```

**For legacy models**:
```python
file = os.path.join(
    os.getenv("SELF_INSTRUCT_ROOT_DATA"),
    os.getenv("FILE_NAME_SELF_INSTRUCTED_LEGACY_MODELS_STEP_5"),  # Input: legacy_model_cards_with_queries.jsonl
)
```

Also update the output file in the last cell:

**For recent models**:
```python
output_file = os.path.join(
    os.getenv("SELF_INSTRUCT_ROOT_DATA"),
    os.getenv("FILE_NAME_SELF_INSTRUCTED_MODELS_STEP_6"),  # Output: final_model_cards_with_queries.jsonl
)
```

**For legacy models**:
```python
output_file = os.path.join(
    os.getenv("SELF_INSTRUCT_ROOT_DATA"),
    os.getenv("FILE_NAME_SELF_INSTRUCTED_LEGACY_MODELS_STEP_6"),  # Output: final_legacy_model_cards_with_queries.jsonl
)
```

**Execution**:
1. Open the notebook
2. **Important**: Manually edit the environment variable names in the first and last cells
3. Run all cells
4. Manually review and filter generated queries
5. Repeat with the other variable names if you have both recent and legacy models

**Output**:
- `$SELF_INSTRUCT_ROOT_DATA/$FILE_NAME_SELF_INSTRUCTED_MODELS_STEP_6` (final_model_cards_with_queries.jsonl)
- `$SELF_INSTRUCT_ROOT_DATA/$FILE_NAME_SELF_INSTRUCTED_LEGACY_MODELS_STEP_6` (final_legacy_model_cards_with_queries.jsonl)
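
Duplicate removal can be approximated by comparing queries after light normalization. The sketch below is one possible approach, not the notebook's implementation; `normalize` and `deduplicate_queries` are hypothetical helpers, and the normalization rules are an assumption.

```python
import re


def normalize(query):
    """Lowercase, collapse whitespace, and drop trailing punctuation."""
    return re.sub(r"\s+", " ", query).strip().lower().rstrip("?.! ")


def deduplicate_queries(queries):
    """Keep the first occurrence of each query, comparing normalized text."""
    seen = set()
    unique = []
    for query in queries:
        key = normalize(query)
        if key and key not in seen:  # also drops empty queries
            seen.add(key)
            unique.append(query)
    return unique
```

Near-duplicates that differ in wording still need the manual review described above.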

---

### Step 7: Create Experience 3 & 4 Datasets (Notebook)

**File**: `step_7_create_exp_3_4.ipynb`

**Description**: Combines processed data and creates final datasets for Experience 3 and 4, with train/val/test splits.

**Execution**:
1. Open the notebook
2. Run all cells

**Output**: Final datasets for continual learning experiments (Experience 3 and 4).

**Notes**: Use `pd.to_datetime(..., utc=True)` to handle mixed timezones.
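
A reproducible train/val/test split can be sketched as follows. The 80/10/10 proportions, the seed, and the function name are assumptions chosen for illustration; the notebook may split differently, for example by model or by date.

```python
import random


def split_dataset(records, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle records reproducibly and split them into train/val/test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed gives a stable split
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test
```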

---



## Troubleshooting

### Error: `TypeError: join() argument must be str, bytes, or os.PathLike object, not 'NoneType'`

**Cause**: Environment variables are not loaded in the notebook, so `os.getenv` returns `None` and `os.path.join` fails.

**Solution**: Add at the beginning of each notebook:
```python
from dotenv import load_dotenv
import os
load_dotenv("/path/to/project/.env")
```
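
To make this failure easier to diagnose, the path construction can be wrapped in a helper that names the missing variable. This is a hedged sketch; `env_path` is a hypothetical helper and does not exist in the notebooks.

```python
import os


def env_path(root_var, file_var):
    """Join two environment variables into a path, failing with a clear message."""
    values = {}
    for name in (root_var, file_var):
        value = os.getenv(name)
        if not value:
            raise RuntimeError(
                f"Environment variable {name} is not set; check that .env was loaded"
            )
        values[name] = value
    return os.path.join(values[root_var], values[file_var])
```

For example, `env_path("SELF_INSTRUCT_ROOT_DATA", "FILE_NAME_MODEL_CARDS_STEP_1")` builds the Step 1 output path.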

### Error: CUDA Out of Memory

**Cause**: The model does not fit in the memory of the available GPUs.

**Solutions**:
- Reduce `--max_model_len` in the Python script
- Request more GPUs (for example `--gres=gpu:4` in the SLURM script) and add the `--tensor_parallel_size` flag to distribute the model across them
- Use a smaller model

### Error: JSONDecodeError

**Cause**: Corrupted JSONL file or empty lines at the end.

**Solution**: Write the JSONL file without a trailing newline:
```python
import json

# Joining with "\n" puts a newline between records but not after the last one.
with open(output_file, "w", encoding="utf-8") as f:
    f.write("\n".join(json.dumps(item) for item in items))
```
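
On the reading side, skipping blank lines avoids the error for files that already contain them, and reporting the line number helps locate genuinely corrupted records. The following is a minimal sketch; `read_jsonl` is a hypothetical helper, not a function from the pipeline.

```python
import json


def read_jsonl(path):
    """Parse a JSONL file, skipping blank lines and reporting the failing line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # blank lines are not valid JSON, so skip them
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}, line {line_number}: {exc}") from exc
    return records
```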

### Error: Mixed timezone values in pandas

**Cause**: DateTime columns with mixed timezone-aware and timezone-naive values.

**Solution**: Use `utc=True` when converting to datetime:
```python
pd.to_datetime(column, utc=True)
```

