# Language Models are Injective and Hence Invertible

## SLURM Setup and Reproducibility

### Environment, Models, and Datasets

#### Create Directories
```sh
mkdir -p "$SCRATCH/SIP-It/data"
ln -s "$SCRATCH/SIP-It/data" ./data

export HF_HOME="$SCRATCH/hf_cache"
mkdir -p "$HF_HOME"

# Add your Hugging Face token here
export HF_TOKEN=...
```

#### Download Models

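We recommend caching models in advance to avoid repeated downloads on HPC systems. The script below imports `huggingface_hub`, so run it with an interpreter that has the package installed. Some of the listed repositories (for example the Gemma and Llama models) are gated on the Hugging Face Hub and need a token with access granted.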

```sh
python3.11 - <<'PY'
from huggingface_hub import snapshot_download

# === LIST OF REPOS TO CACHE ===
REPOS = [
    "roneneldan/TinyStories-1M",
    "roneneldan/TinyStories-8M",
    "roneneldan/TinyStories-33M",
    "openai-community/gpt2",
    "openai-community/gpt2-medium",
    "openai-community/gpt2-large",
    "google/gemma-3-1b-pt",
    "google/gemma-3-4b-pt",
    "google/gemma-3-12b-pt",
    "microsoft/Phi-4-mini-instruct",
    "mistralai/Mistral-7B-v0.1",
    "meta-llama/Llama-3.1-8B",
    "microsoft/deberta-v3-base",
]

# Download full snapshots into the HF cache
for repo in REPOS:
    print(f"\n>>> Downloading: {repo}")
    snapshot_download(
        repo_id=repo,
        # No local_dir is passed, so files go to the HF cache under $HF_HOME
        # Optional: revision="..." for specific commits
        # Optional: allow_patterns=[...] to save space
    )
print("\nAll requested repos cached.")
PY
```
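Before submitting jobs it can help to confirm that every repo actually landed in the cache. The following is a minimal sketch using only the standard library. It assumes the default hub layout, where model repos are stored under `$HF_HOME/hub/models--<org>--<name>/snapshots/`; the helper names `cache_dir_for` and `missing_repos` are illustrative and not part of this repository.

```python
import os
from pathlib import Path


def cache_dir_for(repo_id: str, hf_home: str) -> Path:
    """Return the cache folder expected for a model repo (assumed default layout)."""
    # Model repos are stored as <HF_HOME>/hub/models--<org>--<name>
    return Path(hf_home) / "hub" / ("models--" + repo_id.replace("/", "--"))


def missing_repos(repo_ids, hf_home):
    """List the repos that have no snapshot folder in the cache."""
    missing = []
    for repo_id in repo_ids:
        snapshots = cache_dir_for(repo_id, hf_home) / "snapshots"
        # A repo counts as cached if at least one snapshot folder exists
        if not snapshots.is_dir() or not any(snapshots.iterdir()):
            missing.append(repo_id)
    return missing


if __name__ == "__main__":
    hf_home = os.environ.get("HF_HOME", os.path.expanduser("~/.cache/huggingface"))
    # Subset of the repos listed above; extend as needed
    repos = ["openai-community/gpt2", "roneneldan/TinyStories-1M"]
    print("Missing from cache:", missing_repos(repos, hf_home))
```

This only checks that a snapshot folder exists, not that every file in it was downloaded completely. If `HF_HUB_CACHE` is set, the cache lives there instead of `$HF_HOME/hub`, and the sketch would need adjusting.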

#### Software Environment

* **Python Version:** 3.11.7
* **CUDA Version:** 12.2

To create and activate the virtual environment:

```sh
python3.11 -m venv .sipit_py3_11_7
source .sipit_py3_11_7/bin/activate
pip install -r requirements.txt
```
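A quick sanity check of the activated environment can catch version mismatches before a job is queued. This is a minimal sketch, not part of the repository: it compares only the major.minor Python version against the one listed above, and it assumes PyTorch is among the installed requirements when reporting CUDA (it returns `None` otherwise).

```python
import sys

# Version listed in this README; only major.minor is compared.
EXPECTED_PYTHON = (3, 11)


def python_matches(version_info=sys.version_info, expected=EXPECTED_PYTHON):
    """True when the interpreter's major.minor equals the expected pair."""
    return tuple(version_info[:2]) == tuple(expected)


def cuda_version():
    """CUDA version PyTorch was built against, or None if unavailable.

    Assumption: PyTorch is installed via requirements.txt.
    """
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda


if __name__ == "__main__":
    status = "OK" if python_matches() else "MISMATCH"
    print("Python", sys.version.split()[0], status)
    print("CUDA (per PyTorch build):", cuda_version())
```

Note that `torch.version.cuda` reports the toolkit version PyTorch was compiled with, which can differ from the driver-level CUDA version shown by `nvidia-smi`.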

#### Fetch Datasets

```sh
HF_HUB_OFFLINE=0 python3.11 src/ablations/create_datasets.py
```

---

### Experiments

All experiments were run on a single GPU of the following type:

```sh
$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM-64GB           On  | 00000000:C8:00.0 Off |                    0 |
| N/A   43C    P0              61W / 461W |      2MiB / 65536MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
```

#### Dataset Experiments

```sh
./scripts/dataset_exp.sh
./scripts/find_close.sh --top-k 101000 --time 10:00:00
python3.11 ./src/ablations/verify_no_collisions.py \
    --csv-path data/dataset_exp/<model name> \
    --model-id <model id> --pattern "*.csv"
```
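The verification step takes a per-model directory and a model ID, so it has to be repeated for each model. The following is a minimal sketch of how the commands could be generated, assuming a hypothetical mapping from output directory names to model IDs (the directory name `gpt2` is a guess, not taken from the repository). The pattern is kept as a literal argument so that the shell does not expand it first; presumably the script applies it inside `--csv-path`.

```python
import shlex

SCRIPT = "./src/ablations/verify_no_collisions.py"


def build_verify_command(model_name, model_id, pattern="*.csv",
                         root="data/dataset_exp"):
    """Build the argv list for one verification run."""
    return [
        "python3.11", SCRIPT,
        "--csv-path", f"{root}/{model_name}",
        "--model-id", model_id,
        "--pattern", pattern,  # literal glob, not expanded here
    ]


# Hypothetical directory-name -> model-ID mapping; adjust to the actual
# folders produced under data/dataset_exp.
MODELS = {
    "gpt2": "openai-community/gpt2",
}

if __name__ == "__main__":
    for name, model_id in MODELS.items():
        # shlex.join quotes arguments such as the glob for safe copy-pasting
        print(shlex.join(build_verify_command(name, model_id)))
```

The argv lists can also be passed directly to `subprocess.run(cmd, check=True)`, which bypasses the shell entirely and so avoids any quoting concerns.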

#### Exhaustive Experiments

```sh
./scripts/exhaustive.sh
./scripts/find_close_exhaustive.sh
python3.11 ./src/ablations/verify_no_collisions.py \
    --csv-path data/dataset_exp/<model name> \
    --model-id <model id> --pattern "*.csv"
```

#### Sequence Length Experiments

```sh
./scripts/seq_length.sh
./scripts/find_close_seq.sh
python3.11 ./src/ablations/verify_no_collisions.py \
    --csv-path data/dataset_exp/<model name> \
    --model-id <model id> --pattern "zeros*"
```

#### SIP-It Accuracy Experiments

```sh
./scripts/sip-it-correctness.sh
```

#### SIP-It Baselines Experiments

```sh
./scripts/sip-it-exhaustive.sh
./scripts/pes.sh
```

#### SIP-It Ablations Experiments

```sh
./scripts/sip-it-layer-ablation.sh
./scripts/sip-it-random.sh
```

