# GOTabPFN (GO-LR + PIDFSegPCA + TabPFN-2.5) — Optuna 5×5 CV Runner

This repo contains:
- **GO-LR / GraphFeatureOrdering**: clusters samples, builds per-cluster feature graphs on CPU, and produces a global feature ordering `Pi_star`.
- **PIDFSegPCA** (`pidf_segpca`): reorders features by `Pi_star`, segments them, then **fits per-segment PC1 on the training split only** to produce compressed tokens.
- **TabPFN25Head**: wraps TabPFN-2.5 classifier in a non-differentiable sklearn-style head.
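
The PIDFSegPCA idea described above (reorder by `Pi_star`, segment, fit PC1 per segment on the training split, then project both splits) can be illustrated with a minimal NumPy sketch. This is not the repo's `pidf_segpca`: the function names `fit_segment_pc1` and `transform_segment_pc1` are hypothetical, segments are assumed to be equal-width, and the M-rule, tau, gamma/beta, bounds, and standardization options are ignored.

```python
import numpy as np


def fit_segment_pc1(X_train, order, n_segments):
    """Fit one PC1 direction per contiguous segment of the reordered features.

    Hypothetical helper: equal-width segments are an assumption of this sketch.
    """
    X = X_train[:, order]
    segments = np.array_split(np.arange(X.shape[1]), n_segments)
    params = []
    for idx in segments:
        mean = X[:, idx].mean(axis=0)
        # The first right-singular vector of the centered block is the PC1 loading.
        _, _, vt = np.linalg.svd(X[:, idx] - mean, full_matrices=False)
        params.append((idx, mean, vt[0]))
    return params


def transform_segment_pc1(X, order, params):
    """Project any split using the means and loadings learned on the training split."""
    X = X[:, order]
    return np.column_stack([(X[:, idx] - mean) @ w for idx, mean, w in params])


rng = np.random.default_rng(0)
X = rng.normal(size=(40, 12))
Pi_star = rng.permutation(12)          # stand-in for the GO-LR ordering
X_tr, X_va = X[:30], X[30:]            # stand-in for one CV fold

params = fit_segment_pc1(X_tr, Pi_star, n_segments=4)
Z_tr = transform_segment_pc1(X_tr, Pi_star, params)
Z_va = transform_segment_pc1(X_va, Pi_star, params)
print(Z_tr.shape, Z_va.shape)          # (30, 4) (10, 4)
```

The validation split is only ever transformed, never fitted, which is what keeps the per-fold compression free of validation statistics.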

The provided Optuna script tunes:
- GO-LR hyperparameters (metric, clusters, refine passes, direction selection)
- PIDFSegPCA hyperparameters (segmentation, M-rule, tau, gamma/beta, bounds, `l_min`, standardization)
- TabPFN internal random seed

Each trial is evaluated with **RepeatedStratifiedKFold (5 splits × 5 repeats = 25 folds)**.

---

## 1) Repo layout (suggested)

```
.
├── gotabpfn.py
├── GOTabPFN_ALLAML.ipynb
├── GOTabPFN_Arcene.ipynb
├── GOTabPFN_Colon_exp.ipynb
├── GOTabPFN_Lung.ipynb
├── GOTabPFN_SMK.ipynb
├── GOTabPFN_TOX.ipynb
├── requirements.txt
└── README.md
```

Where:
- `gotabpfn.py` is your module file (GraphFeatureOrdering + pidf_segpca + TabPFN25Head).
- `run_optuna_allaml.py` (not shown in the tree) is the Python script you create from your “ONE CELL (ALLAML)” code; see section 5.
- `ALLAML_combined_encoded.csv` (not shown in the tree) is your dataset.

> If your dataset is elsewhere, you can keep it outside the repo; just update `DATA_FILE`.

---

## 2) Dataset format

Your CSV must contain:
- One column named exactly `Label` (or update `TARGET_COL`)
- Numeric features in all other columns (already encoded)

Notes:
- For binary tasks, `Label` can be in {0,1} or hold any two distinct values that can be remapped.
- Feature values can be float or int.
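
A quick way to check a file against these requirements is sketched below. It assumes pandas is installed; `load_xy` is a hypothetical helper, not a function from `gotabpfn.py`, and the sorted-label remapping is just one reasonable convention.

```python
import io

import pandas as pd

TARGET_COL = "Label"


def load_xy(csv_source, target_col=TARGET_COL):
    """Load a CSV, validate its layout, and remap labels to 0..C-1.

    Hypothetical helper for illustration only.
    """
    df = pd.read_csv(csv_source)
    if target_col not in df.columns:
        raise ValueError(f"missing target column {target_col!r}")

    X = df.drop(columns=[target_col])
    non_numeric = X.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        raise ValueError(f"non-numeric feature columns: {non_numeric}")

    # Map the distinct label values to consecutive integers in sorted order.
    classes = sorted(df[target_col].unique())
    class_map = {c: i for i, c in enumerate(classes)}
    y = df[target_col].map(class_map).to_numpy()
    return X.to_numpy(dtype=float), y, class_map


# Tiny in-memory example standing in for a real dataset file.
demo_csv = io.StringIO("f1,f2,Label\n1.0,2,AML\n0.5,3,ALL\n2.0,1,AML\n")
X_demo, y_demo, class_map = load_xy(demo_csv)
print(X_demo.shape, y_demo.tolist(), class_map)
```

For a real run you would pass the path in `DATA_FILE` instead of the in-memory buffer.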

---

## 3) Environment setup

### Option A — Conda (recommended)
```bash
conda create -n gotabpfn python=3.10 -y
conda activate gotabpfn
pip install -r requirements.txt
```

### Option B — venv
```bash
python3.10 -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -r requirements.txt
```

---

## 4) GPU notes (important)

Your runner uses:
```python
GPU_ID = 4
torch.cuda.set_device(GPU_ID)
device_str = f"cuda:{GPU_ID}"
```

So if you want GPU 0, set:
```python
GPU_ID = 0
```

### CUDA visibility
If you use multi-GPU nodes and want to restrict visibility:
```bash
CUDA_VISIBLE_DEVICES=4 python run_optuna_allaml.py
```

If you do that, set `GPU_ID = 0` inside Python: the process then sees only one device, so its `cuda:0` is physical GPU 4.
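
The index remapping can be made explicit with a small helper. This is a sketch under stated assumptions: `logical_gpu_id` is a hypothetical name, and it only handles `CUDA_VISIBLE_DEVICES` values written as comma-separated integer indices (not GPU UUIDs).

```python
import os


def logical_gpu_id(physical_id, visible=None):
    """Return the in-process index of `physical_id` under CUDA_VISIBLE_DEVICES.

    Hypothetical helper. `visible` defaults to the current environment value.
    """
    if visible is None:
        visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        # No restriction set: this sketch assumes indices are unchanged.
        return physical_id

    ids = [tok.strip() for tok in visible.split(",") if tok.strip()]
    if str(physical_id) not in ids:
        raise ValueError(f"GPU {physical_id} is not visible (visible: {ids})")
    # The position in the list is the index the process will see.
    return ids.index(str(physical_id))


print(logical_gpu_id(4, visible="4"))    # 0
print(logical_gpu_id(4, visible="2,4"))  # 1
```

You could then set `GPU_ID = logical_gpu_id(4)` rather than editing the value by hand each time the visibility mask changes.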

---

## 5) Running the Optuna tuner

1) Save your “ONE CELL (ALLAML)” code to `run_optuna_allaml.py`.

2) Run:
```bash
python run_optuna_allaml.py
```

You should see prints like:
- dataset shape, number of classes, class map
- GPU availability and selected device
- Optuna progress bar
- best trial metrics and parameters at the end

---

## 6) Key configuration knobs

Inside `run_optuna_allaml.py` you will commonly change:

### Trials / compute budget
```python
N_TRIALS = 150
```

### Dataset / label column
```python
DATA_FILE = "ALLAML_combined_encoded.csv"
TARGET_COL = "Label"
```

### Device
```python
GPU_ID = 4
```

### TabPFN task type
For ALLAML you set:
```python
task_type="binary"
```
If you run a multiclass dataset, set:
```python
task_type="multiclass"
num_classes=<C>
```

---

## 7) What happens per Optuna trial (exactly)

For each trial:
1. **GO-LR is fit once** on **full** `X_all` → `Pi_star` (fixed for all folds of this trial)
2. For each CV fold:
   - PIDFSegPCA is **configured on training split only** (fits per-segment PC1)
   - Train/val are compressed to tokens `Z_tr, Z_va`
   - TabPFN is fit on `Z_tr` and evaluated on `Z_va`
3. Optuna prunes trials based on running mean accuracy (median pruner)

---

## 8) Determinism / reproducibility

You already do:
- global seeding (`seed_everything(SEED)`)
- `GraphFeatureOrdering._set_seed` includes `torch.use_deterministic_algorithms(True, warn_only=True)`

Notes:
- Even with deterministic settings enabled, results can still vary slightly across CUDA/cuDNN versions and hardware.
- TabPFN may have minor nondeterminism depending on version/build; you already tune/choose a `tabpfn_seed`.
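
For reference, a global seeding helper typically looks like the sketch below. It is named `seed_everything_sketch` to make clear that it is an illustration and may differ from the `seed_everything` used by the runner; the PyTorch part is skipped if `torch` is not installed.

```python
import os
import random

import numpy as np


def seed_everything_sketch(seed):
    """Seed the common RNGs. Illustrative stand-in, not the runner's own function."""
    random.seed(seed)
    np.random.seed(seed)
    # Only affects Python processes started after this point, not the current one.
    os.environ["PYTHONHASHSEED"] = str(seed)

    try:
        import torch
    except ImportError:
        return
    torch.manual_seed(seed)
    # Documented as safe to call when CUDA is unavailable.
    torch.cuda.manual_seed_all(seed)


seed_everything_sketch(123)
first = (random.random(), float(np.random.rand()))
seed_everything_sketch(123)
second = (random.random(), float(np.random.rand()))
print(first == second)  # True
```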

---

## 9) Troubleshooting

### A) `ModuleNotFoundError: kmeans_gpu`
GO-LR supports a CPU KMeans fallback. If you don’t have a working `kmeans_gpu` module, you have two options:

**Option 1 (simple): force CPU kmeans**
In `objective`, call:
```python
Pi_star, _, _, _ = go.fit(X_all, seed=SEED, deterministic=True, use_cpu_kmeans=True)
```

**Option 2: provide/install a GPU kmeans**
Ensure `kmeans_gpu.py` exists and exposes:
```python
from kmeans_gpu import KMeans as KMeansGPU
```
(or install whatever package you used originally and adjust the import accordingly).
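
If you want the runner to choose between the two options automatically, one possible approach (an assumption, not something the module is known to do) is to check whether `kmeans_gpu` can be located and derive the `use_cpu_kmeans` flag from that:

```python
import importlib.util

# True if a module named `kmeans_gpu` can be located on the import path.
# Note: this only checks that the module exists, not that it imports cleanly.
HAVE_GPU_KMEANS = importlib.util.find_spec("kmeans_gpu") is not None
use_cpu_kmeans = not HAVE_GPU_KMEANS
print(f"GPU kmeans found: {HAVE_GPU_KMEANS}; use_cpu_kmeans={use_cpu_kmeans}")

# Then, inside `objective`:
# Pi_star, _, _, _ = go.fit(
#     X_all, seed=SEED, deterministic=True, use_cpu_kmeans=use_cpu_kmeans
# )
```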

### B) CUDA OOM during GO-LR or TabPFN
- Reduce `go_num_clusters` range
- Reduce `N_TRIALS`
- Run GO-LR's KMeans on the CPU: `use_cpu_kmeans=True`
- Ensure `cleanup_cuda()` is called (you already do)
- The runner already sets the allocator option below in code; you can also export it in the shell before launching:
```bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

### C) TabPFN v2.5 version mismatch
Your wrapper tries:
```python
TabPFNClassifier.create_default_for_version(ModelVersion.V2_5)
```
and falls back to `TabPFNClassifier(device=...)` if unavailable.

If you get errors around `ModelVersion.V2_5`:
- upgrade `tabpfn` (recommended), or
- rely on the fallback path.

---

## 10) Expected outputs

At the end of the run, you’ll get:
- best mean accuracy across 25 folds
- std dev across folds
- best hyperparameters

You can also add Optuna storage to persist results (SQLite), e.g.:
```python
study = optuna.create_study(
    direction="maximize",
    sampler=sampler,
    pruner=pruner,
    storage="sqlite:///optuna_gotabpfn_allaml.db",
    study_name="gotabpfn_allaml",
    load_if_exists=True,
)
```

---

## 11) Minimal “sanity run” (quick test)

Before a full tune, set:
```python
N_TRIALS = 2
```
and optionally reduce folds:
```python
rkf = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=SEED)
```
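
To confirm how many fits a given setting implies per trial, you can ask the splitter directly. The `SEED` value below is a placeholder assumption; use the one from your runner.

```python
from sklearn.model_selection import RepeatedStratifiedKFold

SEED = 42  # placeholder; use the runner's own SEED

full = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=SEED)
quick = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=SEED)

n_full = full.get_n_splits()
n_quick = quick.get_n_splits()
print(n_full, n_quick)  # 25 5
```

With `N_TRIALS = 2` and one repeat, the sanity run performs at most 10 fold evaluations instead of 25 per trial.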
