# PathFMTools

Computational pathology tools for whole‑slide image (WSI) processing: segmentation → tile extraction → feature and zero‑shot embeddings → analysis and visualization. The library centers on robust per‑slide HDF5 artifacts, a pluggable embedding‑model registry, and scalable multi‑process/multi‑GPU dispatch.

- End‑to‑end WSI pipeline with Typer CLIs
- Foreground segmentation (Otsu by default) and non‑overlapping patch grid
- Per‑slide HDF5 store with atomic writes and schema validation
- Embedding model registry (CONCH, MUSK, Virchow, UNI, Hibou, Midnight, H‑Optimus, PhiKon)
- Zero‑shot text alignment, text embedding cache, and patch↔text scoring
- Analysis helpers: concatenation across slides, clustering, and visualizations


## Installation

Requirements
- Python 3.12+
- Optional CUDA GPU for acceleration

Notes
- Model weights are fetched from model hubs (Hugging Face/timm) on first use; pre‑cache for offline runs.
- Some models (e.g., MUSK, CONCH) are heavy; GPU is strongly recommended.


## Quick Start

Process every slide matching a glob end‑to‑end on CPU (patchify + embeddings) and persist the results:

```bash
python -m pathfmtools.cli.process_slides \
  --slide-path "/data/slides/*.svs" \
  --store-root /data/pathfmtools_store \
  --patch-size 224 \
  --model conch \
  --batch-size 64 \
  --n-workers 2
```

Multi‑GPU run (one long‑lived worker per device) with two models:

```bash
python -m pathfmtools.cli.process_slides \
  --slide-path "/data/slides/*.svs" \
  --store-root /data/pathfmtools_store \
  --gpu 0 --gpu 1 \
  --patch-size 224 \
  --model conch --model musk \
  --batch-size 128 \
  --delete-tiles
```

Tips
- `--patch-size` is required whenever slides still need preprocessing. Setting it explicitly keeps behavior deterministic, which matters most in multi‑device mode.
- Use `--help` to see a table of model capabilities (embedding dimensions, zero‑shot support).


## Command‑Line Interfaces

All CLIs are Python modules (there are no console scripts), so invoke them with `python -m ...`.

- Slide processing: segmentation → patch extraction → embeddings

  ```bash
  python -m pathfmtools.cli.process_slides --help
  
  # Example (CPU)
  python -m pathfmtools.cli.process_slides \
    --slide-path "/data/slides/*.svs" \
    --store-root /tmp/pathfmtools_store \
    --patch-size 224 \
    --batch-size 64 \
    --segmenter otsu \
    --model conch
  
  # Example (GPU 0 with auto device parsing)
  python -m pathfmtools.cli.process_slides \
    --slide-path "/data/slides/*.svs" \
    --store-root /tmp/pathfmtools_store \
    --gpu 0 \
    --patch-size 224 \
    --model conch --model musk \
    --batch-size 128 \
    --skip-zeroshot-embeddings  # if only feature embeddings are needed
  ```

  Key options
  - `--slide-path`: path or glob of WSIs
  - `--store-root`: output directory for per‑slide HDF5 files
  - `--gpu <id>`: CUDA device ID to use; repeat the flag for multiple devices, omit it to run on CPU
  - `--patch-size`: integer patch size (required if preprocessing is needed)
  - `--model <name>`: one or more embedding models (see capabilities table)
  - `--delete-tiles`: drop stored tile pixels after embedding to save space
  - `--no-auto-rescale`: disable magnification‑aware automatic rescaling
  - `--skip-feature-embeddings` / `--skip-zeroshot-embeddings`: control which outputs are saved

- Text embedding cache builder (for zero‑shot classification):

  ```bash
  python -m pathfmtools.cli.embed_text \
    --model-name conch \
    --text-fpath ./classes.txt \
    --device cpu \
    --out-dir ./cache_dir
  ```

  This writes `<out-dir>/<model>_text_embeddings.h5` and saves run info under `<out-dir>/run_info/`.


## Python API

Slide preprocessing and embedding:

```python
from pathlib import Path
import torch
from pathfmtools.image import Slide
from pathfmtools.embedding_models import get_embedding_model

store = Path("/tmp/pathfmtools_store")
slide = Slide(slide_path=Path("/data/s1.svs"), store_root=store)

# Segmentation + tiling (writes tiles, metadata, segmentation mask)
slide.preprocess(patch_size=224, segmenter="otsu")

# Patch embeddings (and zero‑shot if supported by the model)
Model = get_embedding_model("conch")
model = Model(device=torch.device("cuda:0"))  # or torch.device("cpu")
slide.embed_tiles(model, batch_size=64)
```
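
The preprocessing step above combines Otsu thresholding with a non-overlapping patch grid. The sketch below illustrates both ideas in plain Python on a tiny grayscale image. It is not the library's implementation: `otsu_threshold`, `patch_grid`, `foreground_patches`, and the `min_foreground` cutoff are hypothetical names chosen for illustration, and it assumes tissue is darker than the bright slide background.

```python
def otsu_threshold(pixels, n_levels=256):
    """Return the gray level t that maximizes between-class variance
    for the two classes {p <= t} and {p > t}."""
    hist = [0] * n_levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0  # pixel count and intensity sum of the darker class
    for t in range(n_levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2  # proportional to between-class variance
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def patch_grid(width, height, patch_size):
    """Top-left (x, y) of every full, non-overlapping patch (partial edge patches are dropped)."""
    return [
        (x, y)
        for y in range(0, height - patch_size + 1, patch_size)
        for x in range(0, width - patch_size + 1, patch_size)
    ]


def foreground_patches(image, patch_size, min_foreground=0.5):
    """Keep patches whose fraction of dark (tissue) pixels is at least min_foreground."""
    height, width = len(image), len(image[0])
    t = otsu_threshold([p for row in image for p in row])
    keep = []
    for x, y in patch_grid(width, height, patch_size):
        dark = sum(
            image[y + dy][x + dx] <= t
            for dy in range(patch_size)
            for dx in range(patch_size)
        )
        if dark / patch_size**2 >= min_foreground:
            keep.append((x, y))
    return keep


# Toy 4x4 image: dark "tissue" on the left half, bright background on the right.
image = [[50, 50, 240, 240] for _ in range(4)]
kept = foreground_patches(image, patch_size=2)
```

Here `kept` contains only the two left-hand patches; real slides are thresholded on a downsampled thumbnail rather than pixel by pixel at full resolution.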

Zero‑shot patch classification:

```python
from pathlib import Path
import torch
from pathfmtools.analysis.zeroshot_classification import ZeroShotPatchClassifier

# `slide` is the preprocessed and embedded Slide from the previous example
zc = ZeroShotPatchClassifier(text_embedding_cache_fpath=Path("./text_cache.h5"))
out = slide.run_zeroshot_classification(
    zero_shot_classifier=zc,
    model_name="conch",
    text_list=["tumor", "stroma", "necrosis"],
    device=torch.device("cpu"),
)
# out["logits"]["tumor"], out["probabilities"]["stroma"], out["class_predictions"]
```
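
Patch↔text scoring in CLIP-style models is commonly computed as a scaled cosine similarity between a patch embedding and each class's text embedding, followed by a softmax. The sketch below shows that recipe in plain Python for a single patch. It assumes this common formulation; the `temperature` value and the helper names are illustrative, and `ZeroShotPatchClassifier` may scale or normalize differently.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_scores(patch_embedding, text_embeddings, temperature=0.07):
    """Score one patch against a dict of {label: text_embedding}."""
    p = l2_normalize(patch_embedding)

    # Logit = cosine similarity divided by a temperature (illustrative value).
    logits = {}
    for label, text_vec in text_embeddings.items():
        t = l2_normalize(text_vec)
        logits[label] = sum(a * b for a, b in zip(p, t)) / temperature

    # Numerically stable softmax over the class logits.
    m = max(logits.values())
    exps = {label: math.exp(v - m) for label, v in logits.items()}
    z = sum(exps.values())
    probabilities = {label: e / z for label, e in exps.items()}

    prediction = max(probabilities, key=probabilities.get)
    return {
        "logits": logits,
        "probabilities": probabilities,
        "class_prediction": prediction,
    }


scores = zero_shot_scores(
    patch_embedding=[1.0, 0.1],
    text_embeddings={"tumor": [1.0, 0.0], "stroma": [0.0, 1.0]},
)
```

The patch vector points almost entirely along the `tumor` text vector, so `scores["class_prediction"]` is `"tumor"`.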

Concatenate embeddings across slides and cluster:

```python
from sklearn.cluster import KMeans
from pathfmtools.image import SlideGroup

# slide1, slide2, slide3: Slide objects already embedded with the "conch" model
slides = [slide1, slide2, slide3]
sg = SlideGroup(slides)
arr = sg.get_concatenated_embedding_array(
    model_name="conch",
    embedding_type="patch_feature_embeddings",
)
km = KMeans(n_clusters=20, random_state=0).fit(arr)
labels = km.labels_
# Map results back to source slides/patches
vals_by_slide = sg.map_vals_to_source_patches(labels, "patch_feature_embeddings")
```
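
Mapping per-row results back to their source slides comes down to remembering how many patches each slide contributed to the concatenated array. A minimal sketch of that bookkeeping with plain lists follows; the `concatenate` and `split_by_slide` helpers and the slide IDs are hypothetical and are not part of `SlideGroup`.

```python
def concatenate(embeddings_by_slide):
    """Stack per-slide rows and record how many rows each slide contributed."""
    rows, counts = [], {}
    for slide_id, embeddings in embeddings_by_slide.items():
        rows.extend(embeddings)
        counts[slide_id] = len(embeddings)
    return rows, counts


def split_by_slide(values, counts):
    """Slice one value per concatenated row back into per-slide lists."""
    if len(values) != sum(counts.values()):
        raise ValueError("values must have one entry per concatenated row")
    out, start = {}, 0
    for slide_id, n in counts.items():
        out[slide_id] = list(values[start:start + n])
        start += n
    return out


rows, counts = concatenate({
    "s1": [[0.0, 1.0], [1.0, 0.0]],  # two patches
    "s2": [[0.5, 0.5]],              # one patch
})
cluster_labels = [3, 7, 3]  # e.g. one cluster label per concatenated row
labels_by_slide = split_by_slide(cluster_labels, counts)
```

This relies on dictionaries preserving insertion order, so the slices come back in the same slide order used for concatenation.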


## Data Model (HDF5)

Each processed slide has one HDF5 file under `store_root` named `<slide_id>.h5` with a validated schema version.

Top‑level keys (when present):
- `slide_metadata` — JSON: slide dimensions, magnification, patch size, original path, segmentation stats
- `tile_segmentation_mask` — boolean grid over the patch grid
- `tile_metadata` — group of per‑tile arrays: row/col, top‑left x/y, width/height (square)
- `tiles` — optional RGB uint8 tiles `(N, P, P, 3)`
- `tile_embeddings/{canon}/feature` — float16 `(N, D)` patch feature embeddings
- `tile_embeddings/{canon}/zeroshot` — float16 `(N, Dz)` patch zero‑shot embeddings
- `tile_embeddings/{canon}/meta` — JSON with model info and dims
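
Because the layout is plain HDF5, stored embeddings can also be inspected directly with `h5py`. The sketch below builds the dataset paths listed above and reads the feature matrix. It assumes `h5py` is installed and that `{canon}` is the canonical model name (for example `conch`); the helper names are hypothetical, and reading through `SlideDataStore` is the safer route because it applies the schema checks.

```python
def embedding_keys(canon):
    """HDF5 paths for one model's embeddings, following the layout above."""
    base = f"tile_embeddings/{canon}"
    return {
        "feature": f"{base}/feature",
        "zeroshot": f"{base}/zeroshot",
        "meta": f"{base}/meta",
    }


def read_feature_embeddings(h5_path, canon):
    """Return the (N, D) feature matrix for `canon` as a NumPy array."""
    import h5py  # third-party dependency, imported lazily

    key = embedding_keys(canon)["feature"]
    with h5py.File(str(h5_path), "r") as f:  # read-only access
        if key not in f:
            raise KeyError(f"{key!r} not found in {h5_path}")
        return f[key][()]  # load the whole dataset into memory


keys = embedding_keys("conch")
```

A call such as `read_feature_embeddings("/data/pathfmtools_store/<slide_id>.h5", "conch")` would then return the stored float16 array, provided that model has been run on the slide.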

See `pathfmtools/io/schema.py` for authoritative names/dtypes and `SlideDataStore` for IO behavior (atomic writes, locking, schema checks).
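
Atomic writes are typically implemented by writing to a temporary file in the destination directory and then renaming it over the target, so readers never observe a half-written file. The sketch below shows that general pattern using only the standard library. It illustrates the technique and is not the code in `SlideDataStore`; `atomic_write_bytes` is a hypothetical helper.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(target, data):
    """Write `data` to `target` so that the file is either fully old or fully new."""
    target = Path(target)
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, target)  # atomic replacement of the target
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


demo_dir = Path(tempfile.mkdtemp())
demo_file = demo_dir / "slide.h5"
atomic_write_bytes(demo_file, b"version 1")
atomic_write_bytes(demo_file, b"version 2")
```

After both calls the directory contains only `slide.h5` holding the second payload, with no leftover temporary files.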


## Embedding Models

Registered models (subject to change; run the CLI `--help` for a live table):
- `conch` (feature + zero‑shot + text)
- `musk` (feature + zero‑shot + text)
- `virchow` (feature)
- `uni` (feature)
- `hibou` (feature)
- `midnight` (feature)
- `h_optimus` (feature)
- `phikon` (feature)

Programmatic access:

```python
from pathfmtools.embedding_models import get_embedding_model, list_available_models, get_capabilities
Model = get_embedding_model("conch")
print(list_available_models())           # ["conch", "musk", ...]
print(get_capabilities("conch"))         # dims and zero‑shot/text support
```

Extend by decorating a subclass of `EmbeddingModel` with `@register_model("name", ...)`.
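
A decorator-based registry of this kind can be sketched in a few lines. The code below is a self-contained toy, not the library's code: the `EmbeddingModel` base class, the `register_model` signature, the `feature_dim` keyword, and `lookup_model` are stand-ins defined here for illustration, so check `pathfmtools.embedding_models` for the real interface before subclassing.

```python
_REGISTRY = {}


class EmbeddingModel:
    """Toy base class standing in for the library's EmbeddingModel."""

    def embed(self, batch):
        raise NotImplementedError


def register_model(name, **capabilities):
    """Class decorator that records a model class under `name`."""
    def decorator(cls):
        if name in _REGISTRY:
            raise ValueError(f"Model {name!r} is already registered")
        if not issubclass(cls, EmbeddingModel):
            raise TypeError("Registered models must subclass EmbeddingModel")
        cls.capabilities = capabilities  # e.g. dimensions, zero-shot support
        _REGISTRY[name] = cls
        return cls
    return decorator


def lookup_model(name):
    """Return the registered class (not an instance) for `name`."""
    try:
        return _REGISTRY[name]
    except KeyError:
        raise KeyError(
            f"Unknown model {name!r}; available: {sorted(_REGISTRY)}"
        ) from None


@register_model("mean_rgb", feature_dim=3)
class MeanRGB(EmbeddingModel):
    """Trivial 'model': embeds a tile as its mean RGB value."""

    def embed(self, batch):
        # batch: list of tiles, each tile a list of (r, g, b) pixels
        return [
            [sum(px[c] for px in tile) / len(tile) for c in range(3)]
            for tile in batch
        ]


Model = lookup_model("mean_rgb")
features = Model().embed([[(0, 0, 0), (2, 4, 6)]])
```

Registering at class-definition time means a new model becomes discoverable as soon as its module is imported, without editing a central list.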


## Reproducibility and Performance

- Deterministic by default where feasible.
- Multiprocessing uses one long‑lived worker per device; patch size must be explicit when preprocessing.
- To minimize storage, run with `--delete-tiles` after embeddings are written; metadata and embeddings remain.
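
The one-worker-per-device pattern can be pictured as a shared queue of slides drained by a fixed set of long-lived workers, each pinned to one device so that a model is loaded once per device rather than once per slide. The sketch below uses threads as stand-ins for the per-device processes and a string as a stand-in for the loaded model; `run_workers` and its arguments are hypothetical, and the library's actual dispatch code may differ.

```python
import queue
import threading


def run_workers(slide_paths, devices):
    """Process every slide exactly once using one long-lived worker per device."""
    tasks = queue.Queue()
    for path in slide_paths:
        tasks.put(path)

    results = {}
    lock = threading.Lock()

    def worker(device):
        # Stand-in for an expensive, one-time model load on this device.
        loaded_model = f"model@{device}"
        while True:
            try:
                path = tasks.get_nowait()
            except queue.Empty:
                return  # no work left; the worker exits
            with lock:
                results[path] = loaded_model  # record which worker handled it

    threads = [threading.Thread(target=worker, args=(d,)) for d in devices]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


handled = run_workers(["a.svs", "b.svs", "c.svs"], ["cuda:0", "cuda:1"])
```

Which device handles which slide depends on scheduling, but every slide is processed exactly once and each worker reuses the model it loaded at startup.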


## Troubleshooting

- CUDA device parsing: `cpu`, `cuda`, `cuda:0`, or integers are accepted; errors are raised if unavailable/out of range.
- Only zero‑shot embeddings requested, but the model doesn't support them: the CLI either filters out such models or raises an error.
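
The accepted device spellings can be normalized with a small parser. This is a sketch of one plausible set of rules, not the library's parser: `parse_device` is a hypothetical helper, it returns strings rather than `torch.device` objects, and it takes the number of visible GPUs as an argument (in practice that value could come from `torch.cuda.device_count()`).

```python
def parse_device(spec, n_gpus):
    """Normalize 'cpu', 'cuda', 'cuda:N', or an integer into 'cpu' or 'cuda:N'."""
    if isinstance(spec, int):
        index = spec
    else:
        text = str(spec).strip().lower()
        if text == "cpu":
            return "cpu"
        if text == "cuda":
            index = 0  # bare 'cuda' is treated as the first device here
        elif text.isdigit():
            index = int(text)
        elif text.startswith("cuda:") and text[5:].isdigit():
            index = int(text[5:])
        else:
            raise ValueError(f"Unrecognized device spec: {spec!r}")

    if n_gpus == 0:
        raise RuntimeError("A CUDA device was requested but none is available")
    if not 0 <= index < n_gpus:
        raise ValueError(
            f"CUDA device index {index} is out of range for {n_gpus} GPU(s)"
        )
    return f"cuda:{index}"


parsed = [parse_device(s, n_gpus=2) for s in ("cpu", "cuda", "cuda:1", 0, "1")]
```

Validating the index up front surfaces a bad `--gpu` value before any slide is opened or any model is loaded.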