# NetBurst Training & Inference (Anonymized)

Distributed training, fine-tuning, and inference utilities for NetBurst's Chronos-based token modeling of network traffic time series.

The primary implementation lives in `NetBurst.py`, which exposes several entry functions. An entry function is invoked by passing its name via a launcher script or by selecting the corresponding CLI block. Example SLURM launchers (`pretrain.sl`, `inference.sl`) are provided with anonymized placeholders.

## Key Subcommands / Entry Functions

| Function | Purpose | Typical Launch |
|----------|---------|----------------|
| `main` | Supervised / fine-tuning loop producing `boundaries.pkl` + `chronos_best.pt` | `torchrun --nproc_per_node=4 NetBurst.py main /PARQUET_ROOT --epochs 5 --save_dir /MODEL_OUT` |
| `inference_auto_regression` | Auto-regressive continuation over held-out horizons given context fraction | `torchrun --nproc_per_node=4 NetBurst.py inference_auto_regression /PARQUET_ROOT --model /MODEL_OUT --save_pkl results.pkl` |
| `get_representations_with_ip` | Extract latent representations per (ip, source_file) retaining identity | `python NetBurst.py get_representations_with_ip /PARQUET_ROOT --model /MODEL_OUT` |

(Other internal utilities perform bin boundary generation, Chronos pipeline wrapping, DDP setup, etc.)

## Artifacts

| Artifact | Produced By | Description |
|----------|-------------|-------------|
| `boundaries.pkl` | `main` | Serialized numeric bin boundaries used for discretization/token mapping. |
| `chronos_best.pt` | `main` | Fine-tuned Chronos model (state dict). |
| `<save_dir>/metrics.json` (if added downstream) | (future) | Aggregated training metrics. |
| `AR_RESULTS.pkl` (name set by `--save_pkl`) | `inference_auto_regression` | List/dict of per-series predicted continuations and metadata. |
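
To make the role of `boundaries.pkl` concrete, here is a minimal sketch of mapping raw values to bin indices. It assumes the pickle holds a sorted sequence of floats; the actual file layout is not documented here, and `value_to_bin` is a hypothetical helper, not a function from `NetBurst.py`.

```python
import bisect
import pickle


def value_to_bin(value, boundaries):
    """Return the index of the bin that `value` falls into.

    Assumes `boundaries` is sorted ascending. Values below the first
    boundary map to bin 0; values at or above the last boundary map to
    len(boundaries).
    """
    return bisect.bisect_right(boundaries, value)


# Hypothetical boundaries; the real ones are produced by `main`.
boundaries = [0.0, 10.0, 100.0, 1000.0]

# Round-trip through pickle, mirroring how an artifact is saved and reloaded.
restored = pickle.loads(pickle.dumps(boundaries))

tokens = [value_to_bin(v, restored) for v in [-5.0, 0.0, 42.0, 5000.0]]
```

Because the mapping depends entirely on the boundary values, a model checkpoint is only meaningful alongside the `boundaries.pkl` it was trained with.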

## Core Arguments (Selected)

### `main`
- `parquet_root` (positional): Root containing parquet time-series files (recursive read).
- `--epochs` (int): Training epochs (default 5).
- `--batch_size` (int): Batch size per process (default 16).
- `--train_frac` (float): Train split fraction (default 0.8).
- `--max_len` (int, optional): Cap usable series length.
- `--save_dir` (str): Output directory for artifacts.
- `--retrain` (str): Path to prior checkpoint to resume / extend.
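
The arguments above could be declared roughly as follows. This is a hypothetical `argparse` sketch built only from the names, types, and defaults listed in this section; the real parser in `NetBurst.py` may differ, and the `None` defaults for `--max_len`, `--save_dir`, and `--retrain` are assumptions.

```python
import argparse


def build_main_parser():
    """Hypothetical parser mirroring the documented `main` arguments."""
    p = argparse.ArgumentParser(prog="NetBurst.py main")
    p.add_argument("parquet_root", help="Root directory of parquet files")
    p.add_argument("--epochs", type=int, default=5)
    p.add_argument("--batch_size", type=int, default=16)
    p.add_argument("--train_frac", type=float, default=0.8)
    # The defaults below are assumptions; the README does not state them.
    p.add_argument("--max_len", type=int, default=None)
    p.add_argument("--save_dir", type=str, default=None)
    p.add_argument("--retrain", type=str, default=None)
    return p


# Parse an explicit argument list instead of sys.argv so the sketch is testable.
args = build_main_parser().parse_args(
    ["/ABS/PARQUET_ROOT", "--epochs", "10", "--save_dir", "/ABS/MODEL_OUT"]
)
```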

### `inference_auto_regression`
- `parquet_root`: Input parquet root.
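- `--model`: Directory containing `boundaries.pkl` + `chronos_best.pt`.
- `--save_pkl`: Output pickle path for the predicted continuations.
- `--ctx_frac`: Fraction of each series used as context; the remainder is the held-out horizon.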
- `--ips_csv`: CSV with allowed `(ip, source_file)` pairs (optional).
- `--min_ctx`, `--min_h`: Enforce minimum context / horizon lengths.
- `--nonhierarchical`: Disable hierarchical parquet assumptions.
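
As an illustration of how a context fraction and the minimum-length flags might interact, the sketch below splits one series into a context and a held-out horizon. The function name, the default values, and the choice to skip a series that is too short (returning `None`) are all assumptions; the script's actual behavior is not documented here.

```python
def split_context_horizon(series, ctx_frac=0.75, min_ctx=1, min_h=1):
    """Split `series` into (context, horizon) by fraction.

    Returns None when either part would be shorter than its minimum,
    which is one plausible way to enforce `--min_ctx` / `--min_h`.
    """
    n_ctx = int(len(series) * ctx_frac)
    context, horizon = series[:n_ctx], series[n_ctx:]
    if len(context) < min_ctx or len(horizon) < min_h:
        return None
    return context, horizon


series = list(range(8))

# 75% of 8 points gives a 6-point context and a 2-point horizon.
context, horizon = split_context_horizon(series, ctx_frac=0.75)

# Requiring a 4-point horizon rejects the same series.
rejected = split_context_horizon(series, ctx_frac=0.75, min_h=4)
```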

### `get_representations_with_ip`
- `parquet_root`: Input parquet root.
- `--model`: Fine-tuned model directory.
- `--ips_csv`: Filter list for IP scoping.
- `--output_csv`: Destination CSV for embeddings/representations.
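
The `--ips_csv` filter can be pictured as a set-membership test on `(ip, source_file)` pairs. The sketch below assumes the CSV has a header row with columns named `ip` and `source_file`; that header, the helper names, and the sample values are hypothetical.

```python
import csv
import io


def load_allowed_pairs(csv_file):
    """Read (ip, source_file) pairs from a CSV with a header row."""
    reader = csv.DictReader(csv_file)
    return {(row["ip"], row["source_file"]) for row in reader}


def filter_records(records, allowed):
    """Keep only records whose (ip, source_file) pair is allowed."""
    return [r for r in records if (r["ip"], r["source_file"]) in allowed]


# In-memory stand-in for an `--ips_csv` file; addresses are documentation placeholders.
csv_text = "ip,source_file\n192.0.2.1,a.parquet\n192.0.2.2,b.parquet\n"
allowed = load_allowed_pairs(io.StringIO(csv_text))

records = [
    {"ip": "192.0.2.1", "source_file": "a.parquet"},
    {"ip": "192.0.2.1", "source_file": "b.parquet"},  # pair not listed
    {"ip": "192.0.2.3", "source_file": "a.parquet"},  # ip not listed
]
kept = filter_records(records, allowed)
```

Matching on the pair rather than on the IP alone keeps the same address in different source files distinct.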

## Launch Examples

### Pretraining / Fine-Tuning
```bash
torchrun --nproc_per_node=4 NetBurst.py main /ABS/PARQUET_ROOT \
  --epochs 10 \
  --batch_size 32 \
  --save_dir /ABS/MODEL_OUT \
  --max_len 512
```

### Auto-Regressive Inference
```bash
torchrun --nproc_per_node=4 NetBurst.py inference_auto_regression /ABS/PARQUET_ROOT \
  --model /ABS/MODEL_OUT \
  --save_pkl ar_outputs.pkl \
  --ctx_frac 0.75 \
  --batch_size 32
```

### Representation Extraction
```bash
python NetBurst.py get_representations_with_ip /ABS/PARQUET_ROOT \
  --model /ABS/MODEL_OUT \
  --output_csv ip_reprs.csv
```

## Distributed Training Notes
- Uses PyTorch DDP (NCCL backend). The script initializes with `init_method="env://"`, so the launcher (`torchrun`, or a SLURM script that sets them) must export `RANK`, `WORLD_SIZE`, and `LOCAL_RANK`, along with `MASTER_ADDR` and `MASTER_PORT`.
- Set `--nproc_per_node` to your GPU count; multi-node training requires additional rendezvous configuration (not included here).
- Increase `--epochs` and tune `--batch_size` to fit GPU memory. AMP is not yet supported (future enhancement); once added, it would reduce memory use.
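
A quick pre-flight check for the rank-related variables `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` can surface launcher problems before process-group initialization hangs. The helper below is a hypothetical sketch using only the standard library; it is not part of `NetBurst.py`.

```python
import os

REQUIRED_VARS = ("RANK", "WORLD_SIZE", "LOCAL_RANK")


def read_ddp_env(env=None):
    """Return rank info as ints, or raise if the launcher did not set it."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise RuntimeError(f"Missing DDP environment variables: {missing}")
    info = {name: int(env[name]) for name in REQUIRED_VARS}
    if not 0 <= info["RANK"] < info["WORLD_SIZE"]:
        raise RuntimeError("RANK must be in [0, WORLD_SIZE)")
    return info


# After this check, a script would typically call
# torch.distributed.init_process_group(backend="nccl", init_method="env://")
# and select its GPU with torch.cuda.set_device(info["LOCAL_RANK"]).
```

Passing a plain dict as `env` makes the check easy to exercise outside a real launch, for example `read_ddp_env({"RANK": "0", "WORLD_SIZE": "4", "LOCAL_RANK": "0"})`.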

## Data Format Expectations
Each parquet file should contain, at minimum, the numeric sequences or arrays that the Chronos tokenizer consumes, plus the identifying keys `ip` and `source_file`. Some routines expect arrays that are already aggregated or bucketed; the preprocessing scripts in `../preprocess` produce suitable structures.

## Recommended Directory Layout
```
DATA_ROOT/
  parquet/           # aggregated series
  model_out/         # saved fine-tuned artifacts
  inference_runs/    # AR *.pkl outputs
```

## Reproducibility
- Control randomness: set global seeds for Python, NumPy, and Torch (extend the code if fully deterministic runs are required).
- Persist the exact commit hash and base Chronos model tag.
- Keep `boundaries.pkl` version-aligned with tokenizer changes.
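
One common way to set the global seeds is a single helper called at startup. The sketch below is an assumption about how this could be done, not code from `NetBurst.py`; it seeds Python's `random` unconditionally and seeds NumPy and PyTorch only when they are installed.

```python
import random


def seed_everything(seed):
    """Seed Python, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
    except ImportError:
        pass


# Re-seeding with the same value reproduces the same random draws.
seed_everything(1234)
first = [random.random() for _ in range(3)]
seed_everything(1234)
second = [random.random() for _ in range(3)]
```

Seeding alone does not guarantee bitwise-identical GPU results; some CUDA kernels are nondeterministic unless PyTorch's deterministic settings are also enabled.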

## Anonymization Reminder
All explicit organization, user, and host identifiers have been removed. When replacing placeholders, do so deliberately, and re-run a path scan before submitting any public artifact.
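
A path scan can be as simple as a regular-expression search over the files to be published. The sketch below is illustrative only: the pattern, the file suffixes, and the function name are hypothetical and should be adapted to the identifiers that matter for your site.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical pattern: user-specific directories under common roots.
SUSPECT = re.compile(r"/(?:home|Users|scratch)/[A-Za-z0-9_.-]+")


def scan_for_paths(root, suffixes=(".py", ".sl", ".md")):
    """Return (file name, line number, match) for each suspect path under `root`."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for m in SUSPECT.finditer(line):
                hits.append((path.name, lineno, m.group(0)))
    return hits


# Small self-check on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "clean.sl").write_text("#SBATCH --job-name=netburst\n")
    Path(tmp, "leaky.sl").write_text("cd /ABS/PARQUET_ROOT\ncd /home/someuser/data\n")
    demo_hits = scan_for_paths(tmp)
```

Placeholders such as `/ABS/PARQUET_ROOT` pass the scan, while the user-specific path in `leaky.sl` is reported with its line number.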
