# NetBurst (Anonymized)

This directory contains preprocessing and training code for the NetBurst framework: large-scale network traffic time-series representation, modeling, and inference using tokenized byte-volume and inter-event interval (IEI) sequences.

All paths and dataset identifiers have been intentionally anonymized (e.g., `/PATH`, `/INPUT_DATA_PATH`, `ChronosModelPath`) for artifact / double‑blind review. Replace these placeholders with your own absolute or project-relative paths when reproducing.

## Contents

- `preprocess/` – Spark + MPI preprocessing pipelines that transform parquet datasets derived from raw packet captures (PCAP) into modeling-ready serialized time series (byte-count, sparse, or IEI forms), plus Fano factor analysis utilities.
- `train/` – Model training, pretraining, and inference scripts built around Chronos time-series token models plus custom binning, hierarchical aggregation, and auto-regressive (AR) inference utilities.

## Environment Overview

Core technologies:
- Python 3.10+ (recommend isolated virtualenv / conda)
- PyTorch (Distributed Data Parallel + `torchrun`)
- PySpark (high driver / executor memory requirements in examples are placeholders – tune to cluster)
- MPI / SLURM job arrays (for parallel preprocessing)
- Chronos time-series foundation models
- Auxiliary: `numpy`, `pandas`, `pyarrow`, `statsmodels`, `scikit-learn`, `matplotlib`

You will also need system tools for some stages:
- `tshark` (packet field extraction)
- GNU `parallel` (optional helper in some external scripts)

## Data Modalities Produced

| Modality | Script(s) | Description |
|----------|-----------|-------------|
| Byte Count Timeseries | `BITimeseries.py`, `SparseTimeSeries.py` | Aggregates bidirectional byte counts into fixed-width time bins (e.g., 100 ms → 1 s) with threshold filtering. Sparse variant reduces storage via pruning. |
| IEI (Inter-Event Interval) | `IBGTimeSeries.py` | Derives IEI sequences from activity bins, optionally converting to millisecond scale. |
| PCAP → Parquet Extraction | `PcapToDf_multi_node.py` + `PCAPtoDF_multi_node10.slurm` | Distributed conversion of raw PCAPs to columnar parquet with selected protocol fields. |
| Statistical Diagnostics | `FanoFactorPlotAllDatasets.py` + `FanoFactorAnalysis*.slurm` | Computes Fano factor distributions, CCDFs, and ACF diagnostics across series. |
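
The byte-count and IEI modalities in the table can be illustrated with a small plain-Python sketch. It assumes packets arrive as `(timestamp_ms, n_bytes)` pairs, that the threshold applies to the total bytes per bin, and that an IEI is the gap between consecutive active bins. The function names are hypothetical; the actual scripts run on Spark and may define these steps differently.

```python
from collections import defaultdict


def bin_bytes(packets, bin_ms=1000, threshold=100):
    """Sum bytes per fixed-width bin; keep only bins at or above `threshold`.

    `packets` is an iterable of (timestamp_ms, n_bytes) pairs (assumed layout).
    Returns a sparse mapping {bin_index: total_bytes}.
    """
    totals = defaultdict(int)
    for ts_ms, n_bytes in packets:
        totals[ts_ms // bin_ms] += n_bytes
    return {b: v for b, v in sorted(totals.items()) if v >= threshold}


def inter_event_intervals(active_bins, bin_ms=1000, to_ms=True):
    """Gaps between consecutive active bins, in bins or in milliseconds."""
    idx = sorted(active_bins)
    gaps = [b - a for a, b in zip(idx, idx[1:])]
    return [g * bin_ms for g in gaps] if to_ms else gaps


packets = [(50, 80), (400, 60), (1200, 30), (3100, 500), (7900, 200)]
series = bin_bytes(packets)            # bin 1 (30 bytes) falls below the threshold
ieis = inter_event_intervals(series)   # gaps between active bins 0, 3 and 7
```

Storing only the bins that survive the threshold is one way to read the "sparse" variant: quiet periods cost no storage, and the IEI sequence can be recovered from the bin indices alone.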

## Preprocessing Workflow

1. (Optional) Extract packet-level fields to parquet (if starting from PCAP):
   ```bash
   srun python PcapToDf_multi_node.py /ABS/INPUT/PCAP_DIR /ABS/OUTPUT/PARQUET_DIR
   ```
2. Generate aggregated byte-count or IEI series (choose one or more):
   ```bash
   # Byte count (1 s bins example)
   srun python BITimeseries.py --bin_ms 1000 --output_dir /ABS/OUTPUT/BYTE_TS --threshold 100

   # Sparse reduction variant
   srun python SparseTimeSeries.py --bin_ms 1000 --output_dir /ABS/OUTPUT/SPARSE_TS --threshold 100 --max_examples 500000

   # IEI sequences
   srun python IBGTimeSeries.py --bin_ms 1000 --output_dir /ABS/OUTPUT/IEI_TS --threshold 100
   ```
3. (Optional) Fano factor analysis & plots:
   ```bash
   srun python FanoFactorPlotAllDatasets.py --config datasets.yaml --out_dir /ABS/OUTPUT/FANO --xlog
   ```
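
For orientation, the Fano factor of a count series is its variance divided by its mean; a Poisson process gives 1, and values above 1 indicate over-dispersed (bursty) traffic. The sketch below uses only the standard library and assumes the population variance and non-overlapping windows. `FanoFactorPlotAllDatasets.py` may use different estimators, and both function names here are hypothetical.

```python
from statistics import fmean, pvariance


def fano_factor(counts):
    """Variance-to-mean ratio of a count series; None if the mean is zero."""
    mean = fmean(counts)
    if mean == 0:
        return None
    return pvariance(counts, mu=mean) / mean


def fano_by_window(counts, window_sizes):
    """Fano factor after summing counts over non-overlapping windows."""
    out = {}
    for w in window_sizes:
        n_full = len(counts) // w
        if n_full < 2:
            continue  # too few windows for a meaningful variance
        summed = [sum(counts[i * w:(i + 1) * w]) for i in range(n_full)]
        out[w] = fano_factor(summed)
    return out


bursty = [0, 0, 0, 12, 0, 0, 0, 12]
steady = [3, 3, 3, 3, 3, 3, 3, 3]
per_window = fano_by_window(bursty, [1, 4, 8])
```

Both toy series have the same mean, but only `bursty` has a Fano factor above 1 at the finest scale. Its value drops to 0 at a window of 4 because every window then holds exactly one burst, which is why the statistic is usually inspected across several window sizes.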

Placeholders `/PATH` inside preprocessing scripts should be replaced with the parquet root generated in step 1 or existing aggregated parquet storage.

## Training & Pretraining

The entry points live in `train/NetBurst.py` as subcommands implemented as functions. Pretraining and auto-regressive inference are launched with `torchrun` for distributed multi-GPU execution; representation extraction runs as a plain `python` process.

### 1. Pretraining (Token Boundary / Chronos Fine-Tuning)
Example minimal run (adjust batch size, model path, epochs):
```bash
torchrun --nproc_per_node=4 NetBurst.py main /ABS/PARQUET_ROOT \
  --epochs 5 \
  --save_dir /ABS/MODEL_OUT \
  --max_len 512
```
Key arguments (main):
- `parquet_root`: Root directory containing hierarchical parquet (supports recursive read).
- `--epochs`: Training epochs (default 5 for quick iteration).
- `--max_len`: Optional filter on the maximum usable sequence length.
- `--save_dir`: Directory where checkpoints, `boundaries.pkl`, and `chronos_best.pt` are written.
- `--retrain`: Path to prior run directory to resume / further fine-tune.
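
To make the role of `boundaries.pkl` concrete, the following sketch fits bin boundaries on training values, pickles them, and reuses the restored boundaries to tokenize new values. Quantile-based boundaries are an assumption made for illustration; the binning rule in `NetBurst.py` is not described here, and `fit_boundaries` / `tokenize` are hypothetical names.

```python
import bisect
import pickle
from statistics import quantiles


def fit_boundaries(values, n_bins=4):
    """Quantile cut points giving `n_bins` bins (an assumed binning rule)."""
    return quantiles(values, n=n_bins, method="inclusive")


def tokenize(values, boundaries):
    """Map each value to the index of the bin it falls into."""
    return [bisect.bisect_right(boundaries, v) for v in values]


train_values = [0, 10, 20, 30, 40, 50, 60, 70, 80]
boundaries = fit_boundaries(train_values)

# Round-trip through pickle, as a stand-in for saving/loading boundaries.pkl.
restored = pickle.loads(pickle.dumps(boundaries))
tokens = tokenize([5, 20, 45, 999], restored)
```

Because token IDs only have meaning relative to the boundaries they were produced with, inference needs the same boundaries that were used during training.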

### 2. Auto-Regressive Inference
Requires a trained model directory containing `boundaries.pkl` and `chronos_best.pt`.
```bash
torchrun --nproc_per_node=4 NetBurst.py inference_auto_regression /ABS/PARQUET_ROOT \
  --model /ABS/MODEL_OUT \
  --save_pkl AR_RESULTS.pkl \
  --ips_csv OptionalFilterIPs.csv \
  --nonhierarchical
```
Important arguments (inference_auto_regression):
- `parquet_root`: Input parquet root (hierarchical or flat).
- `--model`: Directory with trained artifacts.
- `--ips_csv`: Optional CSV restricting (ip, source_file) pairs.
- `--min_ctx`, `--min_h`: Minimum context and horizon lengths that a series must provide.
- `--nonhierarchical`: Treat input parquet structure as flat (no multi-level partition columns).
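
As a rough illustration of what auto-regressive inference with minimum context and horizon lengths involves, the sketch below splits a series into context and horizon and then rolls a predictor forward one step at a time. `predict_next` is a toy stand-in for the Chronos model, and all function names are hypothetical rather than taken from `NetBurst.py`.

```python
def split_context_horizon(series, horizon, min_ctx=1, min_h=1):
    """Split a series into (context, horizon); None if either part is too short."""
    if horizon < max(min_h, 1) or len(series) - horizon < min_ctx:
        return None
    return series[:-horizon], series[-horizon:]


def rollout(context, steps, predict_next):
    """Feed each prediction back into the history to forecast `steps` values."""
    history = list(context)
    forecast = []
    for _ in range(steps):
        nxt = predict_next(history)
        forecast.append(nxt)
        history.append(nxt)
    return forecast


def last_two_mean(history):
    """Toy stand-in for a model: average of the last two values."""
    return sum(history[-2:]) / 2


series = [2, 4, 6, 8, 10, 12]
context, target = split_context_horizon(series, horizon=2, min_ctx=3, min_h=2)
forecast = rollout(context, len(target), last_two_mean)
```

Series that cannot satisfy both minimums return `None` and would simply be skipped, which keeps every evaluated forecast comparable in length.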

### 3. Representation Extraction with IP Preservation
```bash
python NetBurst.py get_representations_with_ip /ABS/PARQUET_ROOT \
  --model /ABS/MODEL_OUT \
  --ips_csv OptionalFilterIPs.csv \
  --output_csv ip_representations.csv
```
Stores per-series representations while retaining IP + source_file pairing.
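
The idea of keeping the `(ip, source_file)` pairing next to each representation can be sketched as follows. The CSV headers `ip` and `source_file`, the `dim_i` column naming, and both helper functions are assumptions for illustration; the real output layout of `get_representations_with_ip` may differ.

```python
import csv
import io


def load_pairs(csv_text):
    """Read allowed (ip, source_file) pairs from CSV text with those headers."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {(row["ip"], row["source_file"]) for row in reader}


def write_representations(rows, allowed, out):
    """Write one CSV row per kept series: ip, source_file, then the vector."""
    writer = csv.writer(out, lineterminator="\n")
    dim = len(rows[0]["vector"]) if rows else 0
    writer.writerow(["ip", "source_file"] + [f"dim_{i}" for i in range(dim)])
    kept = 0
    for row in rows:
        key = (row["ip"], row["source_file"])
        if allowed is not None and key not in allowed:
            continue  # pair not listed in the optional filter
        writer.writerow([*key, *row["vector"]])
        kept += 1
    return kept


allowed = load_pairs("ip,source_file\n192.0.2.1,capA\n")
rows = [
    {"ip": "192.0.2.1", "source_file": "capA", "vector": [0.1, 0.2]},
    {"ip": "192.0.2.1", "source_file": "capB", "vector": [0.3, 0.4]},
]
buffer = io.StringIO()
kept = write_representations(rows, allowed, buffer)
```

Filtering on the pair rather than the IP alone keeps the same address seen in two captures as two distinct series.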

## Distributed & Resource Notes

- Set `CUDA_VISIBLE_DEVICES` or rely on scheduler (SLURM) for GPU binding.
- `torchrun --nproc_per_node=N` spawns N processes; script internally uses NCCL backend (`dist.init_process_group`).
- Spark driver/executor memory settings in the preprocessing scripts (`200g`) are placeholders; tune them to your cluster limits.
- If your parquet layout is nested, ensure reads use recursive file lookup (add `.option("recursiveFileLookup", "true")` where needed).
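
`torchrun` exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to every process it spawns. The sketch below reads them with the standard library and shows one possible way to split work across ranks. How `NetBurst.py` consumes these variables is not documented here, so `DistEnv`, `read_dist_env`, and `shard` are hypothetical helpers.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    rank: int
    local_rank: int
    world_size: int


def read_dist_env(environ=None):
    """Read the rank variables that torchrun exports to each worker process.

    Falls back to a single-process layout when they are absent.
    """
    env = os.environ if environ is None else environ
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


def shard(items, dist_env):
    """Give each rank every world_size-th item so work is split without overlap."""
    return items[dist_env.rank::dist_env.world_size]


env = read_dist_env({"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "4"})
my_items = shard(list(range(10)), env)
```

In a PyTorch script, `local_rank` would typically select the GPU (for example via `torch.cuda.set_device`) before `dist.init_process_group(backend="nccl")` is called.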

## File Placeholder Reference

| Placeholder | Replace With |
|-------------|--------------|
| `/PATH` | Your aggregated parquet dataset root |
| `/INPUT_DATA_PATH` | Root directory of raw PCAP split set or staged parquet |
| `ChronosModelPath` | Base pretrained Chronos checkpoint (e.g., `amazon/chronos-t5-small`) |
| `/BIPATH/`, `/IBG_PATH/` | Convenience environment- or SLURM-substituted dataset roots |
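
Before launching a job it can help to confirm that no placeholder from the table is still present in a script. The helper below is a hypothetical convenience, not part of the repository; it matches the placeholder names as whole tokens, with trailing slashes dropped so that `/BIPATH/part-0` is still caught.

```python
import re

# Placeholder names from the table above (trailing slashes dropped for matching).
PLACEHOLDERS = ["/PATH", "/INPUT_DATA_PATH", "ChronosModelPath", "/BIPATH", "/IBG_PATH"]

# The lookarounds avoid matching inside longer names such as /data/PATH_OK.
_PATTERN = re.compile(
    "|".join(rf"(?<![\w/]){re.escape(p)}(?!\w)" for p in PLACEHOLDERS)
)


def find_placeholders(text):
    """Return (line_number, placeholder) for every unreplaced placeholder."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        hits.extend((lineno, m.group(0)) for m in _PATTERN.finditer(line))
    return hits


script = (
    'df = spark.read.parquet("/PATH")\n'
    'model = "ChronosModelPath"\n'
    'root = "/data/real/PATH_OK"\n'
    'src = "/BIPATH/part-0"\n'
)
hits = find_placeholders(script)
```

An empty result suggests every placeholder in that file has been replaced; the line numbers point at the ones that remain.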

## SLURM / Batch Examples
Minimal array pattern (edit header directives as appropriate):
```bash
#!/bin/bash
#SBATCH -A <ACCOUNT>
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -t 02:00:00
#SBATCH -N 1
#SBATCH --gpus-per-node=4
#SBATCH --ntasks-per-node=4
#SBATCH -J netburst-pretrain

module load gpu cuda/12.2  # if required on your site

torchrun --nproc_per_node=4 NetBurst.py main /ABS/PARQUET_ROOT --epochs 10 --save_dir /ABS/MODEL_OUT
```

## Data Schema Expectations
Input parquet tables should contain at least:
- `ip` (string)
- `source_file` (string) – logical grouping or original capture identity
- `time_idx` or implicit ordering (some loaders rely on rows already being sorted in time)
- `inbound_bytes`, `outbound_bytes` OR IEI-equivalent token/value arrays depending on modality

Scripts add derived columns (`subnet`, hashed partitions, filtered arrays, token bins). Adjust selectors if your schema differs.
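
A quick pre-flight check of column names against the expectations above might look like the sketch below. The byte-count column group comes from the list above; column names for the IEI modality are not specified in this README, so they are left as a parameter. `schema_problems` is a hypothetical helper.

```python
def schema_problems(columns, value_groups=(("inbound_bytes", "outbound_bytes"),)):
    """Return (errors, warnings) for a list of column names.

    `value_groups` lists alternative sets of value columns; one complete
    group is enough. Pass IEI column names here when checking that modality.
    """
    cols = set(columns)
    errors = [f"missing column: {c}" for c in ("ip", "source_file") if c not in cols]
    if not any(cols.issuperset(group) for group in value_groups):
        errors.append("no complete value column group found")
    warnings = []
    if "time_idx" not in cols:
        warnings.append("no time_idx: rows must already be in time order")
    return errors, warnings


ok = schema_problems(
    ["ip", "source_file", "time_idx", "inbound_bytes", "outbound_bytes"]
)
bad = schema_problems(["ip", "inbound_bytes"])
```

With `pyarrow` installed, the column names for a file can be obtained without loading data, for example from `pyarrow.parquet.read_schema(path).names`.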

## Reproducibility Tips
- Seed-deterministic splits in `main()` rely on Python `random.seed(42)`; extend seeding to `numpy` and `torch` if stricter determinism is needed.
- Keep `boundaries.pkl` together with the checkpoint and tokenizer it was produced with, so bin semantics stay stable between training and inference.
- Track the exact commit hash of the Chronos base model and of this repository for artifact reproducibility.
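
One way to extend seeding beyond Python's `random` module is sketched below. `seed_everything` is a hypothetical helper, not a function in this repository; it seeds NumPy and PyTorch only when they are importable, so it also runs in a minimal environment.

```python
import random


def seed_everything(seed=42):
    """Seed Python, plus NumPy and PyTorch when they are installed.

    Returns the names of the libraries that were seeded.
    """
    seeded = ["random"]
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
        seeded.append("numpy")
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
        seeded.append("torch")
    except ImportError:
        pass
    return seeded


seed_everything(42)
first = random.random()
seed_everything(42)
second = random.random()  # same value as `first` after reseeding
```

Seeding alone does not make GPU kernels deterministic; it only fixes the random streams, which is usually enough for reproducible data splits.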

## Anonymization Statement
All user-specific paths, institution identifiers, and usernames were removed or replaced by neutral placeholders. If you discover residual identifiers, replace them with a neutral token before distribution.