# NetBurst Preprocessing Pipeline (Anonymized)

This folder contains scripts that transform parquet files derived from raw network captures (or from intermediate packet extractions) into modeling-ready time-series datasets for the NetBurst training and inference workflows.

All absolute paths have been replaced with placeholders (e.g. `/PATH`) for anonymized artifact review. Replace them with your actual storage roots before running any script.

## Summary of Scripts

| Script | Purpose | Key Arguments |
|--------|---------|---------------|
| `PcapToDf_multi_node.py` | Converts PCAP files to parquet across multiple nodes (MPI plus threads), extracting selected packet and flow fields with `tshark`. | `pcap_dir`, `output_dir` |
| `PCAPtoDF_multi_node10.slurm` | SLURM wrapper invoking the above conversion script (edit to add scheduler directives + real paths). | (Edit script) |
| `BITimeseries.py` | Builds fixed-interval (e.g., 1s) inbound/outbound byte-count sequences from aggregated parquet. | `--bin_ms`, `--output_dir`, `--threshold`, `--finetuning` |
| `SparseTimeSeries.py` | Same as the byte-count pipeline in `BITimeseries.py`, but prunes to a maximum number of `(ip, source_file)` keys for compact datasets. | `--max_examples` plus `BITimeseries.py` args |
| `IBGTimeSeries.py` | Generates inter-event interval (IEI) sequences, which can optionally be converted to milliseconds. | `--bin_ms`, `--output_dir`, `--threshold`, `--finetuning` |
| `FanoFactorPlotAllDatasets.py` | Computes per-series Fano factor, CCDFs, autocorrelation metrics, and summary plots. | `--config`, `--out_dir`, `--xlog`, plus optional filters |
| `FanoFactorAnalysis*.slurm` | SLURM wrappers to run specific Fano factor analyses for IP / Service / Subnet groupings. | (Edit script) |
| `BI_1s_threshold_100.slurm`, `IBG_1s_threshold_100.slurm` | Example SLURM job scripts for 1s bin processing at a 100-byte threshold. | (Edit script) |
| `Convert*ToParquet.py` | Utility converters for public benchmark time-series datasets (Electricity, ETT, Exchange, Taxi, Weather). | Dataset-specific flags |
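
As a rough illustration of the fixed-interval binning that the `BITimeseries.py` row describes, the sketch below bins packets into inbound and outbound byte counts per interval. It is not the script's actual implementation: the function names `bin_bytes` and `passes_threshold`, the `(timestamp_ms, n_bytes, is_inbound)` tuple layout, and the reading of the threshold as a minimum total byte count per series are all assumptions made for this example.

```python
def bin_bytes(packets, bin_ms=1000):
    """Bin packets into per-interval inbound/outbound byte counts.

    packets: iterable of (timestamp_ms, n_bytes, is_inbound) tuples with
    integer millisecond timestamps (hypothetical layout for this sketch).
    """
    packets = list(packets)
    if not packets:
        return [], []
    # Bins are measured from the first packet, so the series starts at bin 0.
    t0 = min(t for t, _, _ in packets)
    n_bins = max((t - t0) // bin_ms for t, _, _ in packets) + 1
    inbound = [0] * n_bins
    outbound = [0] * n_bins
    for t, n_bytes, is_inbound in packets:
        b = (t - t0) // bin_ms
        (inbound if is_inbound else outbound)[b] += n_bytes
    return inbound, outbound


def passes_threshold(inbound, outbound, threshold):
    """Assumed semantics: keep a series only if its total bytes reach the threshold."""
    return sum(inbound) + sum(outbound) >= threshold


demo = [(0, 100, True), (200, 50, False), (1500, 70, True), (3100, 30, True)]
demo_in, demo_out = bin_bytes(demo, bin_ms=1000)
# demo_in  -> [100, 70, 0, 30]
# demo_out -> [50, 0, 0, 0]
```

Empty bins are kept as zeros so that every series has a regular time axis, which fixed-interval models generally expect.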

## Common Data Assumptions

Upstream parquet (the input to `BITimeseries.py`, `IBGTimeSeries.py`, etc.) should include the following columns:
- `ip`
- `source_file` (capture or session grouping)
- `service_port` (optional, used for IP+Service selection)
- Per-bin counters or event timestamps, depending on the pipeline

Scripts derive:
- `/24` subnet (`subnet`)
- Partition / hash assignments for distributed execution (`ip_hash`, `subnet_hash`)
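
The derived columns can be sketched in plain Python as follows. This is a minimal illustration, not the scripts' code: it assumes IPv4 addresses, and the helper names (`subnet24`, `stable_hash`, `derive_columns`) and the choice of CRC32 as the hash are hypothetical. A deterministic hash is used because Python's built-in `hash()` for strings is salted per process, so it would not give consistent assignments across workers.

```python
import ipaddress
import zlib


def subnet24(ip):
    """Return the /24 network containing an IPv4 address, e.g. '10.1.2.0/24'."""
    # strict=False lets us pass a host address and have the host bits masked off.
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


def stable_hash(key):
    """Deterministic 32-bit hash (CRC32), identical in every process."""
    return zlib.crc32(key.encode("utf-8"))


def derive_columns(ip):
    """Build the derived fields for one record (column names follow the README)."""
    subnet = subnet24(ip)
    return {
        "ip": ip,
        "subnet": subnet,
        "ip_hash": stable_hash(ip),
        "subnet_hash": stable_hash(subnet),
    }


row = derive_columns("10.1.2.77")
# row["subnet"] -> '10.1.2.0/24'
```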

## Distributed Execution Model

Environment variables set by SLURM / MPI:
- `SLURM_PROCID` → task rank used to shard IP / subnet groups
- `SLURM_NTASKS` → total number of tasks; used as the modulus when hashing IPs or subnets to workers
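
A minimal sketch of this sharding scheme, assuming each key (an IP or subnet string) is hashed and assigned to the task whose rank equals the hash modulo the task count. The function names and the use of CRC32 are assumptions for illustration; the scripts may hash differently. Outside of SLURM the sketch falls back to a single task with rank 0.

```python
import os
import zlib


def task_layout(env=None):
    """Read (rank, ntasks) from SLURM variables, defaulting to a single task."""
    env = os.environ if env is None else env
    rank = int(env.get("SLURM_PROCID", 0))
    ntasks = int(env.get("SLURM_NTASKS", 1))
    return rank, ntasks


def owned_by(key, rank, ntasks):
    """True if this task is responsible for the key under modulo hashing."""
    # CRC32 is deterministic across processes, unlike Python's salted hash().
    return zlib.crc32(key.encode("utf-8")) % ntasks == rank


def shard(keys, rank, ntasks):
    """Keep only the keys assigned to this task."""
    return [k for k in keys if owned_by(k, rank, ntasks)]


rank, ntasks = task_layout()
my_ips = shard(["10.0.0.1", "10.0.0.2", "10.0.1.7", "172.16.3.9"], rank, ntasks)
```

Because every task applies the same deterministic rule, each key is processed by exactly one task without any coordination between workers.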

Each script builds a Spark session with high memory caps; the values in the scripts are placeholders, so tune them to your cluster's limits.
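
One way to keep these settings in a single place is sketched below. The helper names `spark_conf` and `build_session` and the example numbers are hypothetical, and the scripts may configure Spark differently; the configuration keys themselves (`spark.driver.memory`, `spark.executor.memory`, `spark.sql.shuffle.partitions`) are standard Spark properties.

```python
def spark_conf(driver_mem_gb, executor_mem_gb, shuffle_partitions):
    """Return Spark properties as a plain dict so they are easy to review and log."""
    return {
        "spark.driver.memory": f"{driver_mem_gb}g",
        "spark.executor.memory": f"{executor_mem_gb}g",
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
    }


def build_session(app_name, conf):
    """Create a SparkSession from a dict of properties (requires PySpark)."""
    # Imported lazily so spark_conf() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Example values only; choose numbers that fit your node memory and core count.
conf = spark_conf(driver_mem_gb=8, executor_mem_gb=16, shuffle_partitions=64)
```

Calling `build_session("netburst-preprocess", conf)` would then start the session with these limits; the application name here is also just a placeholder.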

## Example Usage

1. Convert PCAPs to parquet:
```bash
srun python PcapToDf_multi_node.py /ABS/RAW_PCAP_DIR /ABS/PCAP_PARQUET_OUT
```

2. Build 1-second byte count sequences with thresholding:
```bash
srun python BITimeseries.py --bin_ms 1000 --threshold 100 --output_dir /ABS/BYTE_TS_OUT
```

3. Sparse subset for experimentation:
```bash
srun python SparseTimeSeries.py --bin_ms 1000 --threshold 100 --max_examples 250000 --output_dir /ABS/SPARSE_TS_OUT
```

4. IEI sequences:
```bash
srun python IBGTimeSeries.py --bin_ms 1000 --threshold 100 --output_dir /ABS/IEI_TS_OUT
```

5. Fano factor & CCDF analysis:
```bash
srun python FanoFactorPlotAllDatasets.py --config datasets.yaml --out_dir /ABS/FANO_RES --xlog
```
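
For readers unfamiliar with the statistics in step 5, the sketch below computes a Fano factor (variance divided by mean) and an empirical CCDF for a single count series. It is an illustration under stated assumptions, not the code in `FanoFactorPlotAllDatasets.py`: the function names are hypothetical, population variance is used, and the script may window or normalize the series differently.

```python
from statistics import fmean, pvariance


def fano_factor(counts):
    """Variance-to-mean ratio of a count series; NaN if empty or all zeros."""
    counts = list(counts)
    if not counts:
        return float("nan")
    mean = fmean(counts)
    if mean == 0:
        return float("nan")
    # Population variance; a sample-variance variant would use statistics.variance.
    return pvariance(counts, mu=mean) / mean


def ccdf(values):
    """Empirical CCDF as (x, P[X > x]) pairs, one per distinct value, ascending."""
    values = sorted(values)
    n = len(values)
    points = []
    for i, x in enumerate(values):
        # Emit once per distinct value, at its last occurrence in sorted order.
        if i == n - 1 or values[i + 1] != x:
            points.append((x, (n - i - 1) / n))
    return points


steady = fano_factor([2, 2, 2, 2])   # 0.0: no variability
bursty = fano_factor([0, 4, 0, 4])   # 2.0: variance 4 over mean 2
tail = ccdf([1, 1, 2, 4])            # [(1, 0.5), (2, 0.25), (4, 0.0)]
```

A Fano factor of 1 corresponds to Poisson-like counts, so values well above 1 indicate burstier traffic than a Poisson process with the same mean.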

## Performance Tips
- Ensure parquet inputs are splittable and avoid an excessive number of small files (coalesce upstream).
- Increase Spark shuffle partitions only in proportion to cluster size.
- For extremely large inputs, filter out irrelevant IP ranges as early as possible.

## Anonymization Guidance
All path literals are placeholders. Before any public release, re-scan the scripts for stray absolute paths or organization identifiers.
