## System architecture and module breakdown
- Overall data flow
  - Inputs: Enterprise graphs (hosts as nodes, aggregated flows as edges) with temporal windows; features include flow counts/bytes/packets, inter-arrival stats, protocol mixes, degrees; simulation parameters (topology constraints, routing tables, QoS policies).
  - GNN encoder (GATv2): Computes node embeddings, edge logits, and attention explanation masks.
  - Simulation integration layer: Two options: (a) in-the-loop micro-sim via ns3-gym/ns3-ai; (b) offline simulation cache plus a learned surrogate for differentiable, low-latency feedback.
  - Explainer heads: Attention-based explanations by design; PGExplainer for faster test-time explanations; GNNExplainer for SOTA comparison.
  - Outputs: Attack/benign predictions; explanation masks; simulator KPIs (latency, drops, throughput) for training feedback; visualization artifacts.

- Data loader and preprocessing (PyTorch Geometric)
  - Interface:
    - class CiscoDataset(torch.utils.data.Dataset):
      - __init__(root, split_file, feature_cfg, label_mode, window_size_s=300, step_s=60)
      - __getitem__(idx) -> Data: attributes: x [N, F], edge_index [2, E], edge_attr [E, Fe], y_node [N] or y_edge [E], graph_id, time_window.
      - __len__()
    - Outputs normalized features (z-score per enterprise), time encodings (sin/cos hour-of-day), and masks for train/val/test by enterprise (14/4/4).
  - Feature engineering:
    - Node features: degree (in/out), clustering coeff (approx), protocol distribution histogram over window, total bytes in/out, failed conn ratio, temporal encodings.
    - Edge features: bytes, packets, inter-arrival mean/std/skew, duration, flow count, ratio of SYN without ACK, application port bins.
    - Imputation: per-enterprise medians; normalization: standardize per-feature; clip outliers at 99th percentile.
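
The imputation, clipping, and standardization steps, plus the sin/cos hour-of-day encoding, can be sketched on a single feature column as follows. This is a minimal illustration under assumptions: the function names are hypothetical, missing values are represented as `None`, timestamps are UNIX seconds in UTC, and clipping is applied before standardization (the list above does not fix the order).

```python
import math
import statistics


def hour_of_day_encoding(ts_seconds):
    """Sin/cos encoding of the time of day for a UNIX timestamp (UTC assumed)."""
    frac = (ts_seconds % 86400) / 86400.0
    angle = 2.0 * math.pi * frac
    return math.sin(angle), math.cos(angle)


def percentile(sorted_vals, q):
    """Linear-interpolation percentile on a pre-sorted list, q in [0, 100]."""
    if len(sorted_vals) == 1:
        return sorted_vals[0]
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def preprocess_column(values, clip_q=99.0):
    """Median-impute None entries, clip at the clip_q percentile, then z-score."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    filled = [med if v is None else v for v in values]
    cap = percentile(sorted(filled), clip_q)
    clipped = [min(v, cap) for v in filled]
    mean = statistics.fmean(clipped)
    std = statistics.pstdev(clipped) or 1.0  # guard against constant columns
    return [(v - mean) / std for v in clipped]
```

In the pipeline described above, `preprocess_column` would be applied per feature and per enterprise, with the median, cap, mean, and std computed on that enterprise's data only.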

- GNN encoder (GATv2 backbone)
  - Interface:
    - class GATv2IDS(nn.Module):
      - __init__(in_dim_node, in_dim_edge, hidden=128, layers=3, heads=4, dropout=0.4, use_edge_attr=True, attn_explain_heads=2)
      - forward(data: Data) -> dict with:
        - node_logits: [N, C]
        - edge_logits: [E, C_edge] (optional)
        - node_emb: [N, H]
        - edge_attn: [E] or [E, attn_heads] normalized importance
        - node_attn: [N] optional attention importance
  - Design:
    - Use GATv2Conv with edge attributes via edge-conditioned attention (augment attention MLP with edge features) or concatenate edge feature transforms.
    - Include dedicated “explanation heads” (linear + sigmoid) over attention coefficients to produce sparse masks optimized by fidelity and sparsity losses.

- Simulation integration
  - Option (a) In-the-loop micro-sim per batch
    - IPC via ns3-gym or ns3-ai. Latency-aware micro-sim restricted to subgraphs sampled by top-k attention edges/nodes (k configurable).
    - Interface:
      - class NS3Client:
        - run_scenario(scenario_spec) -> sim_report
          - scenario_spec:
            - topology: list of nodes, edges, link params (bw, delay, queue)
            - traffic: app configs per edge (OnOff/BulkSend params)
            - routing/QoS: routing tables or algorithm flags; QueueDisc (CoDel, FQ-CoDel)
            - perturbations: list of counterfactuals (link down, QoS change, bandwidth throttling)
            - duration, seed
          - sim_report:
            - KPIs per edge/node: avg latency, jitter, drops, throughput, queue occupancy
            - counterfactual KPI deltas per perturbation
      - Supports caching by hashing scenario_spec to a persistent store (e.g., LevelDB/SQLite).
  - Option (b) Offline simulation cache + learned surrogate (recommended default)
    - Pre-generate a bank of scenarios per enterprise graph: baseline + counterfactuals (e.g., link failure, 2x delay, QoS flip). Run ns-3 offline to collect KPIs. Train a surrogate MLP g_phi on local graph descriptors + scenario descriptor to predict KPI vectors. Use g_phi during training for differentiable L_sim and L_fid; refresh cache periodically.
    - Interface:
      - class SimCache:
        - lookup(graph_id, subgraph_id, scenario_id) -> sim_report or None
        - store(graph_id, subgraph_id, scenario_id, sim_report)
      - class SurrogateKPI(nn.Module):
        - forward(graph_feats, scenario_feats) -> kpi_pred [E_local, K]
      - class ScenarioSampler:
        - sample_counterfactuals(attn_mask, budget) -> list of scenario_specs
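
One possible shape for the cache is sketched below, assuming SQLite as the persistent store and a SHA-256 hash of the canonical JSON form of `scenario_spec` as the `scenario_id`. The table name, column layout, and the `scenario_hash` helper are illustrative assumptions, not part of the specified interface.

```python
import hashlib
import json
import sqlite3


def scenario_hash(scenario_spec):
    """Stable ID: canonical JSON (sorted keys, fixed separators) hashed with SHA-256."""
    canonical = json.dumps(scenario_spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SimCache:
    """Minimal SQLite-backed cache keyed by (graph_id, subgraph_id, scenario_id)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS sim_reports ("
            "graph_id TEXT, subgraph_id TEXT, scenario_id TEXT, report TEXT, "
            "PRIMARY KEY (graph_id, subgraph_id, scenario_id))"
        )

    def lookup(self, graph_id, subgraph_id, scenario_id):
        """Return the cached sim_report dict, or None on a miss."""
        row = self.conn.execute(
            "SELECT report FROM sim_reports "
            "WHERE graph_id=? AND subgraph_id=? AND scenario_id=?",
            (str(graph_id), str(subgraph_id), str(scenario_id)),
        ).fetchone()
        return json.loads(row[0]) if row else None

    def store(self, graph_id, subgraph_id, scenario_id, sim_report):
        """Insert or overwrite the report for this key."""
        self.conn.execute(
            "INSERT OR REPLACE INTO sim_reports VALUES (?, ?, ?, ?)",
            (str(graph_id), str(subgraph_id), str(scenario_id), json.dumps(sim_report)),
        )
        self.conn.commit()
```

Sorting keys before hashing means two specs that differ only in dictionary key order map to the same cache entry; list order (for example, the order of perturbations) still matters.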

- Simulation feedback layer
  - Role: Map attention masks to scenario proposals; acquire KPI deltas from simulator/surrogate; compute L_sim and L_fid; optionally update a replay buffer of informative scenarios.
  - Interface:
    - class SimFeedback:
      - propose_scenarios(data, edge_attn, num_cf=3, budgets) -> scenario_specs
      - eval_scenarios(scenario_specs) -> kpi_deltas
      - compute_losses(edge_attn, node_emb, kpi_deltas) -> {L_fid, L_sim, L_sp}
  - KPI set:
    - For edges: latency (ms), drop_rate, throughput (Mbps), jitter (ms).
    - For nodes: queue occupancy, CPU proxy (optional), aggregate throughput.
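
The first step of the feedback loop, mapping attention to scenario proposals, might look like the sketch below. It is a simplified stand-in for `propose_scenarios`: it takes a plain edge list instead of a `Data` object, omits `budgets`, and only emits link-down counterfactuals. The dictionary keys mirror the `scenario_spec` fields listed earlier, but their exact spelling is an assumption.

```python
import heapq


def propose_scenarios(edges, edge_attn, num_cf=3, duration_s=0.2, seed=0):
    """Turn the num_cf highest-attention edges into link-down counterfactual specs."""
    if len(edges) != len(edge_attn):
        raise ValueError("edges and edge_attn must have the same length")
    # Indices of the most-attended edges, highest attention first.
    top = heapq.nlargest(num_cf, range(len(edges)), key=lambda i: edge_attn[i])
    return [
        {
            "perturbations": [{"type": "link_down", "edge": list(edges[i])}],
            "duration": duration_s,
            "seed": seed,
        }
        for i in top
    ]
```

Each returned spec perturbs exactly one edge, so the KPI delta measured for that scenario can be attributed to that edge when building fidelity targets.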

- Explainer baselines and black-box IDS baselines
  - Explainers:
    - PGExplainer (fast, parameterized): train on embeddings; O(|E|) per instance.
    - GNNExplainer: slower, used for SOTA comparison on sampled cases.
  - IDS baselines:
    - RF and XGBoost on flow-level aggregated features (same as edge features aggregated per node/host).
    - 1D-Transformer on time-series of per-host counters (bytes, flows, failed ratio) with sliding windows.

- Visualization toolkit
  - API:
    - visualize_explanations(graph, node_scores, edge_scores, layout="kamada-kawai", overlay_sim_kpis=True) -> PNG/HTML
    - visualize_what_if(graph, scenario_spec, before_after_kpis, attention_overlay=True)
  - Outputs reproducible artifacts (SVG/HTML) for case studies.

## Data processing pipeline
- Cisco preprocessing to PyG
  - Source: UCI ML Repo: “Cisco Secure Workload Networks of Computing Hosts” (22 disjoint directed temporal enterprise graphs; CC BY 4.0). Also SNAP mirror. Paper: Omid Madani et al., IWSPA 2022.
  - Steps:
    1) Parse per-enterprise host and flow tables.
    2) Time windowing: overlapping windows by default (300 s window, 60 s stride); set the stride equal to the window length for non-overlapping windows.
    3) Build directed graphs per window: nodes=hosts; edges=aggregated flows with attributes.
    4) Compute features (as above); store as PyG Data objects.
    5) Split by enterprise: 14 train / 4 val / 4 test. Within enterprise, maintain chronological order for windows to prevent leakage. Save split JSON with enterprise IDs.
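
Steps 2 and 3 can be sketched as below. The flow record layout `(ts, src, dst, bytes, packets)` is a hypothetical stand-in for whatever the parsed per-enterprise tables provide, and only three edge attributes are aggregated to keep the example short.

```python
from collections import defaultdict


def window_starts(t_min, t_max, window_s=300, step_s=60):
    """Start times of sliding windows that fit entirely inside [t_min, t_max]."""
    starts = []
    t = t_min
    while t + window_s <= t_max:
        starts.append(t)
        t += step_s
    return starts


def aggregate_window(flows, start, window_s=300):
    """Aggregate flows with start <= ts < start + window_s into directed edges.

    flows: iterable of (ts, src, dst, n_bytes, n_packets) tuples (assumed layout).
    Returns {(src, dst): {"bytes": ..., "packets": ..., "flow_count": ...}}.
    """
    edges = defaultdict(lambda: {"bytes": 0, "packets": 0, "flow_count": 0})
    for ts, src, dst, n_bytes, n_packets in flows:
        if start <= ts < start + window_s:
            e = edges[(src, dst)]
            e["bytes"] += n_bytes
            e["packets"] += n_packets
            e["flow_count"] += 1
    return dict(edges)
```

With the defaults, consecutive windows share 240 s of traffic, which is why windows from one enterprise must stay on the same side of the train/val/test split.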

- Labeling approaches
  - Option S (Simulation-based labels; recommended for Cisco)
    - Malicious node injection generator:
      - Behaviors: port scan (high out-degree, short IAT, low bytes), brute-force (frequent failed auth proxies if available), data exfiltration (large sustained outbound throughput to few destinations), lateral movement (burst edges to high-privilege nodes), beaconing (periodic small flows).
      - Budgets: flag 1–5% of nodes and 0.5–2% of edges per window as malicious; keep the class imbalance realistic.
      - Feature mimicry: match benign marginals (degree/bytes) ± small epsilon to avoid trivial detection; perturb second-order stats (IAT variance, SYN/ACK ratios).
    - Label assignment: injected nodes=malicious (node-level); edges generated by malicious behaviors=malicious (edge-level). Also generate benign counterfactuals by injecting load spikes without malicious signatures to calibrate specificity.
    - Sampling: per window sample at most B behaviors (default B=2); ensure diversity across enterprises; fix seeds for reproducibility.
  - Option L (Labeled supplement)
    - Integrate CIC-IDS 2018:
      - Graphification: nodes=IPs; edges=aggregated flows per window; edge labels from CIC ground truth; derive node labels by OR over incident edges within window.
      - Splits: mirror 14/4/4 by day/host groups where possible; or 60/20/20 split by time.
      - Use for fine-tuning/evaluation to achieve the >90% recall target and SOTA comparison.
  - Avoid leakage: ensure enterprise-level splits for Cisco; ensure future windows not used to label past.
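
For Option L, deriving node labels by OR over incident edges can be sketched as follows, assuming `edge_index` is given as two parallel lists of source and destination node indices (the same layout as a PyG `edge_index`, but as plain Python lists here).

```python
def node_labels_from_edges(num_nodes, edge_index, y_edge):
    """Label a node 1 if any incident edge (as source or destination) is malicious."""
    src, dst = edge_index
    y_node = [0] * num_nodes
    for s, d, y in zip(src, dst, y_edge):
        if y:
            y_node[s] = 1
            y_node[d] = 1
    return y_node
```

Note that a plain OR marks both endpoints of a malicious flow, so targeted hosts are labeled alongside attacking hosts; restricting the OR to source nodes is a possible variant if only attackers should be positive.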

- Mapping Cisco features and simulation parameters to ns-3
  - Traffic:
    - Map edge flow rate bytes/s -> OnOffApplication DataRate.
    - IAT mean/std -> OnTime/OffTime distribution parameters (default: log-normal fit; clip within [1ms, 1s]).
    - Flow duration -> Application start/stop times within window.
    - Packet sizes if absent -> infer from bytes/packets (fallback to 1500B MTU averages).
  - Topology links:
    - Use PointToPoint links with bandwidth from enterprise policy if available; otherwise infer from observed 95th percentile throughput × 2 as capacity; link delay from RTT/2 if RTT estimable; else assign 1–10ms based on LAN/WAN tag.
    - Queue: Traffic Control with FQ-CoDel as default; map QoS policy flags to QueueDisc choice (CoDel, FQ-CoDel) and queue size.
  - Routing:
    - Static/GlobalRouting for default; OLSR as stand-in for dynamic routing if needed. Recompute tables on counterfactuals.
  - Counterfactual scenario generator:
    - Link failures: bring down the interfaces at both ends of each selected edge (e.g., schedule Ipv4::SetDown on each endpoint's interface).
    - Delay perturbations: multiply delay by {0.5, 2.0}.
    - Bandwidth scaling: multiply link capacity by {0.5, 2.0} (0.5× throttles the link; 2× relieves it).
    - QoS toggles: switch QueueDisc; adjust target/interval for CoDel.
    - Routing recomputation: call Ipv4GlobalRoutingHelper::RecomputeRoutingTables() after each topology change.
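
The traffic and capacity mappings above can be sketched as plain parameter computations, before anything is handed to ns-3. The field names (`bytes`, `packets`, `duration_s`, `iat_mean_s`, `iat_std_s`) and the output keys are assumptions, the log-normal fit uses moment matching, and the [1 ms, 1 s] clip is applied to the IAT mean as one reading of the rule above.

```python
import math


def lognormal_params(mean, std):
    """Moment-matched log-normal (mu, sigma) for a positive mean and non-negative std."""
    sigma2 = math.log(1.0 + (std / mean) ** 2)
    return math.log(mean) - sigma2 / 2.0, math.sqrt(sigma2)


def clip(v, lo, hi):
    return max(lo, min(hi, v))


def edge_to_onoff(edge, window_s=300.0):
    """Map aggregated edge features to OnOff-style traffic parameters."""
    duration = clip(edge["duration_s"], 1e-3, window_s)
    rate_bps = 8.0 * edge["bytes"] / duration
    # Infer packet size from bytes/packets; fall back to a 1500 B MTU-sized packet.
    if edge.get("packets"):
        pkt = min(1500.0, edge["bytes"] / edge["packets"])
    else:
        pkt = 1500.0
    iat_mean = clip(edge["iat_mean_s"], 1e-3, 1.0)
    mu, sigma = lognormal_params(iat_mean, edge["iat_std_s"])
    return {
        "data_rate_bps": rate_bps,
        "packet_size_bytes": pkt,
        "iat_lognormal_mu": mu,
        "iat_lognormal_sigma": sigma,
        "start_s": 0.0,
        "stop_s": duration,
    }


def infer_capacity_bps(throughput_samples_bps):
    """Capacity fallback: 2 x the 95th-percentile observed throughput (nearest rank)."""
    ordered = sorted(throughput_samples_bps)
    rank = max(1, (95 * len(ordered) + 99) // 100)  # integer ceil(0.95 * n)
    return 2.0 * ordered[rank - 1]
```

Keeping this mapping as pure functions makes it easy to unit-test independently of the simulator and to hash its output as part of a `scenario_spec`.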

## Training and evaluation
- Prediction tasks
  - Default: Node-level malicious host detection (binary). Auxiliary: edge-level malicious flow detection (binary).
  - Output heads: node_logits [N, 2]; edge_logits [E, 2].

- Losses and multi-objective
  - Overall objective:
    L = L_cls + λ_fid L_fid + λ_sim L_sim + λ_sp L_sp
  - L_cls:
    - For node-level: weighted cross-entropy on node_logits with class weights to counter imbalance; optional focal loss variant.
    - Add auxiliary edge-level loss if enabled (weight α_edge default 0.3).
  - L_fid (explanation fidelity vs simulator counterfactuals):
    - Let a_e be learned edge importance (from attention explanation head), normalized in [0,1].
    - For each sampled counterfactual c that perturbs edge e (e.g., link down), obtain ΔKPI_e^c = ||KPI^c - KPI^base||_2 on edge/node-local KPIs.
    - Define target importance t_e = normalize(Avg_c ΔKPI_e^c) across edges in the subgraph.
    - Use correlation/ranking loss to align a_e with t_e:
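      - L_rank = 1 - SpearmanCorr(a, t) over subgraph edges (stop-gradient on t). Ranks are not differentiable, so backpropagating through this term requires a soft-rank relaxation of a.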
      - Or pairwise hinge: sum_{(i,j)} max(0, margin - (a_i - a_j)) for pairs where t_i > t_j.
    - Set L_fid = L_rank.
  - L_sim (consistency with sim KPIs):
    - A KPI decoder h_θ maps node/edge embeddings to predicted KPIs: k̂_e = h_θ(z_e).
    - If using simulator: minimize MSE to measured KPIs: L_sim = Σ_e ||k̂_e - KPI_e||_2^2.
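    - If surrogate g_φ is used: also include a teacher-student distillation term between g_φ and h_θ on the scenario batch to stabilize training (KL if KPI outputs are modeled as distributions; otherwise a regression loss such as MSE).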
  - L_sp (sparsity for explanations):
    - L1 on attention masks: λ_L1 Σ_e |a_e| + entropy penalty encouraging peaked distributions.
    - Constraint projection: top-k sparsification during training (k schedule below).
  - Weighting defaults:
    - λ_fid=1.0, λ_sim=0.5 (if accurate surrogate), λ_sp=0.1; tune on validation grid: {0.5,1.0,2.0} for λ_fid; {0.1,0.5,1.0} for λ_sim; {0.05,0.1,0.2} for λ_sp.
  - Schedules:
    - Curriculum: warm-up 5 epochs with only L_cls; then introduce L_fid and L_sim linearly over next 5 epochs.
    - Top-k schedule for attention sparsity: k from 20% edges down to 5% over 30 epochs.
    - Chaos engine/intensity (Proposal 1): start with fixed perturbations; move to adaptive adversarial chaos after epoch 20.
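
A numeric sketch of the pairwise-hinge form of L_fid and of the two schedules follows. It works on plain Python lists, so it shows the arithmetic only and is not differentiable; a training implementation would express the same operations on tensors. Min-max scaling for `normalize`, the margin value, and zero-based epoch counting are assumptions.

```python
def normalize_targets(delta_kpi):
    """Min-max scale averaged KPI deltas to [0, 1]; constant input maps to zeros."""
    lo, hi = min(delta_kpi), max(delta_kpi)
    if hi == lo:
        return [0.0] * len(delta_kpi)
    return [(d - lo) / (hi - lo) for d in delta_kpi]


def pairwise_hinge(a, t, margin=0.1):
    """Sum over pairs with t_i > t_j of max(0, margin - (a_i - a_j))."""
    n = len(a)
    return sum(
        max(0.0, margin - (a[i] - a[j]))
        for i in range(n)
        for j in range(n)
        if t[i] > t[j]
    )


def ramp_weight(epoch, warmup=5, ramp=5):
    """0 for the first `warmup` epochs, then a linear ramp reaching 1 after `ramp` more."""
    return min(1.0, max(0.0, (epoch - warmup + 1) / ramp))


def topk_fraction(epoch, start=0.20, end=0.05, epochs=30):
    """Linearly anneal the fraction of edges kept from `start` to `end`."""
    frac = min(1.0, epoch / epochs)
    return start + (end - start) * frac
```

At epoch `e`, the effective weights would be `ramp_weight(e) * λ_fid` and `ramp_weight(e) * λ_sim`, while `topk_fraction(e)` sets the share of edges retained by the top-k projection.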

- NS-3 integration patterns
  - Option (a) In-the-loop:
    - Per batch, select a local subgraph around top-K suspicious nodes/edges (K=50 edges default) and run a 200ms micro-sim with 3 counterfactuals. Cache results keyed by (graph_id, subgraph hash, scenario hash).
    - Use ns3-ai shared memory or ns3-gym sockets; batch size limited by sim latency (batch_size default 1–2 graphs).
  - Option (b) Offline + surrogate (recommended default):
    - Precompute ~2k scenarios per enterprise graph; train g_φ to predict KPI vectors in R^K, targeting a relative MAE < 5% on held-out scenarios.
    - During training, query g_φ for fast KPI deltas; periodically validate with real ns-3 on 10% sampled scenarios to bound drift.
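
The periodic drift check for Option (b) could be organized as in the sketch below: draw a reproducible 10% sample of scenario IDs, re-run those in ns-3, and compare against the surrogate. Reading "MAE < 5%" as a relative error is an assumption, as are the function names.

```python
import random


def sample_for_validation(scenario_ids, fraction=0.10, seed=42):
    """Reproducibly sample a fraction of scenario IDs for re-simulation in ns-3."""
    ids = list(scenario_ids)
    k = max(1, round(fraction * len(ids)))
    return random.Random(seed).sample(ids, k)


def relative_mae(pred, truth, eps=1e-9):
    """Mean of |pred - truth| / (|truth| + eps) over paired KPI values."""
    if len(pred) != len(truth) or not truth:
        raise ValueError("pred and truth must be non-empty and of equal length")
    return sum(abs(p - t) / (abs(t) + eps) for p, t in zip(pred, truth)) / len(truth)
```

If `relative_mae` on the sampled scenarios exceeds the 5% target, that is a signal to refresh the cache and retrain g_φ before continuing.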

- Hyperparameters and optimization
  - Architectures:
    - GATv2: layers {2,3,4}; hidden {128,256}; heads {4,8}; dropout {0.3,0.5}; LR {1e-3, 3e-3}; weight decay {1e-5, 1e-4}.
    - GraphSAGE (mean) and GIN baselines: layers {2,3}, hidden {128,256}, dropout {0.3,0.5}, LR {1e-3, 3e-3}.
  - Training:
    - Optimizer: AdamW; cosine LR schedule with warmup 5 epochs; early stopping on val AUC-ROC (patience 15).
    - Epochs: 100 default; batch size: 1 graph/window (accumulate gradients to effective batch of 4).
    - Seeds: 42, 123, 2025 (report mean±std).
  - Memory/performance tips:
    - Use PyG SparseTensor for large graphs; neighbor sampling for models that support it; pin_memory and CUDA graphs; float16 autocast for inference latency tests.

- Explainer runtime to meet <100 ms
  - Default explanations use attention-head masks (already computed). To ensure <100 ms:
    - Cache node/edge attention and top-k indices; store per window.
    - PGExplainer: pre-train explainer; runtime O(|E|) with small constant; restrict to 1-hop ego-nets around flagged nodes; target <100 ms by limiting edges to ≤500 per explanation.
    - GNNExplainer only on a small subset for SOTA comparison.
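
Restricting an explanation to the 1-hop ego-net of flagged nodes, capped at 500 edges, can be sketched as below. The cap is enforced by keeping the highest-attention edges, which is one reasonable choice rather than something the plan prescribes; `edge_index` is assumed to be two parallel lists of node indices.

```python
import heapq


def ego_net_edges(edge_index, edge_attn, flagged, max_edges=500):
    """Indices of edges touching a flagged node, keeping the top max_edges by attention."""
    src, dst = edge_index
    flagged = set(flagged)
    candidates = [
        i for i, (s, d) in enumerate(zip(src, dst)) if s in flagged or d in flagged
    ]
    # nlargest returns indices sorted by descending attention.
    return heapq.nlargest(max_edges, candidates, key=lambda i: edge_attn[i])
```

The returned indices can be cached per window next to the attention values, so repeated explanation requests for the same flagged host avoid recomputation.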

- Baseline IDS and performance gap metric
  - RF/XGBoost features: per-host and per-edge aggregates used by GNN; add temporal lags (t-1, t-2).
  - 1D-Transformer: input sequence length 60 (60s strides over 1 hour), channels {bytes_in, bytes_out, flows, fail_ratio}.
  - Train/val/test on the same splits. Performance gap metric:
    - Gap = 1 - (Recall_Hybrid / Recall_BlackBox)
    - Target: Gap ≤ 0.15 at operating point tuned for TPR≥0.9 on validation.
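
The gap metric is a one-line computation; the sketch below adds only an input check. The function name is illustrative.

```python
def performance_gap(recall_hybrid, recall_blackbox):
    """Gap = 1 - Recall_Hybrid / Recall_BlackBox."""
    if recall_blackbox <= 0:
        raise ValueError("black-box recall must be positive")
    return 1.0 - recall_hybrid / recall_blackbox
```

For example, a hybrid recall of 0.81 against a black-box recall of 0.90 gives a gap of about 0.10, inside the 0.15 target. A negative gap means the hybrid model recalls more than the black-box baseline.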

- Metrics and protocols
  - Baseline comparison:
    - Metrics: AUC-ROC, AUPRC, F1 (threshold tuned on val), Recall@FPR=1%.
    - Report per-enterprise and macro-averages.
  - Ablation (remove simulation feedback):
    - Metrics: fidelity via deletion/insertion (drop in prediction when removing top-k important edges vs random), Fidelity+ (mutual information proxy), sparsity (% edges used).
    - Also report KPI-alignment correlation between attention and ΔKPI ranks.
  - Scalability:
    - Graph sizes: 100, 1k, 5k, 10k nodes via subgraph sampling or synthetic R-MAT; measure inference latency (mean over 100 runs after 10 warm-up runs), calling torch.cuda.synchronize() before starting and before stopping the timer; report hardware (e.g., A100 40GB).
  - Adversarial robustness:
    - Malicious node injection budgets: ε_nodes ∈ {0.5%, 1%, 2%}, ε_edges ∈ {0.5%, 1%, 2%}; degree/feature mimicry constraints.
    - Report FPR at fixed Recall=0.9; robustness curves (AUC under increasing ε).
  - Case studies (≥3 Cisco enterprises):
    - Visualize attention overlays; run what-if ns-3 analyses: link failure, routing change, QoS toggle; show before/after KPIs; narrate causal chains.
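
Recall@FPR=1% can be computed by sweeping score thresholds from high to low and keeping the best recall whose false-positive rate stays within budget. The sketch below is a dependency-free version under the assumption that higher scores mean "more malicious" and labels are 0/1; tied scores are consumed together so each threshold is well defined.

```python
def recall_at_fpr(scores, labels, max_fpr=0.01):
    """Best recall over score thresholds whose false-positive rate is <= max_fpr."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(order):
        current = scores[order[i]]
        # Consume every item tied at this score before evaluating the threshold.
        while i < len(order) and scores[order[i]] == current:
            if labels[order[i]]:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / n_neg > max_fpr:
            break  # FPR only grows as the threshold is lowered
        best = max(best, tp / n_pos)
    return best
```

Per-enterprise values from this function can then be macro-averaged as described under "Baseline comparison".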

## Reproducibility and reporting
- Seeds and splits
  - Seeds: [42, 123, 2025]; fixed across training and evaluation; log RNG states.
  - splits/cisco_splits.json:
    - { "train": [enterprise_ids...], "val": [...], "test": [...] }
  - For CIC-IDS 2018, provide splits/cic_splits.json by day/time.

- JSON logging schema (per epoch and per evaluation)
  - train_epoch.jsonl entries:
    - { "epoch": int, "loss": float, "loss_cls": float, "loss_fid": float, "loss_sim": float, "loss_sp": float, "auc": float, "f1": float, "recall": float, "precision": float, "kpi_mae": float, "attn_sparsity": float, "lr": float, "seed": int, "graph_batch": [ids] }
  - eval.jsonl entries:
    - { "split": "val|test", "auc": float, "auprc": float, "f1": float, "recall@fpr1": float, "latency_ms": float, "explain_time_ms": float, "fidelity_del": float, "fidelity_ins": float, "kpialign_corr": float, "robust_auc": float, "compute_sec": float, "config": { ... } }
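
A small helper for writing `train_epoch.jsonl` is sketched below. It checks that a record carries every field from the training schema above before serializing it as one JSON object per line; the helper names are hypothetical.

```python
import json

# Field names taken from the train_epoch.jsonl schema above.
TRAIN_KEYS = (
    "epoch", "loss", "loss_cls", "loss_fid", "loss_sim", "loss_sp",
    "auc", "f1", "recall", "precision", "kpi_mae", "attn_sparsity",
    "lr", "seed", "graph_batch",
)


def to_jsonl_line(record, required=TRAIN_KEYS):
    """Validate required fields and serialize the record as a single JSON line."""
    missing = [k for k in required if k not in record]
    if missing:
        raise KeyError(f"missing fields: {missing}")
    return json.dumps(record, sort_keys=True)


def append_jsonl(path, record, required=TRAIN_KEYS):
    """Append one validated record to a JSONL file."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(to_jsonl_line(record, required) + "\n")
```

Passing a different `required` tuple lets the same helper validate `eval.jsonl` entries.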

- Artifact paths and versioning
  - data/processed/{cisco|cic}/
  - sims/cache.sqlite, sims/reports/
  - models/checkpoints/{timestamp}/
  - logs/{experiment_id}/
  - viz/{case_study_id}/
  - Use git tags and a VERSION file; log git commit in JSON.

- Tables and plots for paper
  - Main table: AUC/F1/Recall and performance gap vs black-box (RF/XGBoost/Transformer).
  - Ablation table: with/without L_fid, L_sim, chaos/curiosity loop; explanation metrics.
  - Scalability plot: latency vs nodes (log-scale x).
  - Robustness curves: FPR vs perturbation budget; AUC of robustness.
  - Case study visuals: attention overlays; KPI before/after counterfactuals; resource heatmaps (Proposal 2).

- Citations to include
  - Cisco dataset: “Cisco Secure Workload Networks of Computing Hosts,” UCI ML Repository; SNAP mirror; paper: Omid Madani et al., “A Dataset of Networks of Computing Hosts,” IWSPA 2022.
  - GNNs: GATv2 (How Attentive are Graph Attention Networks? arXiv:2105.14491); GraphSAGE (arXiv:1706.02216); GIN (arXiv:1810.00826).
  - ns-3 integration: ns3-gym (arXiv:1810.03943); ns3-ai (external ns-3 module, distributed separately from ns-3 mainline).
  - Explainers: GNNExplainer (arXiv:1903.03894); PGExplainer (arXiv:2011.04573).
  - IDS datasets: CIC-IDS 2017/2018; LANL Unified Host and Network Data Set (arXiv:1708.07518) as alternative.
  - Robustness: Nettack (arXiv:1805.07984), Metattack (arXiv:1902.08412).

## Recommended defaults summary
- Integration pattern: Offline simulation cache + surrogate (Option b) for training; use in-the-loop micro-sim ns-3 runs at validation time for calibration.
- Model: GATv2 with 3 layers, hidden 128, heads 4, dropout 0.4; attention explanation heads=2.
- Loss weights: λ_fid=1.0, λ_sim=0.5, λ_sp=0.1; curriculum with 5-epoch warmup.
- Labeling: Option S for Cisco; add CIC-IDS 2018 as supplementary evaluation and optional fine-tuning.
- Explanations: attention masks by default; PGExplainer for additional speed; GNNExplainer for SOTA comparison on sampled instances.
- Counterfactual budgets: 3 scenarios per batch (offline surrogate), top-5% edges by attention for subgraph micro-sims.

### Explicit module interfaces (concise)
- CiscoDataset.__getitem__ returns:
  - Data.x [N, F], Data.edge_index [2, E], Data.edge_attr [E, Fe], Data.y_node [N] (binary), Data.y_edge [E] (optional), Data.graph_id (int), Data.window_idx (int)
- GATv2IDS.forward(Data) -> dict:
  - node_logits