# OvA-LP Codebase — Configuration Reference (Semantics Only)

This document explains what each configuration key means. It intentionally avoids defaults, typical values, and runnable examples.

> Scope
> - Core models: LP-Softmax, OvA-LP
> - Baseline extensions (if present in your branch): PFPT, FMoE
> - Datasets referenced: CIFAR-10, CIFAR-100

---

## Top-Level Keys

- **`exp_name: str`** — Human-readable tag used in logs and result filenames.
- **`seed: int`** — Random seed controlling client sampling, noise injection, shuffling, etc.
- **`model_name: str`** — Selects the client-side learning rule.
  - `lp_softmax` — single linear head trained with cross-entropy on frozen encoder features.
  - `ova_lp` — one-vs-all linear probing with two training stages (see model block).
  - `pfpt` — probabilistic federated prompt tuning (baseline extension).
  - `fmoe` — sparse mixture-of-experts adapters (baseline extension).
- **`aggregator_name: str`** — Chooses the server-side combiner.
  - `fedavg` — sample-count-weighted averaging of client parameters.
  - `pfpt_agg` — probabilistic prompt aggregation (Hungarian matching + set updates).
  - `fmoe_agg` — averaging of trainable MoE/adapter parameters.
- **`data: {...}`** — Dataset and client partitioning.
- **`encoder: {...}`** — Backbone and feature handling.
- **`train: {...}`** — Federated loop and optimizer knobs.
- **`model: {...}`** — Model-specific fields.
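
The top-level shape can be sketched as a typed schema. This is an illustration only, not code from the repository: the names `ExperimentConfig` and `missing_top_level_keys` are hypothetical, the nested blocks are left as plain dictionaries, and no values are implied.

```python
from typing import Any, Dict, List, TypedDict


class ExperimentConfig(TypedDict):
    """Top-level shape only; nested blocks are left as plain dicts here."""
    exp_name: str          # tag used in logs and result filenames
    seed: int              # controls client sampling, noise injection, shuffling
    model_name: str        # "lp_softmax" | "ova_lp" | "pfpt" | "fmoe"
    aggregator_name: str   # "fedavg" | "pfpt_agg" | "fmoe_agg"
    data: Dict[str, Any]
    encoder: Dict[str, Any]
    train: Dict[str, Any]
    model: Dict[str, Any]


def missing_top_level_keys(cfg: Dict[str, Any]) -> List[str]:
    """Return the documented top-level keys that are absent from cfg, sorted."""
    return sorted(set(ExperimentConfig.__annotations__) - set(cfg))
```

A checker like this only confirms that the documented keys are present; it says nothing about which values the codebase accepts.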

---

## `data` Block

- **`dataset: "cifar10" | "cifar100"`** — Selects dataset loaders/transforms.
- **`num_clients: int`** — Total logical clients in the federation.
- **`batch_size: int`** — Mini-batch size used on each client.
- **`partition: str | dict`** — Specifies how the global dataset is split among clients.

  **String form** (shorthand):
  - `iid` — uniform-by-size split.
  - `dirichlet` — class-conditional Dirichlet allocation of samples.
  - `zipf` — Zipf-distributed client sizes.
  - `shard1`, `shard2` — one or two label shards per client.
  - `feature_skew` — feature-based clustering (KMeans on precomputed features); clusters are assigned to clients. Requires features to exist (see `encoder.precompute`).

  **Dict form** (explicit):
  - `{"type":"iid"}`
  - `{"type":"dirichlet","alpha":<float>}`
  - `{"type":"zipf","s":<float>}`
  - `{"type":"shard1"}` / `{"type":"shard2"}`
  - `{"type":"feature_skew"}`

- **`min_samples_per_client: int`** — Post-partition lower bound on samples per client; may trigger rebalancing.
- **`noise_ratio: float`** — Fraction of labels to corrupt.
- **`noise_mode: "symmetric" | "asymmetric"`** — Label-noise scheme; `asymmetric` flips within coarse groups on CIFAR-100.
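
To make the `dirichlet` partition concrete, the following is a minimal sketch of class-conditional Dirichlet allocation using only the standard library. It is not the repository's implementation: the function name `dirichlet_partition` is hypothetical, the Dirichlet draw is obtained from normalized Gamma samples, and `min_samples_per_client` rebalancing is deliberately left out.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def dirichlet_partition(labels: Sequence[int], num_clients: int,
                        alpha: float, seed: int) -> List[List[int]]:
    """Split sample indices among clients with one Dirichlet draw per class."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    clients: List[List[int]] = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # A Dirichlet(alpha, ..., alpha) sample is a vector of normalized Gamma draws.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(gammas)
        # Turn the proportions into cumulative cut points over this class's samples.
        cum, start = 0.0, 0
        for c, g in enumerate(gammas):
            cum += g / total
            if c == num_clients - 1:
                end = len(idxs)  # the last client takes the remainder
            else:
                end = min(len(idxs), int(cum * len(idxs)))
            clients[c].extend(idxs[start:end])
            start = end
    return clients
```

Smaller `alpha` concentrates each class on fewer clients, while larger `alpha` approaches an even split of every class. Some clients can end up with very few samples, which is the situation `min_samples_per_client` is described as guarding against.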

---

## `encoder` Block

- **`type: "vit" | "dinov2"`** — Encoder family.
- **`size: "tiny" | "small" | "base" | "large" | "huge" | "giant"`** — Model scale; determines the feature dimension used by linear heads.
- **`patch: int`** — Vision transformer patch size.
- **`from_pretrained: bool`** — Initialize from a checkpoint provided by `timm`.
- **`tune: bool`** — Fine-tune backbone weights (`True`) or freeze (`False`).
- **`precompute: bool`** — Compute and cache features once and train on those features. Required for feature-based models and for `partition: "feature_skew"`.
- **`use_cache: bool`** — Reuse saved feature tensors between runs.
- **`precompute_batch_size: int`** — Batch size used during the feature precompute pass.
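
The interaction between `precompute` and `use_cache` can be sketched as a load-or-compute helper. Everything here is an assumption made for illustration: the helper names, the cache file layout, the choice of fields that enter the cache key, and the use of `pickle`. The `compute_fn` argument stands in for the encoder forward pass.

```python
import hashlib
import json
import os
import pickle
from typing import Any, Callable, Dict


def feature_cache_path(encoder_cfg: Dict[str, Any], dataset: str, cache_dir: str) -> str:
    """Hypothetical cache filename derived from fields that change the features."""
    relevant = {k: encoder_cfg.get(k) for k in ("type", "size", "patch", "from_pretrained")}
    relevant["dataset"] = dataset
    digest = hashlib.sha1(json.dumps(relevant, sort_keys=True).encode("utf-8")).hexdigest()[:12]
    return os.path.join(cache_dir, f"features_{digest}.pkl")


def load_or_compute_features(encoder_cfg: Dict[str, Any], dataset: str,
                             cache_dir: str, compute_fn: Callable[[], Any]) -> Any:
    """Reuse cached features when use_cache is set; otherwise compute and save them."""
    path = feature_cache_path(encoder_cfg, dataset, cache_dir)
    if encoder_cfg.get("use_cache") and os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    features = compute_fn()  # stands in for running the frozen encoder once
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(features, f)
    return features
```

Keying the cache on the encoder fields and the dataset means that changing any of them produces a different file instead of silently reusing stale features.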

---

## `train` Block

- **`num_rounds: int`** — Number of communication rounds.
- **`active_client_ratio: float`** — Fraction of clients sampled per round.
- **`local_epochs: int`** — Number of local training epochs per selected client in a round.
- **`lr: float`** — Optimizer learning rate.
- **`weight_decay: float`** — Optimizer weight decay.
- **`log_every: int`** — Interval (in rounds) at which to log metrics.
- **`skip_if_exists: bool`** — If a matching results file already exists for the given configuration, reuse it instead of retraining.
- **`aux_lambda: float`** — Coefficient of the **KL load-balancing auxiliary loss** used by **FMoE** (unused by other models).
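
As an illustration of how `active_client_ratio` and `seed` could jointly determine the participants of a round, here is a minimal sketch. The function name, the rounding rule, the guarantee of at least one client, and the per-round seeding scheme are all assumptions and may differ from the codebase.

```python
import random
from typing import List


def sample_clients(num_clients: int, active_client_ratio: float,
                   seed: int, round_idx: int) -> List[int]:
    """Pick the clients that train in a given round, reproducibly."""
    # Assumption: the count is rounded and at least one client is always selected.
    k = max(1, round(active_client_ratio * num_clients))
    k = min(k, num_clients)
    # Assumption: a per-round generator derived from the global seed.
    rng = random.Random(seed * 1_000_003 + round_idx)
    return sorted(rng.sample(range(num_clients), k))
```

Deriving the generator from both the seed and the round index keeps each round's selection reproducible without depending on how many random draws happened earlier in the run.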

---

## `model` Block (Common Fields)

- **`num_classes: int`** — Number of target classes (set according to the dataset).
- **`in_dim: int`** — Input feature dimension for linear heads. For feature-based models it is inferred from the encoder when omitted.

---

## Model Families

### LP-Softmax (`model_name="lp_softmax"`)
- **Input**: precomputed encoder features.
- **Head**: single linear classifier trained with cross-entropy.
- **Aggregation**: `fedavg` on linear head parameters.
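
The `fedavg` rule applied to the linear head amounts to a sample-count-weighted mean. The sketch below uses flat Python lists in place of parameter tensors; the function name and signature are hypothetical.

```python
from typing import List, Sequence


def fedavg(client_params: Sequence[Sequence[float]],
           client_sizes: Sequence[int]) -> List[float]:
    """Average flattened parameter vectors, weighting each client by its sample count."""
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("at least one client must hold samples")
    dim = len(client_params[0])
    return [
        sum(n * p[j] for p, n in zip(client_params, client_sizes)) / total
        for j in range(dim)
    ]
```

For example, `fedavg([[1.0, 0.0], [3.0, 4.0]], [1, 3])` returns `[2.5, 3.0]`: the second client holds three times as many samples, so its parameters carry three times the weight.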

### OvA-LP (`model_name="ova_lp"`)
- **Two-stage training** on precomputed features:
  - **Stage-1**: per-class binary heads trained with positive-only BCE; the server aggregates heads **row-wise** (per class) using per-class counts.
  - **Stage-2**: unified one-vs-all BCE head; the server performs standard (scalar) FedAvg.
- **`num_stage1_rounds: int`** — Number of initial rounds that use Stage-1. A round runs Stage-1 while `round_idx < num_stage1_rounds` holds and Stage-2 otherwise.
- **`min_samples_stage1: int`** — Minimum local positives required for a class to participate in Stage-1 on a client.
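
The stage switch and the row-wise Stage-1 aggregation described above can be sketched as follows. This is a simplified illustration under stated assumptions: heads are nested lists, the `min_samples_stage1` filter is applied at aggregation time, and a class with no eligible client keeps its previous global row. The function names are hypothetical.

```python
from typing import List, Sequence

Head = List[List[float]]  # one weight row per class


def stage_for_round(round_idx: int, num_stage1_rounds: int) -> int:
    """Stage-1 while round_idx < num_stage1_rounds, Stage-2 afterwards."""
    return 1 if round_idx < num_stage1_rounds else 2


def aggregate_rowwise(client_heads: Sequence[Head],
                      client_class_counts: Sequence[Sequence[int]],
                      global_head: Head,
                      min_samples_stage1: int) -> Head:
    """Average each class row separately, weighted by per-class positive counts."""
    new_head: Head = []
    for c, old_row in enumerate(global_head):
        # Only clients with enough positives for class c contribute to row c.
        contrib = [
            (head[c], counts[c])
            for head, counts in zip(client_heads, client_class_counts)
            if counts[c] > 0 and counts[c] >= min_samples_stage1
        ]
        total = sum(n for _, n in contrib)
        if total == 0:
            new_head.append(list(old_row))  # assumption: keep the previous row
            continue
        new_head.append([
            sum(n * row[j] for row, n in contrib) / total
            for j in range(len(old_row))
        ])
    return new_head
```

The difference from scalar FedAvg is that each class row has its own set of weights, so a client that never saw a class does not pull that class's row toward its untrained values.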

### PFPT (`model_name="pfpt"`; baseline extension)
- **Input**: images (no feature precompute).
- **Server aggregation (`pfpt_agg`)**: treats client prompt sets as samples from a probabilistic set model and learns a global set of summarizing prompts Φ via an iterative, EM-like procedure:
  - **Assignment**: align local prompts to Φ using weighted bipartite matching (Hungarian).
  - **Update**: refine Φ; inactive prompts are pruned, so the size of Φ is not fixed in advance and can change across rounds (non-parametric).
- **`num_tokens: int`** — Number of prompt tokens inserted into the encoder.
- **`num_classes: int`** — Number of classes for the classification head built on top of the prompted encoder.
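
The assignment step can be illustrated with a toy minimum-cost matching between local and global prompts. This sketch is not the PFPT procedure: it replaces the Hungarian algorithm with a brute-force search over permutations (same optimum, only practical for tiny sets), uses squared Euclidean distance as an assumed cost, and omits the probabilistic weighting and the update step entirely.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two prompt vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def match_prompts(local_prompts: Sequence[Sequence[float]],
                  global_prompts: Sequence[Sequence[float]]) -> List[Tuple[int, int]]:
    """Return (local_idx, global_idx) pairs with minimal total cost.

    Requires len(local_prompts) <= len(global_prompts). Brute force stands in
    for the Hungarian algorithm here.
    """
    best_cost, best = float("inf"), ()
    for perm in permutations(range(len(global_prompts)), len(local_prompts)):
        cost = sum(sq_dist(local_prompts[i], global_prompts[g]) for i, g in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, perm
    return list(enumerate(best))
```

Global prompts that no client matches over a round are the natural candidates for the pruning mentioned in the update step, although the actual criterion used by `pfpt_agg` is not specified here.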

### FMoE (`model_name="fmoe"`; baseline extension)
- **Input**: images (no feature precompute).
- **Design**: inserts sparse MoE adapters in FFN blocks; each input is routed to top-K experts by a lightweight router.
- **Server aggregation (`fmoe_agg`)**: averages trainable adapter/head parameters weighted by client sample counts.
- **Load balancing**: an auxiliary KL-divergence term encourages balanced expert utilization under Non-IID data.
- **Router**: uses a standard softmax over routing logits (no temperature parameter in the current code).
- **Mixing weight — `MoE_alpha`**: scalar factor that multiplies the MoE path before combining with the original FFN path. In the current code this is fixed to 4; if exposed via cfg, it would control the relative contribution of MoE vs. FFN.
- **`num_experts: int`** — Number of experts per MoE layer.
- **`rank_per_expert: int`** — Bottleneck rank of each expert MLP/adapter.
- **`top_k: int`** — Number of experts activated per token.
- **`num_classes: int`** — Number of classes for the classifier head.
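
The routing, load-balancing, and mixing pieces can be sketched on plain lists. Several details are assumptions rather than facts about the code: the gate weights of the selected experts are renormalized to sum to one, the auxiliary term is written as KL(usage ‖ uniform), and the MoE path is combined with the FFN path by addition. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

MOE_ALPHA = 4.0  # the document states this factor is fixed to 4 in the current code


def softmax(logits: Sequence[float]) -> List[float]:
    """Standard softmax without a temperature parameter."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def top_k_route(logits: Sequence[float], top_k: int) -> List[Tuple[int, float]]:
    """Select the top_k experts for one token and return (expert, gate weight) pairs."""
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:top_k]
    mass = sum(probs[e] for e in chosen)
    # Assumption: selected gate weights are renormalized to sum to 1.
    return [(e, probs[e] / mass) for e in chosen]


def kl_to_uniform(usage: Sequence[float]) -> float:
    """KL(usage || uniform) over experts; zero when utilization is balanced."""
    n = len(usage)
    return sum(p * math.log(p * n) for p in usage if p > 0)


def combine(ffn_out: Sequence[float], moe_out: Sequence[float],
            alpha: float = MOE_ALPHA) -> List[float]:
    """Assumption: the scaled MoE path is added to the original FFN path."""
    return [f + alpha * m for f, m in zip(ffn_out, moe_out)]
```

In a training loop the auxiliary term would be multiplied by `train.aux_lambda` and added to the task loss, so a larger coefficient pushes the router harder toward even expert utilization.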

---

## Outputs

- **Per-round CSV** with the following columns:
  - `round` — round index
  - `loss` — evaluation loss
  - `acc` — evaluation accuracy
  - `round_client_time_sec`, `round_each_client_time_sec`, `round_server_time_sec` — timing breakdowns
  - `cum_client_time_sec`, `cum_each_client_time_sec`, `cum_server_time_sec`, `cum_time_sec` — cumulative timings
- **File naming** is derived from the configuration (including a hash) and `exp_name` so that runs with different settings do not collide.
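
One way to obtain collision-free filenames from a configuration is to hash a canonical serialization of it. The sketch below is an assumption about the general approach, not the repository's naming scheme: the helper names, the `<exp_name>_<hash>.csv` pattern, the hash function, and the digest length are all hypothetical.

```python
import hashlib
import json
import os
from typing import Any, Dict


def config_digest(cfg: Dict[str, Any], length: int = 10) -> str:
    """Short hash of the configuration; key order does not affect the result."""
    payload = json.dumps(cfg, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:length]


def result_path(cfg: Dict[str, Any], results_dir: str) -> str:
    """Hypothetical results filename built from exp_name and the config hash."""
    return os.path.join(results_dir, f"{cfg['exp_name']}_{config_digest(cfg)}.csv")


def should_skip(cfg: Dict[str, Any], results_dir: str) -> bool:
    """Mirror of train.skip_if_exists: reuse results only if the file is already there."""
    wants_skip = bool(cfg.get("train", {}).get("skip_if_exists"))
    return wants_skip and os.path.exists(result_path(cfg, results_dir))
```

Serializing with sorted keys makes the digest depend only on the configuration's content, so two runs that differ in any setting map to different files while a rerun of the same settings maps to the same one.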

---

## Notes on Partitions & Features

- The `feature_skew` partition relies on precomputed features. Ensure `encoder.precompute` is enabled before creating this partition.
- Feature-based models (`lp_softmax`, `ova_lp`) operate on cached features; image-based baselines (`pfpt`, `fmoe`) process raw images end-to-end.
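
To show why `feature_skew` needs features before partitioning, here is a small standard-library sketch of KMeans-based assignment. It is illustrative only: the function names are hypothetical, it assumes exactly one cluster per client, and it uses a plain Lloyd iteration with random initial centers instead of whatever KMeans implementation the codebase calls.

```python
import random
from typing import List, Sequence


def kmeans_assign(features: Sequence[Sequence[float]], k: int,
                  seed: int, iters: int = 20) -> List[int]:
    """Plain Lloyd's algorithm; returns a cluster index for each feature vector."""
    rng = random.Random(seed)
    centers = [list(f) for f in rng.sample(list(features), k)]
    assign = [0] * len(features)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        assign = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(f, centers[c])))
            for f in features
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [f for f, a in zip(features, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def feature_skew_partition(features: Sequence[Sequence[float]],
                           num_clients: int, seed: int) -> List[List[int]]:
    """Assumption: one cluster per client, so client c receives cluster c's samples."""
    clients: List[List[int]] = [[] for _ in range(num_clients)]
    for idx, c in enumerate(kmeans_assign(features, num_clients, seed)):
        clients[c].append(idx)
    return clients
```

Because the clustering runs on encoder features, the partition cannot be built until the precompute pass has produced them, which is the dependency noted in the first bullet above.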

