# vStream (Submission-Only Code)

This directory contains a clean, reviewer-oriented re-packaging of the vStream code.
It is designed for ICML/NeurIPS supplementary material: minimal, readable, and runnable.

Key constraints of this submission package:
- Supports `Qwen/Qwen3-VL-8B-Thinking`.
- Two main workflows:
  - `scripts/online_train.py`: end-to-end training from a single (image, question) sample.
  - `inference/infer_attribution.py`: apply trained `theta` to produce visual attributions.
- No Hydra; configuration is simple `argparse` + Python dataclasses.
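
As a rough illustration of the `argparse` + dataclass pattern mentioned above, the sketch below parses the training flags shown in this README into a typed config object. The names `TrainConfig` and `parse_train_config` are hypothetical, and the defaults simply copy the Quickstart values; the actual scripts may organize their options differently.

```python
import argparse
from dataclasses import dataclass


@dataclass
class TrainConfig:
    # Field names mirror flags shown in this README; defaults are illustrative only.
    image: str
    question: str
    out_dir: str
    steps: int = 100
    lr: float = 1e-3
    num_masks: int = 32
    alpha: float = 0.5


def parse_train_config(argv=None) -> TrainConfig:
    """Parse command-line flags into a TrainConfig (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Sketch of a Hydra-free config parser")
    parser.add_argument("--image", required=True)
    parser.add_argument("--question", required=True)
    parser.add_argument("--out_dir", required=True)
    parser.add_argument("--steps", type=int, default=100)
    parser.add_argument("--lr", type=float, default=1e-3)
    parser.add_argument("--num_masks", type=int, default=32)
    parser.add_argument("--alpha", type=float, default=0.5)
    args = parser.parse_args(argv)
    # vars(args) is a plain dict whose keys match the dataclass fields.
    return TrainConfig(**vars(args))
```

Passing an explicit `argv` list (instead of reading `sys.argv`) keeps the parser easy to exercise in tests.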

## Overview (What This Code Demonstrates)

vStream learns a single linear weight vector `theta` over attention heads.

TRAINING (GPU required):
1) Run Qwen3-VL-8B-Thinking to produce a `<think> ... </think>` span.
2) Extract per-head attention features from the model.
3) Run ablation experiments (mask visual sources, measure output change).
4) Train a linear estimator `theta` (1152 dims = 36 layers × 32 heads) to predict ablation effects.
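
Step 4 can be pictured as an ordinary linear regression. The sketch below fits `theta` on synthetic data, assuming a mean-squared-error objective and plain gradient descent; the real loss, optimizer, and feature construction in `scripts/online_train.py` may differ, and `fit_theta` is a hypothetical name.

```python
import numpy as np


def fit_theta(features, effects, steps=200, lr=0.05):
    """Fit theta by gradient descent on mean squared error.

    features: (num_experiments, dim) per-head attention features.
    effects:  (num_experiments,) measured output change under ablation.
    """
    theta = np.zeros(features.shape[1])
    for _ in range(steps):
        residual = features @ theta - effects
        grad = 2.0 * features.T @ residual / len(effects)
        theta -= lr * grad
    return theta


# Synthetic stand-in data: real features would come from the model's attention heads.
rng = np.random.default_rng(0)
num_experiments, dim = 256, 1152
features = rng.normal(size=(num_experiments, dim))
true_theta = rng.normal(size=dim)
effects = features @ true_theta

theta = fit_theta(features, effects)
initial_mse = float(np.mean(effects ** 2))  # error of the all-zero starting point
final_mse = float(np.mean((features @ theta - effects) ** 2))
```

The step size here is tuned to the synthetic data and is not a recommendation for `--lr`.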

INFERENCE (GPU required):
1) Run Qwen3-VL-8B-Thinking to produce a `<think> ... </think>` span.
2) Extract per-head attention features from the model.
3) Compute attribution scores: `score(source) = features(source) · theta`.
4) Export a heatmap overlay + a JSON dump.
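
The scoring rule in step 3 is a single matrix-vector product. A minimal sketch with toy dimensions (3 features instead of 1152) follows; `attribution_scores` is a hypothetical name, not necessarily a function in this package.

```python
import numpy as np


def attribution_scores(features, theta):
    """Return one score per source: features(source) · theta."""
    return np.asarray(features) @ np.asarray(theta)


# Two sources, three toy feature dimensions.
features = np.array([[1.0, 0.0, 2.0],
                     [0.0, 1.0, 0.0]])
theta = np.array([0.5, -1.0, 0.25])

scores = attribution_scores(features, theta)  # array([ 1., -1.])
ranking = np.argsort(-scores)                 # source indices, highest score first
```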

## Installation

From this directory:

```bash
pip install -r requirements.txt
```

Optional (for DINOv3 region clustering):

```bash
pip install scikit-learn
```

## Quickstart: TRAINING (GPU)

End-to-end training from a single sample. Requires Qwen3-VL-8B-Thinking weights.

```bash
python scripts/online_train.py \
  --weights_dir /path/to/hf_cache \
  --image assets/example_image.png \
  --question "What is shown?" \
  --out_dir /tmp/online_demo \
  --steps 100 \
  --lr 1e-3 \
  --num_masks 32 \
  --alpha 0.5
```

Key parameters:
- `--steps`: Number of training iterations (each uses different random masks)
- `--num_masks`: Ablation masks per step (higher = more signal, slower)
- `--alpha`: Fraction of sources kept per mask (0.5 = 50% of sources kept, the rest ablated)
- `--checkpoint_every`: Interval, in steps, at which intermediate checkpoints are saved
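
To make `--num_masks` and `--alpha` concrete, the sketch below draws random keep/ablate masks, assuming each source is kept independently with probability `alpha`. This sampling scheme and the name `sample_masks` are assumptions for illustration; the training script may sample masks differently (for example, keeping an exact count per mask).

```python
import numpy as np


def sample_masks(num_sources, num_masks=32, alpha=0.5, rng=None):
    """Return a (num_masks, num_sources) boolean array.

    True means the source is kept; False means it is ablated.
    """
    rng = np.random.default_rng() if rng is None else rng
    return rng.random((num_masks, num_sources)) < alpha


masks = sample_masks(num_sources=1000, num_masks=32, alpha=0.5,
                     rng=np.random.default_rng(0))
kept_fraction = float(masks.mean())  # close to alpha on average
```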

Outputs:
- `estimator-step{N}.pt`: Intermediate checkpoints
- `estimator.pt`: Final trained estimator

## Quickstart: INFERENCE (GPU)

The Qwen3-VL-8B-Thinking weights must be available locally.
Set `--weights_dir` to your Hugging Face cache directory (offline use is supported).

```bash
python inference/infer_attribution.py \
  --weights_dir /path/to/hf_cache \
  --image assets/example_image.png \
  --question "What is shown?" \
  --theta assets/estimator_qwen3vl8b.pt \
  --out_dir /tmp/vstream_demo \
  --source_mode block \
  --block_h 2 --block_w 2
```

Outputs:
- `/tmp/vstream_demo/image.png`
- `/tmp/vstream_demo/attribution.png`
- `/tmp/vstream_demo/data.json`

## Verification Scripts

```bash
# CPU-only smoke test (no GPU required)
python scripts/verify_cpu.py

# GPU inference test (requires Qwen weights)
python scripts/verify_gpu_infer.py --weights_dir /path/to/hf_cache --out_dir /tmp/test
```

## DINOv3 Regions (Optional)

To use DINOv3 clustering as region sources:

```bash
python inference/infer_attribution.py \
  ... \
  --source_mode dinov3 \
  --dinov3_cache_dir /path/to/hf_cache
```

This path is optional and isolated; the default demo uses simple grid-based sources.
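
For comparison, grid-based sources can be sketched as a mapping from each patch in the image's patch grid to a block index. The code assumes `--block_h` and `--block_w` give the block size in patches, which this README does not state explicitly; `block_source_ids` is a hypothetical helper.

```python
import numpy as np


def block_source_ids(grid_h, grid_w, block_h=2, block_w=2):
    """Assign every patch in a (grid_h, grid_w) grid to a block-shaped source."""
    rows = np.arange(grid_h) // block_h
    cols = np.arange(grid_w) // block_w
    blocks_per_row = -(-grid_w // block_w)  # ceiling division
    return rows[:, None] * blocks_per_row + cols[None, :]


ids = block_source_ids(4, 6)
# [[0 0 1 1 2 2]
#  [0 0 1 1 2 2]
#  [3 3 4 4 5 5]
#  [3 3 4 4 5 5]]
```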

## DINOv3 Attention-Weighted Patch Scores (Optional)

By default, region attribution scores are distributed uniformly across patches within each region.
To use DINOv3 attention as a spatial prior for finer localization:

```bash
python inference/infer_attribution.py \
  ... \
  --use_dino_attention \
  --dinov3_model facebook/dinov3-vitl16-pretrain-lvd1689m \
  --dinov3_cache_dir /path/to/hf_cache
```

This redistributes region scores to patches using DINOv3's last-layer CLS attention:

$$s_i = \widehat{\Delta}_{R(i)}(S) \cdot \frac{a^\mathtt{DINO}_i}{\sum_{j \in R(i)} a^\mathtt{DINO}_j}$$

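Here $a^\mathtt{DINO}_i$ is the attention weight from the CLS token to patch $i$, $R(i)$ is the region containing patch $i$, and $\widehat{\Delta}_{R(i)}(S)$ is that region's estimated attribution score. Because the weights are normalized within each region, the patch scores in a region sum to the region's score.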
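
The redistribution formula can be written in a few lines of NumPy. This is a sketch under the assumption that every region has strictly positive total attention; `redistribute` and its argument names are hypothetical and not taken from the package.

```python
import numpy as np


def redistribute(region_scores, region_ids, cls_attention):
    """Spread each region's score over its patches in proportion to CLS attention.

    region_scores: (num_regions,) score per region.
    region_ids:    (num_patches,) integer region index R(i) for each patch i.
    cls_attention: (num_patches,) CLS-to-patch attention weight for each patch.
    """
    region_scores = np.asarray(region_scores, dtype=float)
    region_ids = np.asarray(region_ids)
    cls_attention = np.asarray(cls_attention, dtype=float)
    # Denominator: total attention inside each region.
    region_attention = np.bincount(region_ids, weights=cls_attention,
                                   minlength=len(region_scores))
    return region_scores[region_ids] * cls_attention / region_attention[region_ids]


patch_scores = redistribute(
    region_scores=[2.0, 1.0],
    region_ids=[0, 0, 1, 1],
    cls_attention=[0.1, 0.3, 0.2, 0.2],
)
# Region 0's score of 2.0 is split 0.5 / 1.5; region 1's score of 1.0 is split 0.5 / 0.5.
```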
