# FlashTrace Experiment 2 (Multi-hop Reasoning Faithfulness)

This directory provides experimental tools for "11 datasets × 9 methods × 3 metrics", **skipping AT2** and **skipping math**. The workflow consists of two steps: first sample and filter high-quality CoT+boxed generations, then run attribution evaluation on the filtered results.

Supported datasets: MoreHopQA, HotpotQA (RULER hotpotqa_long), RULER niah (niah_*), RULER variable tracking (vt_*). RULER paths are automatically searched in `data/ruler_multihop/<len>/.../validation.jsonl`.

Main files:
- `sample_and_filter.py`: Sampling + consistency validation, outputs to `exp/exp2/data/`
- `run_exp.py`: Attribution testing, outputs to `exp/exp2/output/`
- `dataset_utils.py`: Data loading, answer span parsing

Datasets supported by the sampling script:
- `morehopqa` (local `data/with_human_verification.json`)
- `hotpotqa_long` (auto-search in `data/ruler_multihop/<len>/hotpotqa_long/validation.jsonl`)
- `niah_*` (RULER niah variants, auto-search as above)
- `vt_*` (RULER variable tracking variants, auto-search as above)
- Direct RULER JSONL path (treated as dataset name), other types not supported
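
The RULER auto-search can be pictured with the sketch below. It assumes the layout `data/ruler_multihop/<len>/<dataset>/validation.jsonl` shown for `hotpotqa_long`. The helper name `find_ruler_files` is hypothetical, and the actual lookup in `dataset_utils.py` may differ, for example in how it chooses among several context lengths.

```python
from pathlib import Path


def find_ruler_files(dataset: str, root: str = "data/ruler_multihop") -> list[Path]:
    """Return every validation.jsonl for `dataset` under root/<len>/.

    Hypothetical helper: assumes the layout root/<len>/<dataset>/validation.jsonl.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    # One candidate per context-length directory, sorted for a stable order.
    return sorted(root_path.glob(f"*/{dataset}/validation.jsonl"))
```

An empty result means the dataset was not found under the root, in which case a direct JSONL path can be passed instead.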

Attribution testing support:
- Datasets: Prioritizes `exp/exp2/data/<name>.jsonl` cache; if not available, loads using the same parsing rules as sampling; math is explicitly rejected.
- Metrics:
  - `faithfulness_gen` (generation-side): Can run on any loaded samples (except math).
  - `recovery_ruler` (recovery rate, RULER only): Recall@10% (ranking only on prompt tokens, gold from `needle_spans`).
- Methods (`--attr_funcs`): `IG`, `perturbation_all`, `perturbation_CLP`, `perturbation_REAGENT`, `attention` (internally fused with IG), `ifr_all_positions`, `ifr_multi_hop`, `attnlrp`, `ft_attnlrp`, `basic`. AT2 not provided.
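
As a rough illustration of `recovery_ruler`, the sketch below computes Recall@10% for one sample: rank the prompt tokens by attribution score, keep the top 10%, and measure what fraction of the gold needle tokens fall inside that set. The function name, the inclusive `(start, end)` convention for `needle_spans`, and the rounding of the top-k size are assumptions rather than the repository's exact implementation.

```python
import math


def recall_at_fraction(scores, needle_spans, fraction=0.10):
    """Fraction of gold needle tokens that land in the top-`fraction` of prompt tokens.

    scores: one attribution score per prompt token.
    needle_spans: list of (start, end) token index pairs, assumed inclusive.
    """
    gold = set()
    for start, end in needle_spans:
        gold.update(range(start, end + 1))
    if not gold or not scores:
        return 0.0
    k = max(1, math.ceil(fraction * len(scores)))  # assumed rounding rule
    # Indices of the k highest-scoring prompt tokens.
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return len(gold.intersection(top_k)) / len(gold)
```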

---

## Data Sampling

Implementation logic:
- Unified data loading: `DatasetLoader` reads MoreHopQA / HotpotQA / RULER niah / RULER vt; can also pass custom RULER JSONL directly.
- Generation model: `qwen3-235b-a22b-2507` (English system prompt), requires "brief thinking first, then wrap final answer in `\box{}` with no trailing content"; user prompt is the original question, no additional template.
- Judge model: `deepseek-v3-1-terminus` (English system prompt), outputs only True/False to judge whether the content in `\box{}` matches the reference answer.
- Filtering: Keeps only samples that have the "thinking + trailing boxed answer" format and are judged True. `target` is reconstructed from the extracted thinking segment and **the final answer without the box wrapper**. Each record stores token-level `sink_span`/`thinking_span`, `reference_answer`, and `judge_response` (`candidate_answer` is no longer stored). `indices_to_explain` is always set to `sink_span`, the token span `[start_tok, end_tok]` of the boxed content within `target`.
- Sampling: Iterates over samples in their original order and skips a sample as soon as the judge rejects it. Stops early once `--max_examples` successful samples have been collected (fewer if the source data runs out). tqdm displays both attempt and success counts.
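
The format check in the filtering step can be sketched as follows. This is a minimal illustration, not the code in `sample_and_filter.py`: the function name is hypothetical, the pattern accepts both `\box{...}` and `\boxed{...}`, and it assumes the answer contains no nested braces.

```python
import re

# Matches \box{...} or \boxed{...} only when it is the last thing in the text.
# Assumption: the answer itself contains no nested braces.
_TRAILING_BOX = re.compile(r"\\box(?:ed)?\{([^{}]*)\}\s*$")


def split_thinking_and_answer(generation: str):
    """Return (thinking, answer), or None when the format check fails."""
    match = _TRAILING_BOX.search(generation)
    if match is None:
        return None  # no boxed answer, or extra content after the box
    thinking = generation[: match.start()].strip()
    answer = match.group(1).strip()
    if not thinking or not answer:
        return None  # both a thinking segment and an answer are required
    return thinking, answer
```

A generation that passes this check would still need the judge's True verdict before being kept.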

Usage:
```bash
export FLASHTRACE_API_KEY=sk-xxx-xxx-xxxx  # or OPENAI_API_KEY

# Example: sample hotpotqa_long, keep up to 100 samples judged as True
python exp/exp2/sample_and_filter.py \
  --dataset hotpotqa_long \
  --max_examples 100 \
  --api_key sk-xxx-xxx-xxxx \
  --tokenizer_model /opt/share/models/Qwen/Qwen3-8B > exp/exp2/out.log
```
Common parameters:
- `--dataset`: morehopqa | hotpotqa_long | niah_* | vt_* (or direct JSONL path)
- `--max_examples`: Number of successful samples to keep; stops when reached (fewer if source data is insufficient)
- `--tokenizer_model`: Tokenizer for span detection (defaults to generation model)
- `--api_base`/`--api_key`: API endpoint and key (default local http://localhost:4000/v1)
- `--request_interval` / `--judge_interval`: Minimum interval between generation / judge requests (default 1s)
- `--rate_limit_delay`: Seconds to wait after an HTTP 429 (default 5); the script sleeps automatically before retrying

Output: `exp/exp2/data/<dataset>.jsonl`
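
To inspect the filtered cache, a reader along these lines may help. It is a sketch under assumptions: the field names come from the filtering description above, both function names are hypothetical, and spans are assumed to be stored as two-element `[start_tok, end_tok]` lists.

```python
import json


def load_filtered_samples(path):
    """Read one JSON object per line, skipping blank lines."""
    samples = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                samples.append(json.loads(line))
    return samples


def has_consistent_spans(sample):
    """Check that sink_span is ordered and mirrored by indices_to_explain."""
    start_tok, end_tok = sample["sink_span"]
    if not 0 <= start_tok <= end_tok:
        return False
    # indices_to_explain is documented as being set to sink_span.
    return list(sample["indices_to_explain"]) == [start_tok, end_tok]
```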

---

## Attribution Testing

Implementation logic:
- Input: Reads `exp/exp2/data/<dataset>.jsonl` (the filtered cache) first; if it does not exist, falls back to parsing the original data.
- Methods: Faithfulness (token-level RISE/MAS) follows the logic in `evaluations/faithfulness.py` (AT2 is not implemented); math is rejected automatically.
- Multi-hop FlashTrace: If cache contains `sink_span`/`thinking_span`, uses them for multi-hop IFR; otherwise defaults to entire answer as sink.
- A single run can evaluate multiple metrics: `--mode` supports multiple values and comma separation (e.g., `--mode faithfulness_gen,recovery_ruler` or `--mode faithfulness_gen, recovery_ruler`), performs attribution only once for the same batch of samples.
- Optional sample-level trace saving: Adding `--save_hop_traces` saves attribution vectors and per-sample metrics for **all methods, all samples** to `exp/exp2/output/traces/...`; for multi-hop methods, also saves per-hop token-level vectors `V_h` (single `vh`, the vector actually participating in multi-hop propagation), and records `attnlrp_neg_handling/attnlrp_norm_mode` etc. in manifest.
- Known compatibility issue: Some tokenizers merge tokens at chat-template boundaries, which caused evaluation to fail when locating the user prompt by token-id subsequence. exp2 now reuses the `user_prompt_indices` computed during attribution to position perturbations.
- Batch size estimation: Uses the original script's conservative estimate `(max_input_len-100)/len(tokenizer(format_prompt(prompt)+target))` (at least 1). `max_input_len` is determined by code's built-in mapping table based on `--model` string; defaults to 2000 if not matched or only `--model_path` is passed; if you need the mapped value but use a local path, also pass the corresponding `--model` name.
- Timing: Times the attribution computation (recovery/faithfulness) for each sample separately, appends `Avg Sample Time (s)` to the end of the CSV, and prints the average time to the console.
- Output: `exp/exp2/output/faithfulness/...`, `exp/exp2/output/recovery/...`, and (optionally) `exp/exp2/output/traces/...`, organized by dataset and model subdirectories.
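
The batch size estimate described above can be sketched as follows. The function name and the `max_len_table` argument are hypothetical, and integer floor division is an assumption about how the ratio is rounded. The built-in mapping table lives in the code and its entries are not reproduced here.

```python
DEFAULT_MAX_INPUT_LEN = 2000  # documented fallback when --model is not in the table


def estimate_batch_size(num_tokens, model_name=None, max_len_table=None):
    """Conservative batch size: (max_input_len - 100) // num_tokens, at least 1.

    num_tokens: token count of the formatted prompt plus target.
    max_len_table: model-name -> max_input_len mapping (placeholder for the
    built-in table, whose real entries are not shown in this README).
    """
    table = max_len_table or {}
    max_input_len = table.get(model_name, DEFAULT_MAX_INPUT_LEN)
    return max(1, (max_input_len - 100) // max(1, num_tokens))
```

With the 2000-token fallback, a 500-token sample gives a batch size of 3, and any sample longer than 1900 tokens gives 1.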

Usage:
```bash
# Generation-side RISE/MAS faithfulness.
# Method names for --attr_funcs: perturbation_all_fast,perturbation_CLP_fast,perturbation_REAGENT_fast,ifr_multi_hop_stop_words,ifr_multi_hop_both,ifr_multi_hop_split_hop,ft_attnlrp,ifr_multi_hop,attnlrp,ifr_all_positions,perturbation_all,perturbation_REAGENT,perturbation_CLP,IG,attention
python exp/exp2/run_exp.py \
  --datasets exp/exp2/data/hotpotqa_long.jsonl \
  --attr_funcs IG,attention \
  --model qwen-8B \
  --model_path /opt/share/models/Qwen/Qwen3-8B/ \
  --cuda 2,3,4,5,6,7 \
  --num_examples 100 \
  --mode faithfulness_gen \
  --n_hops 1 \
  --save_hop_traces \
&& python exp/exp2/run_exp.py \
  --datasets exp/exp2/data/morehopqa.jsonl \
  --attr_funcs IG,attention \
  --model qwen-8B \
  --model_path /opt/share/models/Qwen/Qwen3-8B/ \
  --cuda 2,3,4,5,6,7 \
  --num_examples 100 \
  --mode faithfulness_gen \
  --n_hops 1 \
  --save_hop_traces

# Optional FT-AttnLRP flags; append them to either command above:
#   --attnlrp_neg_handling drop
#   --attnlrp_norm_mode norm
```
Common parameters:
- `--datasets`: Comma-separated dataset names; uses `exp/exp2/data/<name>.jsonl` directly if exists.
- `--attr_funcs`: Comma-separated methods (no AT2); `ifr_multi_hop` and `ft_attnlrp` support multi-hop (controlled by `--n_hops`).
- `--attnlrp_neg_handling`: FT-AttnLRP per-hop negative value handling (`drop`/`abs`).
- `--attnlrp_norm_mode`: FT-AttnLRP normalization and hop ratio switch (`norm`/`no_norm`).
- `--data_root`/`--output_root`: Cache and result directories (default `exp/exp2/data` / `exp/exp2/output`).
- `--mode`: `faithfulness_gen`, `recovery_ruler`; accepts multiple values or a comma-separated list (one attribution pass yields all requested metrics). `--num_examples` controls how many samples are evaluated. Math datasets are rejected.
- `--save_hop_traces`: Saves sample-level trace to `exp/exp2/output/traces/<dataset>/<model>/<run_tag>/` (each sample `ex_*.npz` + `manifest.jsonl`).
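
Accepting `--mode` both as several values and as a comma-separated list could be handled as in the sketch below. This is an assumption about the approach, not the parser in `run_exp.py`: `build_parser`, `parse_modes`, and the default value are hypothetical.

```python
import argparse

VALID_MODES = ("faithfulness_gen", "recovery_ruler")


def build_parser():
    """Minimal parser with a multi-value --mode option (default is assumed)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", nargs="+", default=["faithfulness_gen"])
    return parser


def parse_modes(raw_values):
    """Flatten space- and comma-separated --mode values, dropping duplicates."""
    modes = []
    for raw in raw_values:
        for name in raw.split(","):
            name = name.strip()
            if not name:
                continue  # "a, b" arrives as ["a,", "b"], leaving an empty piece
            if name not in VALID_MODES:
                raise ValueError(f"unknown mode: {name}")
            if name not in modes:
                modes.append(name)
    return modes
```

Both `--mode faithfulness_gen,recovery_ruler` and `--mode faithfulness_gen, recovery_ruler` resolve to the same two-item list here.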
