# exp/proc (exp2 Trace Mapping/Export)

This directory provides tools for processing trace results from `exp/exp2/run_exp.py --save_hop_traces` into simplified sample-level `.npz` files for external use.

Main files:
- `exp/proc/map_exp2_traces_to_proc.py`: reads an exp2 trace run folder (`manifest.jsonl` + `ex_*.npz`) and writes the simplified format to `exp/proc/output/`.

---

## Input Requirements

The following arguments must be provided (or left to be auto-inferred):
- `--trace_dir`: exp2 trace run folder, e.g.:
  - `exp/exp2/output/traces/exp/exp2/data/morehopqa.jsonl/qwen-8B/ifr_all_positions_mfaithfulness_gen_95ex/`
- `--dataset_jsonl`: exp2 cache dataset corresponding to this trace run (must contain `prompt` + `target`), e.g.:
  - `exp/exp2/data/morehopqa.jsonl`
- `--tokenizer_model`: the tokenizer used for exp2 attribution (local path or model name), e.g.:
  - `/opt/share/models/Qwen/Qwen3-8B/`

Notes:
- This script strictly replicates exp2's token alignment logic (a leading space is added to the prompt; the generation is tokenized as `target + eos_token`, then decoded and sliced by offset). The tokenizer must therefore match the one used for exp2 attribution exactly; otherwise the script fails immediately with a length mismatch error.
- Samples are matched by aligning `prompt_sha1/target_sha1` from `manifest.jsonl` against `--dataset_jsonl`, so `--dataset_jsonl` must be the cache that was used for this trace run.
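The sha1-based matching described above can be sketched as follows. This is only an illustration: the field names `prompt_sha1`, `target_sha1`, `prompt`, and `target` come from this README, but hashing the UTF-8 encoded text with `hashlib.sha1` is an assumption, and `sha1_text`, `build_index`, and `match_manifest` are hypothetical helper names rather than functions from the script.

```python
import hashlib


def sha1_text(text: str) -> str:
    """Hex SHA-1 of the UTF-8 encoded text (assumed hashing scheme)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def build_index(dataset_rows):
    """Map (prompt_sha1, target_sha1) -> dataset row."""
    index = {}
    for row in dataset_rows:
        key = (sha1_text(row["prompt"]), sha1_text(row["target"]))
        index[key] = row
    return index


def match_manifest(manifest_rows, index):
    """Return the dataset row for each manifest entry; raise if one is missing."""
    matched = []
    for entry in manifest_rows:
        key = (entry["prompt_sha1"], entry["target_sha1"])
        if key not in index:
            raise KeyError(f"Failed to match manifest sha1 to dataset_jsonl: {key}")
        matched.append(index[key])
    return matched


# Tiny in-memory stand-ins for the dataset cache and manifest.jsonl.
dataset = [
    {"prompt": "Q: 1+1?", "target": "2"},
    {"prompt": "Q: capital of France?", "target": "Paris"},
]
manifest = [
    {
        "prompt_sha1": sha1_text("Q: capital of France?"),
        "target_sha1": sha1_text("Paris"),
    },
]

matched = match_manifest(manifest, build_index(dataset))
print(matched[0]["target"])  # prints "Paris"
```

Because the key depends on both `prompt` and `target`, a cache row without `target` can never match, which is consistent with the requirement that the dataset contain both fields.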

---

## Output Location and Naming

By default, output is written to:
- `exp/proc/output/<same relative path as after traces/>/`

For example, given the input:
- `.../output/traces/exp/exp2/data/morehopqa.jsonl/qwen-8B/<run_tag>/`

the default output directory is:
- `exp/proc/output/exp/exp2/data/morehopqa.jsonl/qwen-8B/<run_tag>/`
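The default path mapping above can be sketched with `pathlib`. This is a minimal illustration, not the script's actual code: `default_out_dir` is a hypothetical helper, and splitting at the first path component named `traces` is an assumption about how the relative path is derived.

```python
from pathlib import Path


def default_out_dir(trace_dir: str, out_root: str = "exp/proc/output") -> Path:
    """Mirror everything after the `traces/` component under out_root."""
    parts = Path(trace_dir).parts
    if "traces" not in parts:
        raise ValueError(f"no 'traces' component in {trace_dir!r}")
    # Assumption: the first component named "traces" marks the split point.
    idx = parts.index("traces")
    return Path(out_root).joinpath(*parts[idx + 1:])


trace_dir = "exp/exp2/output/traces/exp/exp2/data/morehopqa.jsonl/qwen-8B/my_run_tag/"
out_dir = default_out_dir(trace_dir)
print(out_dir.as_posix())
# prints "exp/proc/output/exp/exp2/data/morehopqa.jsonl/qwen-8B/my_run_tag"
```

Here `my_run_tag` is a placeholder standing in for a real `<run_tag>`.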

You can also pass `--out_dir` to specify the output directory explicitly.

The output directory contains one file per sample: `ex_000000.npz`, `ex_000001.npz`, and so on.

---

## Output `.npz` Fields (Simplified, Contains Only Necessary Information)

Each output sample `.npz` contains **only** the following keys:
- `attr`: `float32[L]`, the row attribution vector; the chat template and EOS are removed, so it covers only the valid tokens of `input+cot+output`.
- `hop`: `float32[H, L]` (optional, FT-IFR methods only), the per-hop vectors; EOS is likewise removed, and the length matches `attr`.
- `tok`: `U[L]`, the sequence of token text fragments, strictly aligned with `attr/hop` (again without chat template or EOS).
- `span_in`: `int64[2]`, the inclusive `[start, end]` index range of the input within the vector.
- `span_cot`: `int64[2]`, the inclusive `[start, end]` index range of the cot within the vector (`[-1, -1]` if there is no cot).
- `span_out`: `int64[2]`, the inclusive `[start, end]` index range of the output within the vector.
- `rise`: `float64`, the row's RISE (faithfulness).
- `mas`: `float64`, the row's MAS (faithfulness).
- `recovery`: `float64`, the row's Recovery@10% (NaN if no recovery value is available).
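As a sketch of how a consumer might read these fields, the example below writes a synthetic sample with the keys listed above and then slices `attr` and `tok` by the inclusive spans. The values are made up for illustration, and `segment` is a hypothetical helper; the main point is that inclusive ranges need `end + 1` when slicing and that `[-1, -1]` means the segment is absent.

```python
import os
import tempfile

import numpy as np

# Synthetic sample using the documented keys (values are illustrative only).
sample = {
    "attr": np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], dtype=np.float32),
    "tok": np.array(["What", " is", " 2+2", "?", " So", " 4"]),
    "span_in": np.array([0, 3], dtype=np.int64),
    "span_cot": np.array([4, 4], dtype=np.int64),
    "span_out": np.array([5, 5], dtype=np.int64),
    "rise": np.float64(0.5),
    "mas": np.float64(0.25),
    "recovery": np.float64("nan"),
}
path = os.path.join(tempfile.mkdtemp(), "ex_000000.npz")
np.savez(path, **sample)


def segment(vec, span):
    """Slice vec by an inclusive [start, end] span; [-1, -1] yields an empty slice."""
    start, end = int(span[0]), int(span[1])
    if start < 0:
        return vec[:0]
    return vec[start:end + 1]


with np.load(path) as data:
    attr, tok = data["attr"], data["tok"]
    in_tokens = segment(tok, data["span_in"]).tolist()
    cot_attr = segment(attr, data["span_cot"])
    out_attr_sum = float(segment(attr, data["span_out"]).sum())
    has_hop = "hop" in data.files  # `hop` is optional (FT-IFR methods only)
    recovery_missing = bool(np.isnan(data["recovery"]))

print(in_tokens, cot_attr.shape, has_hop, recovery_missing)
```

Checking `data.files` for `hop` before indexing avoids a `KeyError` on samples from methods that do not produce per-hop vectors.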

---

## Usage Examples

Most common case (passing the dataset and tokenizer explicitly is recommended):
```bash
python exp/proc/map_exp2_traces_to_proc.py \
  --trace_dir exp/exp2/output/traces/exp/exp2/data/morehopqa.jsonl/qwen-8B/ifr_all_positions_mfaithfulness_gen_95ex \
  --dataset_jsonl exp/exp2/data/morehopqa.jsonl \
  --tokenizer_model /opt/share/models/Qwen/Qwen3-8B/
```

Specify the output directory explicitly (bypassing the default mirrored path):
```bash
python exp/proc/map_exp2_traces_to_proc.py \
  --trace_dir exp/exp2/output/traces/exp/exp2/data/math.jsonl/qwen-8B/ifr_multi_hop_both_n1_mfaithfulness_gen_100ex/ \
  --dataset_jsonl exp/exp2/data/math.jsonl \
  --tokenizer_model /opt/share/models/Qwen/Qwen3-8B/ \
  --out_dir exp/proc/output/math_ifr_multi_hop_both
```

Debug: process only the first 5 samples and allow existing output files to be overwritten:
```bash
python exp/proc/map_exp2_traces_to_proc.py \
  --trace_dir ... \
  --dataset_jsonl ... \
  --tokenizer_model ... \
  --limit 5 \
  --overwrite
```

---

## Common Issues

- Error "Prompt/Generation token length mismatch"
  - This is almost always a tokenizer mismatch. Confirm that `--tokenizer_model` is exactly the tokenizer used during exp2 attribution (using the same value as `--model_path` is recommended).
- Error "Failed to match manifest sha1 to dataset_jsonl"
  - `--dataset_jsonl` is not the cache used for this trace run, or the cache lacks `target`.
- FT-IFR method output is missing `hop`
  - For `ifr_multi_hop_stop_words/ifr_multi_hop_both/ifr_multi_hop_split_hop/ifr_in_all_gen`, the exp2 trace must contain `vh`. If the trace is old, re-run exp2 with `--save_hop_traces`.
  - If needed, add `--allow_missing_ft_hops` to force output (not recommended).
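
Before re-running exp2, it may help to check which trace files actually lack `vh`. The sketch below assumes that `vh` is stored as a top-level key in each `ex_*.npz` trace file, which this README implies but does not state outright; `traces_missing_key` is a hypothetical helper, and the demo files are synthetic.

```python
import os
import tempfile

import numpy as np


def traces_missing_key(trace_dir, key="vh"):
    """List ex_*.npz files in trace_dir that do not contain the given key."""
    missing = []
    for name in sorted(os.listdir(trace_dir)):
        if name.startswith("ex_") and name.endswith(".npz"):
            with np.load(os.path.join(trace_dir, name)) as data:
                if key not in data.files:
                    missing.append(name)
    return missing


# Synthetic trace folder: one file with `vh`, one without.
demo_dir = tempfile.mkdtemp()
np.savez(os.path.join(demo_dir, "ex_000000.npz"), vh=np.zeros((2, 3), dtype=np.float32))
np.savez(os.path.join(demo_dir, "ex_000001.npz"), other=np.zeros(3, dtype=np.float32))

print(traces_missing_key(demo_dir))  # prints "['ex_000001.npz']"
```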
