# exp/exp2 Datasets and Sample Flow

This file describes the datasets supported in Experiment 2 (exp2), the sample structure, and how samples are processed during the sampling phase and the attribution phase.

## Supported Datasets
- `morehopqa` (`data/with_human_verification.json`)
- RULER series JSONL: `hotpotqa_long`, `niah_*`, `vt_*` (located automatically under `data/ruler_multihop/<len>/.../validation.jsonl`); alternatively, pass any RULER JSONL path directly
- Other datasets (such as math) are explicitly skipped
- The attribution phase first looks for the cache file `exp/exp2/data/<name>.jsonl` and otherwise parses the dataset using the rules above; an existing JSONL path passed directly is also loaded as a RULER structure
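
The routing rules above can be sketched as a small classifier. This is only an illustration: `classify_dataset` and `RULER_PATTERNS` are hypothetical names, not the loader's actual API, and the sketch assumes glob-style matching of the `niah_*` / `vt_*` names.

```python
import fnmatch
import os
from typing import Optional

# Hypothetical pattern table for illustration; the real routing lives in the exp2 loader.
RULER_PATTERNS = ("hotpotqa_long", "niah_*", "vt_*")


def classify_dataset(name: str) -> Optional[str]:
    """Return 'morehopqa', 'ruler', or None (skipped) for a --dataset value."""
    if name == "morehopqa":
        return "morehopqa"
    if any(fnmatch.fnmatchcase(name, pattern) for pattern in RULER_PATTERNS):
        return "ruler"
    # An existing JSONL path is parsed as a RULER file.
    if os.path.isfile(name):
        return "ruler"
    # Anything else (e.g. math datasets) is skipped.
    return None
```

For example, `classify_dataset("niah_mq_q2")` yields `"ruler"`, while `classify_dataset("math")` yields `None` as long as no file of that name exists.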

### Common Sample Field Definitions
```json
{
  "prompt": "<context+question>",
  "target": "<answer or generation>",
  "indices_to_explain": [start_tok, end_tok] | null, // token-level: generation token span to explain (closed interval)
  "attr_mask_indices": [...],       // legacy: coverage gold sentence indices (not used in current exp2), may be null
  "sink_span": [start, end] | null, // answer fragment in generation tokens
  "thinking_span": [start, end] | null, // CoT fragment in generation tokens
  "metadata": { ... }               // dataset-specific metadata
}
```
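
As a rough illustration of the schema above, the fields can be mirrored in a dataclass that round-trips through one JSONL line. `CachedExampleSketch`, `to_jsonl_line`, and `from_jsonl_line` are hypothetical names; the actual `CachedExample` in `dataset_utils.py` may differ in defaults and helpers.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class CachedExampleSketch:
    """Illustrative mirror of the common sample fields (not the real class)."""
    prompt: str
    target: Optional[str] = None
    indices_to_explain: Optional[List[int]] = None  # closed token interval
    attr_mask_indices: Optional[List[int]] = None   # legacy, may be None
    sink_span: Optional[List[int]] = None           # answer tokens in target
    thinking_span: Optional[List[int]] = None       # CoT tokens in target
    metadata: Dict[str, Any] = field(default_factory=dict)


def to_jsonl_line(example: CachedExampleSketch) -> str:
    """Serialize one example as a single JSON line (newlines are escaped)."""
    return json.dumps(asdict(example), ensure_ascii=False)


def from_jsonl_line(line: str) -> CachedExampleSketch:
    """Parse one cache line back into the in-memory structure."""
    return CachedExampleSketch(**json.loads(line))
```
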
- **`CachedExample`**: Unified in-memory structure from `dataset_utils.py`. Its fields exactly match the JSON above. It is used in the sampling phase (loading raw data) and in the attribution phase (loading cache or raw data).
- **Cache line (JSONL)**: Each JSON line written by `sample_and_filter.py` corresponds one-to-one with the `CachedExample` fields.
- **Sampling phase processing flow (common)**:
  1. Load raw dataset samples, keeping `prompt`, `indices_to_explain`, and the other fields unchanged.
  2. Call the generation model with the prompt template, which requires thinking text followed by a trailing `\box{}` answer.
  3. If the generation does not match the format "thinking text + a single `\box{}` with nothing after it", discard the sample.
  4. Extract the thinking fragment and the `\box{}` content; only the `\box{}` content is passed to the judge model.
  5. If the judge returns True, reassemble `target` as "thinking fragment + answer text without the box wrapper" and record `sink_span`/`thinking_span` accordingly.
  6. Write to cache: besides the dataset metadata, only `reference_answer` and `judge_response` (plus the optional `boxed_answer`) are kept; `candidate_answer` is no longer stored.
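
The steps above can be sketched as one filtering function. This is a minimal sketch under stated assumptions: `sample_one` is a hypothetical name, the model, splitter, and judge are passed in as plain callables, the reference answer is assumed to live in `metadata["reference_answer"]` or `metadata["answer"]`, and span computation is left out because it needs a tokenizer.

```python
from typing import Callable, Dict, Optional, Tuple


def sample_one(
    raw: Dict,
    generate: Callable[[str], str],
    split: Callable[[str], Optional[Tuple[str, str]]],
    judge: Callable[[str, str], bool],
) -> Optional[Dict]:
    """Return a cache record for one raw sample, or None if it is filtered out."""
    generation = generate(raw["prompt"])              # step 2: call the model
    parts = split(generation)                          # step 3: validate format
    if parts is None:
        return None                                    # malformed: discard
    thinking, boxed = parts                            # step 4: extract pieces
    meta = dict(raw.get("metadata", {}))
    reference = meta.get("reference_answer", meta.get("answer"))
    verdict = judge(boxed, reference)                  # judge sees only the boxed answer
    if not verdict:
        return None
    target = f"{thinking}\n{boxed}"                    # step 5: reassemble target
    meta.update(
        reference_answer=reference,
        judge_response=str(verdict),
        boxed_answer=boxed,
    )
    meta.pop("candidate_answer", None)                 # step 6: not stored anymore
    return {**raw, "target": target, "metadata": meta}
```

Passing the collaborators as callables keeps the sketch independent of any particular model or judge; in the real pipeline these roles are filled by `sample_and_filter.py`.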

### Generation Splitting and Span Parsing
- `split_boxed_generation` (`dataset_utils.py`) validates the format: the generation must be non-empty thinking text followed by a single trailing `\box{}` with nothing after the box; otherwise the sample is skipped.
- `target` is reassembled as thinking fragment + newline + final answer text (without the box).
- `attach_spans_from_answer` uses the tokenizer's offset mapping to map the final answer's character interval in `target` to token-level indices, yielding `sink_span`; `thinking_span` is the closed interval from the first token to the token just before `sink_span`. Both are token-level spans, which satisfies the multi-hop IFR calling convention.
- `indices_to_explain` is uniformly set to `sink_span` when writing the cache, i.e. the generation token span of the boxed content in `target`.
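
The splitting and span logic described above might look roughly like the following. This is a sketch, not the code in `dataset_utils.py`: `split_boxed`, `attach_spans`, and `whitespace_offsets` are hypothetical names, the regex does not handle nested braces, and a whitespace tokenizer stands in for a real tokenizer's offset mapping so the example runs without dependencies.

```python
import re
from typing import Callable, List, Optional, Tuple

# Assumed format: non-empty thinking text, then exactly one trailing \box{...}.
# Nested braces inside the box are not supported by this simple pattern.
_BOX = re.compile(r"\\box\{([^{}]*)\}")


def split_boxed(generation: str) -> Optional[Tuple[str, str]]:
    """Return (thinking, answer), or None if the format is violated."""
    matches = list(_BOX.finditer(generation))
    if len(matches) != 1:
        return None
    match = matches[0]
    thinking = generation[: match.start()].strip()
    tail = generation[match.end():].strip()
    answer = match.group(1).strip()
    if not thinking or tail or not answer:
        return None
    return thinking, answer


def whitespace_offsets(text: str) -> List[Tuple[int, int]]:
    """Toy offset mapping: one token per whitespace-delimited word."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def attach_spans(
    thinking: str,
    answer: str,
    offsets_fn: Callable[[str], List[Tuple[int, int]]] = whitespace_offsets,
):
    """Build target and map the answer's character range to closed token spans."""
    target = f"{thinking}\n{answer}"
    ans_start, ans_end = len(thinking) + 1, len(target)
    offsets = offsets_fn(target)
    # A token belongs to the sink if its character range overlaps the answer.
    sink_tokens = [i for i, (s, e) in enumerate(offsets) if s < ans_end and e > ans_start]
    if not sink_tokens:
        return target, None, None
    sink_span = [sink_tokens[0], sink_tokens[-1]]
    thinking_span = [0, sink_span[0] - 1] if sink_span[0] > 0 else None
    return target, sink_span, thinking_span
```

With a real tokenizer, `offsets_fn` would return the character offsets of each token of `target` encoded without special tokens, so that indices line up with the `indices_to_explain` coordinate system.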

---

## MoreHopQA
- **Raw sample structure (`MoreHopQAAttributionDataset` → `CachedExample`)**
  ```json
  {
    "prompt": "<context concatenation>\n<question>",
    "target": null,
    "indices_to_explain": null,
    "attr_mask_indices": null,
    "sink_span": null,
    "thinking_span": null,
    "metadata": {
      "answer": "<gold answer>",
      "_id": "<example id>",
      "original_context": <original context structure>
    }
  }
  ```
  - Load timing: `DatasetLoader.load_raw("morehopqa")` produces a `CachedExample` in the sampling phase, and in the attribution phase when no cache exists.
  - Note: exp2's token-level row/rec needs a `target` and a locatable answer token span, so run `sample_and_filter.py` to produce the cache before attribution evaluation.

- **Sampling phase (generation & filtering then write cache)**
  ```json
  {
    "prompt": "<same as above>",
    "target": "<generated CoT + final answer text (box wrapper removed)>",
    "indices_to_explain": [start_tok, end_tok],
    "attr_mask_indices": null,
    "sink_span": [start_tok, end_tok] | null,
    "thinking_span": [start_tok, end_tok] | null,
    "metadata": {
      "answer": "<gold answer>",
      "_id": "<example id>",
      "original_context": <original context structure>,
      "reference_answer": "<gold answer>",
      "judge_response": "<True/False text>",
      "boxed_answer": "<optional, boxed parse result>"
    }
  }
  ```
  - `sink_span`/`thinking_span`: only populated when `\box{}` is parsed successfully; `target` is the trimmed "thinking + final answer text".
  - Written to: `exp/exp2/data/morehopqa.jsonl`.

- **Attribution phase (load cache priority)**
  - Loading: `run_exp.py` tries `load_cached` first (JSONL → `CachedExample`); otherwise it falls back to the raw structure and generates `target` on the fly.
  - Usage: Faithfulness (token-level RISE/MAS) uses the cached `target` directly; `ifr_multi_hop` restricts attribution to answer/CoT when `sink_span`/`thinking_span` are available, and otherwise treats the entire generation as the sink.
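
The fallback for `ifr_multi_hop` can be illustrated with a small helper. This is a sketch only: `resolve_ifr_spans` is a hypothetical name, and it assumes that both spans must be present for the answer/CoT split to apply, which the description above does not state explicitly.

```python
from typing import List, Optional, Tuple

Span = List[int]  # closed interval [start_tok, end_tok] over generation tokens


def resolve_ifr_spans(
    num_gen_tokens: int,
    sink_span: Optional[Span],
    thinking_span: Optional[Span],
) -> Tuple[Span, Optional[Span]]:
    """Pick the (sink, thinking) spans, falling back to the whole generation."""
    if num_gen_tokens <= 0:
        raise ValueError("target must contain at least one token")
    if sink_span is None or thinking_span is None:
        # Assumption: without both spans, the entire generation is the sink.
        return [0, num_gen_tokens - 1], None
    return list(sink_span), list(thinking_span)
```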

---

## RULER HotpotQA (`hotpotqa_long`)
- **Raw sample structure (`RulerAttributionDataset` → `CachedExample`)**
  ```json
  {
    "prompt": "<input> + <answer_prefix>",
    "target": "<answer_prefix + sep + ', '.join(outputs)>",
    "indices_to_explain": [0],
    "attr_mask_indices": [<sentence indices>...] | null,
    "sink_span": null,
    "thinking_span": null,
    "metadata": {
      "dataset": "ruler",
      "length": <int>,
      "length_w_model_temp": <any>,
      "outputs": [...],
      "answer_prefix": "<str>",
      "token_position_answer": <any>,
      "needle_spans": [
        {
          "title": "<str>",
          "doc_index": <int>,
          "document_number": <int>,
          "sentence_index": <int>,
          "sentence": "<str>",
          "context_span": [start, end],
          "span": [start, end],
          "snippet": "<str>"
        },
        ...
      ],
      "prompt_sentence_count": <int>,
      "reference_answer": "<supplemented in loader, from outputs or target>"
    }
  }
  ```
  - Load timing: `DatasetLoader.load_raw("hotpotqa_long")` produces a `CachedExample` in the sampling phase, and in the attribution phase when no cache exists.

- **Sampling phase (generation & filtering then write cache)**
  ```json
  {
    "prompt": "<same as above>",
    "target": "<generated CoT + final answer text (box wrapper removed)>",
    "indices_to_explain": [start_tok, end_tok],
    "attr_mask_indices": [<sentence indices>...] | null,
    "sink_span": [start_tok, end_tok] | null,
    "thinking_span": [start_tok, end_tok] | null,
    "metadata": {
      "dataset": "ruler",
      "length": <int>,
      "length_w_model_temp": <any>,
      "outputs": [...],
      "answer_prefix": "<str>",
      "token_position_answer": <any>,
      "needle_spans": [...],
      "prompt_sentence_count": <int>,
      "reference_answer": "<outputs concatenation or target>",
      "judge_response": "<True/False text>",
      "boxed_answer": "<optional>"
    }
  }
  ```
  - `attr_mask_indices` keeps its original value; `indices_to_explain` is set to `sink_span` (the token-level closed interval of the final answer in `target`), as for all cached datasets; `sink_span`/`thinking_span` are only populated when `\box{}` is parsed successfully; `target` is the trimmed "thinking + final answer text".
  - Written to: `exp/exp2/data/hotpotqa_long.jsonl`.

- **Attribution phase (load cache priority)**
  - Loading: tries `load_cached` first (JSONL → `CachedExample`); otherwise falls back to raw parsing.
  - Usage: Coverage (legacy, not used in current exp2) reads `attr_mask_indices`; faithfulness and `ifr_multi_hop` use the cached `sink_span`/`thinking_span` to locate answer/CoT, and treat the entire generation as the sink if they are missing.

---

## RULER NIAH / Variable Tracking (`niah_*`, `vt_*`)
- **Raw sample structure (same as RULER generic)**
  ```json
  {
    "prompt": "<input> + <answer_prefix>",
    "target": "<answer_prefix + sep + ', '.join(outputs)>",
    "indices_to_explain": [0],
    "attr_mask_indices": [<sentence indices>...] | null,
    "sink_span": null,
    "thinking_span": null,
    "metadata": {
      "dataset": "ruler",
      "length": <int>,
      "length_w_model_temp": <any>,
      "outputs": [...],
      "answer_prefix": "<str>",
      "token_position_answer": <any>,
      "needle_spans": [...],
      "prompt_sentence_count": <int>,
      "reference_answer": "<supplemented in loader>"
    }
  }
  ```
  - Load timing: `DatasetLoader.load_raw("<niah_* or vt_*>")` is used in the sampling phase, and in the attribution phase when no cache exists.

- **Sampling phase (generation & filtering then write cache)**
  ```json
  {
    "prompt": "<same as above>",
    "target": "<thinking + final answer text (no box), no other tail>",
    "indices_to_explain": [start_tok, end_tok],
    "attr_mask_indices": [<sentence indices>...] | null,
    "sink_span": [start_tok, end_tok] | null,
    "thinking_span": [start_tok, end_tok] | null,
    "metadata": {
      "dataset": "ruler",
      "length": <int>,
      "length_w_model_temp": <any>,
      "outputs": [...],
      "answer_prefix": "<str>",
      "token_position_answer": <any>,
      "needle_spans": [...],
      "prompt_sentence_count": <int>,
      "reference_answer": "<outputs concatenation or target>",
      "judge_response": "<True/False text>",
      "boxed_answer": "<optional>"
    }
  }
  ```
  - The generation/judge flow is the same as for `hotpotqa_long`; `target` is the trimmed "thinking + final answer text".
  - Written to: `exp/exp2/data/<dataset>.jsonl` (e.g., `niah_mq_q2.jsonl`, `vt_h6_c1.jsonl`).

- **Attribution phase (load cache priority)**
  - Same as `hotpotqa_long`: the cache is loaded first, otherwise the raw data; the recovery rate (`recovery_ruler`) uses `metadata.needle_spans` (mapped to prompt tokens); multi-hop IFR applies to answer/CoT when `sink_span`/`thinking_span` are available.

---

## `indices_to_explain` Convention
- Token-level: `indices_to_explain = [start_tok, end_tok]` is a closed interval whose coordinate system is the generation token indices produced by `tokenizer(target, add_special_tokens=False)`.
- exp2 recommendation: `indices_to_explain == sink_span`, i.e. the token span of the boxed content (final answer) in `target`.
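
Because the interval is closed, slicing needs `end_tok + 1`. The following sketch shows the off-by-one handling on a plain token list; `tokens_to_explain` is a hypothetical helper and the token list stands in for real tokenizer output.

```python
from typing import List, Sequence


def tokens_to_explain(tokens: Sequence[str], indices_to_explain: List[int]) -> List[str]:
    """Select the tokens covered by a closed [start_tok, end_tok] interval."""
    start, end = indices_to_explain
    if not (0 <= start <= end < len(tokens)):
        raise ValueError(f"span {indices_to_explain} is outside 0..{len(tokens) - 1}")
    # Closed interval: the end index itself is included.
    return list(tokens[start : end + 1])
```

For a target tokenized as `["It", "is", "Paris"]`, the span `[2, 2]` selects the single token `"Paris"`.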

---

## Custom RULER JSONL Path
- If `--dataset` is given an existing JSONL path, `dataset_from_name` parses it as a RULER file; fields and flow are the same as for the RULER series.
- Sampling and attribution phase behavior matches the RULER description above, except that the file is determined by the explicit path.

---

## Attribution Phase Load Priority and Effects
- `run_exp.py` load order: cache `exp/exp2/data/<name>.jsonl` > explicitly given JSONL path > raw parsing (MoreHopQA or RULER).
- Recovery rate (`mode=recovery_ruler`) supports only RULER datasets because it requires `metadata.needle_spans`; other datasets are rejected.
- Faithfulness (`mode=faithfulness_gen`) uses the generated text; `ifr_multi_hop` performs multi-hop attribution on answer/CoT only when `sink_span`/`thinking_span` are available, and otherwise falls back to the entire generation.
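
The load order above can be sketched as a simple resolution function. `pick_source` and its return labels are hypothetical, and the real `run_exp.py` may structure this differently; the sketch only illustrates the priority of cache over explicit path over raw parsing.

```python
import os
from typing import Tuple


def pick_source(name: str, cache_dir: str = "exp/exp2/data") -> Tuple[str, str]:
    """Return (source_kind, location) following the documented load priority."""
    cache_path = os.path.join(cache_dir, f"{name}.jsonl")
    if os.path.isfile(cache_path):
        return "cache", cache_path      # 1. cached JSONL wins
    if os.path.isfile(name):
        return "ruler_jsonl", name      # 2. explicit JSONL path, parsed as RULER
    return "raw", name                  # 3. raw parsing by dataset name
```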
