# Generative Evidence Calibration (GEC) - Verification Artifact

This artifact implements the complete **Generative Evidence Calibration** (GEC) pipeline from our ICLR 2026 submission. It includes guarded product-of-experts fusion (gPoE-HeadSafe), Oracle Upper Bound (OUB) analysis, PoE Reachability Audits (PRA), and learned weighted reciprocal rank fusion (WRRF).

## Overview

**Goal**: Reproduce the GEC methodology (Multi-GES → gPoE-HeadSafe → Oracle Analysis) on public BEIR datasets using your own compute.

**Key Innovation**: Evidence-based document scoring through portfolio synthesis, guarded product-of-experts fusion, and systematic oracle headroom analysis.

## Quick Start (Verification Mode)

```bash
# Run tests to verify system mechanics
bash tests/toy_poe_guard_test.sh
bash tests/toy_wrrf_test.sh

# Mock mode - validates pipeline without LLM inference
bash run_gec_verify.sh \
  --base /path/to/fiqa_beir \
  --qrels /path/to/fiqa_qrels.tsv \
  --outdir results/fiqa_mock \
  --qids-file qids100.txt \
  --mock-ges
```

## Full Reproduction

```bash
# Real mode - requires HuggingFace model and GPU
bash run_gec_verify.sh \
  --base /path/to/scifact_beir \
  --qrels /path/to/scifact_qrels.tsv \
  --outdir results/scifact_real \
  --qids-file qids100.txt \
  --real-ges \
  --model mistralai/Mistral-7B-Instruct-v0.3
```

## What This Artifact Provides

### Complete Pipeline Implementation
- **Portfolio Synthesis**: Multi-perspective evidence generation with deterministic mock mode
- **Multi-GES Extraction**: Weighted citation aggregation across synthesis variants
- **gPoE-HeadSafe**: Guarded product-of-experts with safety constraints
- **GEC-WRRF**: Learned weighted reciprocal rank fusion with gate features
- **Oracle Analysis**: Upper bound computation and reachability auditing
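
To make the fusion step above concrete, here is a minimal sketch of weighted reciprocal rank fusion. It is not the code in `rescore_as_wrrf.py`: the function name `wrrf_fuse`, the fixed weights (the artifact learns them from gate features), and the smoothing constant `k=60` (the value commonly used for RRF) are all assumptions made for illustration.

```python
from collections import defaultdict


def wrrf_fuse(runs, weights, k=60):
    """Fuse ranked lists with weighted reciprocal rank fusion.

    runs:    {run_name: [doc_id, ...]} ordered best-first
    weights: {run_name: float}; fixed here, learned in the real pipeline
    k:       smoothing constant (assumed value)
    """
    scores = defaultdict(float)
    for name, ranking in runs.items():
        w = weights.get(name, 0.0)
        for rank, doc_id in enumerate(ranking, start=1):
            # Each run contributes w / (k + rank) for every document it ranks.
            scores[doc_id] += w / (k + rank)
    # Sort by fused score; break ties by doc_id so the output is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy rankings for a single query.
runs = {
    "bm25": ["d1", "d2", "d3"],
    "bge": ["d2", "d1", "d4"],
    "multi_ges": ["d2", "d4", "d1"],
}
weights = {"bm25": 0.2, "bge": 0.5, "multi_ges": 0.3}
fused = wrrf_fuse(runs, weights)
print([doc_id for doc_id, _ in fused])  # ['d2', 'd1', 'd4', 'd3']
```

Because only ranks enter the formula, the component runs do not need comparable score scales, which is the usual reason to prefer rank fusion over score averaging.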

### Guard Mechanism Validation  
- Head-freeze preserves top-k precision
- Max-jump limits rank movements  
- Min-GES filters weak evidence signals
- Lambda-cap bounds boost magnitude
- Cutoff-target maintains coverage
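
The sketch below illustrates how four of these guards could interact in a single re-ranking pass. The semantics are assumptions inferred from the guard names, not the logic of `poe_selective_mul_guarded.py`: the boost factor `1 + ges`, the function name `guarded_rerank`, and the greedy slot-filling used for max-jump are all hypothetical. Cutoff-target is left out because its exact behaviour is not described here.

```python
def guarded_rerank(base, ges, head=4, lam_cap=1.10, max_jump=60, tau=0.25):
    """Toy guarded re-ranking under assumed guard semantics.

    base: list of (doc_id, score) sorted best-first, with positive scores
    ges:  {doc_id: evidence score in [0, 1]}
    """
    boosted = []
    for pos, (doc, score) in enumerate(base):
        g = ges.get(doc, 0.0)
        # Min-GES: evidence below tau is ignored.
        # Lambda-cap: the multiplicative boost never exceeds lam_cap.
        factor = min(1.0 + g, lam_cap) if g >= tau else 1.0
        boosted.append((doc, score * factor, pos))

    # Head-freeze: the first `head` documents keep their base order.
    result = [doc for doc, _, _ in boosted[:head]]
    remaining = boosted[head:]

    # Max-jump: fill each slot with the best boosted document that would
    # move up by at most `max_jump` positions from its base rank.
    slot = len(result)
    while remaining:
        eligible = [t for t in remaining if t[2] - slot <= max_jump]
        best = max(eligible, key=lambda t: (t[1], -t[2]))
        result.append(best[0])
        remaining.remove(best)
        slot += 1
    return result


# Hypothetical toy input: "d" has strong evidence, "e" has weak evidence.
base = [("a", 1.0), ("b", 0.9), ("c", 0.8), ("d", 0.79), ("e", 0.5)]
ges = {"d": 0.9, "e": 0.1}
print(guarded_rerank(base, ges, head=2, max_jump=1))  # ['a', 'b', 'd', 'c', 'e']
print(guarded_rerank(base, ges, head=4, max_jump=1))  # ['a', 'b', 'c', 'd', 'e']
```

With `head=2`, document `d` is boosted by the capped factor 1.10 and overtakes `c`; with `head=4` it sits inside the frozen head and the base order is returned unchanged.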

### Analysis Tools
- **Oracle Upper Bound**: Theoretical maximum across fusion components
- **PoE Reachability Audit**: Practical headroom under guard constraints  
- **Performance Metrics**: MRR@10, nDCG@10, Recall@50 with confidence intervals
- **Component Ablations**: Individual contribution analysis
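
As an illustration of the metrics listed above, the following sketch computes MRR@10 with a confidence interval. The interval method is an assumption: this README does not say how `eval_trec_runs.py` derives its intervals, so a percentile bootstrap over queries is used here. The names `reciprocal_rank` and `mrr_with_ci` are hypothetical.

```python
import random


def reciprocal_rank(ranked, relevant, k=10):
    """Return 1/rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mrr_with_ci(run, qrels, k=10, n_boot=1000, seed=0):
    """Mean reciprocal rank at k plus a 95% percentile-bootstrap interval.

    run:   {qid: [doc_id, ...]} ordered best-first
    qrels: {qid: set of relevant doc_ids}
    """
    qids = sorted(qrels)
    rr = [reciprocal_rank(run.get(q, []), qrels[q], k) for q in qids]
    mean = sum(rr) / len(rr)
    rng = random.Random(seed)  # fixed seed keeps the interval repeatable
    # Resample queries with replacement and recompute the mean each time.
    means = sorted(
        sum(rng.choices(rr, k=len(rr))) / len(rr) for _ in range(n_boot)
    )
    lo = means[int(0.025 * n_boot)]
    hi = means[int(0.975 * n_boot) - 1]
    return mean, (lo, hi)


# Hypothetical toy data: reciprocal ranks are 1.0, 0.5 and 0.0.
run = {"q1": ["d1", "d2"], "q2": ["d3", "d4"], "q3": ["d9"]}
qrels = {"q1": {"d1"}, "q2": {"d4"}, "q3": {"d5"}}
mean, (lo, hi) = mrr_with_ci(run, qrels)
print(round(mean, 3))  # 0.5
```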

## Data Requirements

**You provide** (not included):
- BEIR dataset directories (corpus.jsonl, queries.jsonl, qrels.tsv)
- HuggingFace model for real mode (mock mode needs none)
- Query ID files for evaluation slices

**We include**:
- Complete analysis pipeline
- Deterministic mock evidence generator
- Guard policy configurations
- Evaluation and ablation scripts
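
For orientation, here is a small sketch of reading the input files named above. It assumes the standard BEIR layout (JSONL records with `_id`, optional `title`, and `text`; a tab-separated qrels file with a `query-id`, `corpus-id`, `score` header). The helper names `load_jsonl_text` and `load_qrels` are hypothetical, and the artifact's own loaders may differ.

```python
import csv
import json


def load_jsonl_text(path):
    """Map each record's `_id` to its text (title + text when a title exists)."""
    out = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            rec = json.loads(line)
            title = rec.get("title", "")
            out[rec["_id"]] = f"{title} {rec['text']}".strip()
    return out


def load_qrels(path):
    """Return {qid: set of doc_ids judged relevant (score > 0)}."""
    qrels = {}
    with open(path, encoding="utf-8", newline="") as fh:
        for row in csv.reader(fh, delimiter="\t"):
            # Skip malformed rows and the header line.
            if len(row) < 3 or row[0] == "query-id":
                continue
            qid, doc_id, score = row[0], row[1], int(row[2])
            if score > 0:
                qrels.setdefault(qid, set()).add(doc_id)
    return qrels
```

A query ID file for `--qids-file` could then be written from `sorted(load_qrels(path))`, assuming the runner expects one query ID per line.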

## System Requirements

```bash
# Environment setup
conda env create -f env/environment.yml
conda activate gec

# Or with pip
pip install -r env/requirements.txt
```

**Hardware**:
- Mock mode: CPU only  
- Real mode: GPU (6GB+ VRAM for Mistral-7B with 4-bit quantization)
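
The VRAM figure can be sanity-checked with rough arithmetic. The sketch below counts weight storage only and uses nominal parameter counts; activations, the KV cache, and quantization metadata add overhead on top, so treat the numbers as lower bounds. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal sizes; real checkpoints differ slightly from the rounded counts.
for name, n_params in [("3.8B", 3.8e9), ("7B", 7e9), ("70B", 70e9)]:
    print(name, round(weight_memory_gb(n_params, 4), 2), "GB at 4-bit")
# 3.8B 1.9 GB at 4-bit
# 7B 3.5 GB at 4-bit
# 70B 35.0 GB at 4-bit
```

A 7B model at 4 bits therefore needs about 3.5 GB for weights, which leaves some room for runtime overhead within the 6 GB guideline.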

## Key Results (From Paper)

The paper reports consistent relative MRR@10 improvements over the BGE baseline across datasets:
- Natural Questions: +12.1% over BGE (0.331 → 0.371 MRR@10)  
- FiQA-2018: +18.4% over BGE (0.245 → 0.290 MRR@10)
- SciFact: +8.7% over BGE (0.583 → 0.634 MRR@10)
- TREC-COVID: +14.2% over BGE (0.467 → 0.533 MRR@10)
- HotpotQA: +10.8% over BGE (0.542 → 0.601 MRR@10)

**Oracle Analysis**: Average 0.147 MRR@10 headroom with 58.3% reachable under guard constraints.
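
One way to read these two numbers is sketched below. The definitions are assumptions, not taken from `oracle_upper_bound.py` or `poe_reachability_audit.py`: the oracle picks, per query, the best reciprocal rank among the component runs; headroom is oracle minus baseline; and the reachable fraction is the share of that headroom the guarded run recovers. The function name `oracle_headroom` is hypothetical.

```python
def oracle_headroom(per_query_rr, baseline="bge", achieved="gpoe_headsafe"):
    """Summarise headroom under the assumed definitions.

    per_query_rr: {run_name: {qid: reciprocal rank at 10}}
    """
    qids = sorted(per_query_rr[baseline])
    n = len(qids)
    base = sum(per_query_rr[baseline][q] for q in qids) / n
    got = sum(per_query_rr[achieved][q] for q in qids) / n
    # Oracle: best component per query, then averaged over queries.
    oracle = sum(max(rr[q] for rr in per_query_rr.values()) for q in qids) / n
    headroom = oracle - base
    reachable = (got - base) / headroom if headroom > 0 else 0.0
    return {
        "baseline": base,
        "oracle": oracle,
        "headroom": headroom,
        "reachable_fraction": reachable,
    }


# Hypothetical per-query reciprocal ranks for two queries.
per_query_rr = {
    "bge": {"q1": 0.5, "q2": 0.5},
    "multi_ges": {"q1": 1.0, "q2": 1.0},
    "gpoe_headsafe": {"q1": 1.0, "q2": 0.5},
}
summary = oracle_headroom(per_query_rr)
print(summary["headroom"], summary["reachable_fraction"])  # 0.5 0.5
```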

## Repository Structure

```
artifact/
├── run_gec_verify.sh          # Main verification runner
├── mock_ges.py                 # Deterministic evidence generator
├── scripts/
│   ├── oracle_upper_bound.py   # OUB computation
│   ├── poe_reachability_audit.py # PRA analysis
│   ├── poe_selective_mul_guarded.py # gPoE-HeadSafe fusion
│   ├── rescore_as_wrrf.py      # Learned WRRF fusion
│   ├── make_gate_features_plus.py # Query feature extraction
│   └── eval_trec_runs.py       # Performance evaluation
├── tests/
│   ├── toy_poe_guard_test.sh   # Guard mechanism validation
│   └── toy_wrrf_test.sh        # WRRF determinism test
└── env/
    ├── environment.yml         # Conda environment
    └── requirements.txt        # pip requirements
```

## Configuration Examples

### Guard Policies
```bash
# HeadSafe (default): Conservative, maintains precision
--guards "H=4,L=1.10,J=60,C=10,TAU=0.25"

# Aggressive: Higher recall, lower precision
--guards "H=2,L=1.25,J=100,C=4,TAU=0.15"
```
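
The guard string is a comma-separated list of `KEY=value` pairs. A minimal parser might look like the sketch below; `parse_guards` is a hypothetical helper, and the key meanings are inferred from the guard list rather than confirmed by the runner (`H` head-freeze, `L` lambda-cap, `J` max-jump, `C` cutoff-target, `TAU` min-GES threshold).

```python
def parse_guards(spec):
    """Parse a string such as 'H=4,L=1.10,J=60,C=10,TAU=0.25' into a dict."""
    guards = {}
    for item in spec.split(","):
        key, _, value = item.partition("=")
        key, value = key.strip().upper(), value.strip()
        if not key or not value:
            raise ValueError(f"malformed guard entry: {item!r}")
        # Values with a decimal point become floats, the rest integers.
        guards[key] = float(value) if "." in value else int(value)
    return guards


print(parse_guards("H=4,L=1.10,J=60,C=10,TAU=0.25"))
# {'H': 4, 'L': 1.1, 'J': 60, 'C': 10, 'TAU': 0.25}
```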

### Model Options
```bash
# Efficient (3.8B parameters)
--model microsoft/Phi-3-mini-4k-instruct

# Standard (7B parameters)  
--model mistralai/Mistral-7B-Instruct-v0.3

# Large (70B parameters, requires substantial GPU memory)
--model meta-llama/Llama-3.1-70B-Instruct
```

## Expected Outputs

After running the pipeline, expect these files in `--outdir`:

**Core Results**:
- `bm25.trec`, `bge.trec` - Baseline retrieval runs
- `multi_ges.trec` - Evidence-based ranking  
- `gec_wrrf.trec` - Learned fusion result
- `gpoe_headsafe.trec` - Guarded PoE result
- `oracle.trec` - Theoretical upper bound

**Analysis**:
- `oracle_analysis.json` - OUB headroom analysis
- `pra_analysis.json` - Reachability audit results
- `eval_summary.json` - Performance metrics across all runs
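
To inspect the run files, a short reader is enough. This sketch assumes the `.trec` files follow the standard six-column TREC run format (`qid Q0 doc_id rank score tag`); the README does not state this explicitly, and `read_trec_run` is a hypothetical helper.

```python
from collections import defaultdict


def read_trec_run(lines):
    """Return {qid: [doc_id, ...]} ordered by descending score."""
    run = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        qid, _q0, doc_id, _rank, score, _tag = parts
        run[qid].append((doc_id, float(score)))
    return {
        qid: [doc_id for doc_id, _ in sorted(docs, key=lambda d: -d[1])]
        for qid, docs in run.items()
    }


# Hypothetical run content; in practice pass an open file handle instead.
example = [
    "q1 Q0 d7 1 12.5 bm25",
    "q1 Q0 d3 2 11.0 bm25",
    "q2 Q0 d9 1 8.2 bm25",
]
print(read_trec_run(example))  # {'q1': ['d7', 'd3'], 'q2': ['d9']}
```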

## Verification Notes

**Mock Mode** (default):
- Uses deterministic evidence generation
- Validates all guard mechanics and fusion logic  
- Demonstrates complete pipeline without LLM dependency
- Results are repeatable but do not reflect realistic performance

**Real Mode** (`--real-ges`):
- Requires actual LLM inference
- Produces realistic performance improvements
- Computationally expensive, but gives a full reproduction
- Follows the methodology described in the paper
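
Deterministic mock evidence can be produced without any model by hashing the inputs. The sketch below shows one such scheme; it is an illustration of the idea, not the implementation in `mock_ges.py`, and the function name `mock_ges_score` and the `seed` string are hypothetical.

```python
import hashlib


def mock_ges_score(qid, doc_id, seed="gec-mock"):
    """Map a (query, document) pair to a repeatable pseudo-score in [0, 1)."""
    key = f"{seed}|{qid}|{doc_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an unsigned integer and scale to [0, 1).
    return int.from_bytes(digest[:8], "big") / 2**64


score = mock_ges_score("q1", "d42")
assert score == mock_ges_score("q1", "d42")  # same inputs, same score
assert 0.0 <= score < 1.0
```

`hashlib` is used rather than Python's built-in `hash()` because string hashes from `hash()` are salted per process by default, so they would not be repeatable across runs.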

## Citation

If you use this artifact, please cite our paper:

```bibtex
@inproceedings{gec2026iclr,
  title={Generative Evidence Calibration for Retrieval-Augmented Generation},
  author={[Authors will be revealed upon acceptance]},
  booktitle={International Conference on Learning Representations},
  year={2026}
}
```

## License & Usage

This artifact is released under the MIT License.

**You are free to**:
- Use for any purpose (commercial or non-commercial)
- Modify and distribute
- Include in other projects
- Sell or sublicense

See LICENSE file for full terms.

## Support & Issues

This artifact is provided for **reproducibility and verification only**. We do not provide technical support, but the code is documented to help readers follow our methodology.

**Common Issues**:
- GPU memory: Use smaller models or enable 4-bit quantization
- Missing BEIR data: Download from official BEIR repository  
- Package conflicts: Use provided conda environment

---

**Artifact Statement**: This package implements the complete GEC pipeline with mock evidence generation for verification. No proprietary data, prompts, or model weights are included. All results are generated locally using public datasets and HuggingFace models.