# Differentiable Coherent Factuality (DCF)

This repository contains code to reproduce the main MATH dataset results from the paper.

## Requirements

- Python 3.10 (required for torchsort compatibility)
- CUDA-compatible GPU (recommended)

## Installation

```bash
pip install -r requirements.txt
```

Note: `torchsort` is compiled during installation and requires Python 3.10 with `torch` 2.0.1.
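A quick pre-install check can catch a version mismatch before the `torchsort` build fails. The sketch below is not part of the package; `check_environment` and `installed_torch_version` are hypothetical helper names, and ignoring local build tags such as `+cu118` is an assumption about what counts as a match.

```python
import sys
from importlib import metadata

REQUIRED_PYTHON = (3, 10)  # from the Requirements section
REQUIRED_TORCH = "2.0.1"   # from the installation note


def check_environment(python_version, torch_version):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if tuple(python_version[:2]) != REQUIRED_PYTHON:
        found = ".".join(str(part) for part in python_version[:2])
        problems.append(f"Python 3.10 expected, found {found}")
    if torch_version is None:
        problems.append("torch is not installed")
    # Assumption: a local build tag such as "+cu118" does not matter here.
    elif torch_version.split("+")[0] != REQUIRED_TORCH:
        problems.append(f"torch {REQUIRED_TORCH} expected, found {torch_version}")
    return problems


def installed_torch_version():
    """Look up the installed torch version without importing torch."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    issues = check_environment(sys.version_info, installed_torch_version())
    print("\n".join(issues) if issues else "Environment looks compatible.")
```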

## Repository Structure

```
submission_package/
├── src/
│   ├── differentiable_conformal_factuality.py  # Core DCF implementation
│   ├── models.py                                # Scorer architectures
│   ├── utilities.py                             # Helper functions
│   ├── debugger.py                              # Debugging utilities
│   ├── reasonining_graph_dataset.py             # Dataset loader
│   └── ablation/
│       ├── comparison_runner.py                 # Main experiment runner
│       ├── methods.py                           # All baseline implementations
│       └── base_method.py                       # Base class for methods
├── config/
│   ├── hyperparam_search_math_all_graph.json    # Hyperparameter search config
│   └── ablation/
│       └── final_comparison.json                # Main comparison config
├── data/
│   └── MATH_generated_annotations_with_nx_features.json  # MATH dataset
├── results/
│   └── math_best_optimization_results.json      # Optimized hyperparameters
└── requirements.txt
```

## Reproducing Main Results (Table 7 - MATH)

### Quick Start: Run Full Comparison

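To reproduce the main MATH results comparing DCF against the CF baseline, run the following from the repository root (the `-m src...` module path resolves relative to the current directory):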

```bash
python3.10 -m src.ablation.comparison_runner config/ablation/final_comparison.json
```

This will:
1. Load the MATH dataset (202 problems, 30 features)
2. Run 20-fold cross-validation for each method
3. Evaluate at alpha values: 0.01, 0.02, ..., 0.10
4. Output coverage and retention metrics

Expected runtime: ~30-60 minutes (depending on hardware)
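The evaluation grid described in the steps above can be sketched as follows. This is only an illustration of the sizes involved, not the runner's actual code: the shuffling, seed, and round-robin fold assignment in the hypothetical `make_folds` are assumptions, and the `1 - alpha` target reflects the usual conformal convention rather than anything read from the config.

```python
import random

N_PROBLEMS = 202  # dataset size stated above
N_FOLDS = 20      # number of cross-validation folds stated above

# Alpha grid 0.01, 0.02, ..., 0.10, built from integers to avoid float drift.
ALPHAS = [i / 100 for i in range(1, 11)]


def make_folds(n_items, n_folds, seed=0):
    """Shuffle indices and deal them round-robin into n_folds held-out folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[k::n_folds] for k in range(n_folds)]


folds = make_folds(N_PROBLEMS, N_FOLDS)
print(f"{len(folds)} folds, sizes {min(map(len, folds))}-{max(map(len, folds))}")

for alpha in ALPHAS:
    # Assumption: alpha is the tolerated error rate, so coverage targets 1 - alpha.
    print(f"alpha={alpha:.2f}  target coverage={1 - alpha:.0%}")
```

With 202 problems and 20 folds, each held-out fold contains 10 or 11 problems, so per-fold coverage estimates are coarse and the averaged figures are the ones to compare.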

### Expected Output

Results are saved to `./results/ablation/final/`, including:
- `comparison_results.json`: Full numerical results
- Coverage and retention plots (PDF/PNG)
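The layout of `comparison_results.json` is not documented here, so a reasonable first step is to print its structure before writing any analysis code. The sketch below assumes only that the file is valid JSON; `describe` is a hypothetical helper, not part of the package.

```python
import json
from pathlib import Path

RESULTS_FILE = Path("results/ablation/final/comparison_results.json")


def describe(node, depth=0, max_depth=2):
    """Return lines outlining the keys and value types of nested JSON objects."""
    lines = []
    if isinstance(node, dict) and depth < max_depth:
        for key, value in node.items():
            lines.append("  " * depth + f"{key}: {type(value).__name__}")
            lines.extend(describe(value, depth + 1, max_depth))
    return lines


if RESULTS_FILE.exists():
    with RESULTS_FILE.open() as f:
        print("\n".join(describe(json.load(f))))
else:
    print(f"{RESULTS_FILE} not found; run the comparison first.")
```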

### Key Metrics to Verify (Table 7)

| Alpha | Method | Coverage | Retention (claims) |
|-------|--------|----------|-------------------|
| 0.03  | DCF    | 96.55%   | 1.76              |
| 0.03  | CF     | 97.09%   | 0.73              |
| 0.05  | DCF    | 95.55%   | 2.31              |
| 0.05  | CF     | 94.85%   | 1.44              |
| 0.06  | DCF    | 94.14%   | 3.56              |
| 0.06  | CF     | 94.10%   | 1.74              |

**Note:** The CF baseline uses a fixed `beta_mix=0.5`. The paper reports results with per-alpha optimized `beta_mix` values, which yields slightly higher CF retention. This package reproduces the core DCF vs. CF comparison; the `beta_mix` optimization script is not included.

The runner reports retention as a percentage. To compare against the claim counts in the table, convert the percentage to a fraction and multiply by 7.3, the average number of claims per problem.
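The retention conversion can be sketched as below. It assumes the runner prints retention on a 0-100 scale; if the value is already a fraction between 0 and 1, drop the division by 100. `retention_to_claims` is a hypothetical helper name, and the input value is chosen for illustration only.

```python
AVG_CLAIMS_PER_PROBLEM = 7.3  # average claims per problem, as stated above


def retention_to_claims(retention_percent, avg_claims=AVG_CLAIMS_PER_PROBLEM):
    """Convert a retention percentage (0-100) into an average claim count."""
    return retention_percent / 100 * avg_claims


# Illustrative input, not a reported result.
print(f"{retention_to_claims(50.0):.2f} claims retained per problem")
```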

## Method Descriptions

- **DCF (Ours)**: Differentiable Coherent Factuality with learned scoring
- **CF Baseline**: Coherent Factuality with frequency-based scoring (`beta_mix=0.5`)
- **Independent**: Hashimoto et al. independent claim filtering
- **Boosted Independent**: Learned scoring without graph structure
- **XGBoost**: Standard classifier plugged into CF

## Configuration

The main config file `config/ablation/final_comparison.json` controls:
- Dataset path and features
- Alpha values to evaluate
- Method-specific hyperparameters
- Output settings

Optimized hyperparameters for DCF are stored in `results/math_best_optimization_results.json`.

## Citation

```bibtex
@inproceedings{anonymous2025dcf,
  title={Differentiable Conformal Training for LLM Reasoning Factuality},
  author={Anonymous},
  booktitle={ICML},
  year={2025}
}
```
