# Assessing Large Language Models in Updating Their Forecasts with New Information (Anonymous Submission Version)

This repository contains a cleaned, anonymous version of our codebase prepared for an ICLR 2026 submission. All identifying information, absolute paths, and hard-coded secrets have been removed. The code is organized into a modular Python package under `iclr/src/`, with clear separation between data processing, inference, and evaluation.

No proprietary credentials are included. Any API keys (e.g., Metaculus, Google Custom Search) must be provided by the user via environment variables or CLI arguments.

This cleanup was carried out by an automatic refactor of the codebase.

---

## Overview

- **Goal**: Measure how news affects forecasts on binary questions: attach relevant comments and news to each question, run LLM inference, and evaluate the direction of the resulting forecast change (Up/Down/Still). In particular, EvolveCast assesses whether LLMs adjust their forecasts when presented with information released after their training cutoff.
- **Design**: A modular pipeline with explicit configuration, reusable utilities, and CLI wrappers that reproduce each step without hard-coded paths or secrets.
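
As a rough illustration of the Up/Down/Still labels mentioned above, the sketch below classifies the change between two probability forecasts. The function name `label_direction` and the dead-band `threshold` are assumptions made for this example, and the 0.05 default is illustrative only; the repository's actual labelling logic may differ.

```python
def label_direction(before: float, after: float, threshold: float = 0.05) -> str:
    """Label the change between two probability forecasts.

    `threshold` is a hypothetical dead-band: moves no larger than it
    (in either direction) are treated as "Still".
    """
    delta = after - before
    if delta > threshold:
        return "Up"
    if delta < -threshold:
        return "Down"
    return "Still"


if __name__ == "__main__":
    for before, after in [(0.40, 0.55), (0.40, 0.42), (0.40, 0.20)]:
        print(before, after, label_direction(before, after))
```

A dead-band of this kind keeps small fluctuations from being counted as a directional update.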

---

## Package Layout

```
iclr/
├── requirements.txt
├── README.md  ← You are here
└── src/
    ├── __init__.py
    ├── utils/
    │   ├── __init__.py
    │   ├── config.py           # PipelineConfig + env loader (no secrets committed)
    │   ├── dates.py            # Timestamp/date helpers
    │   ├── io.py               # JSON/JSONL helpers, safe writes
    │   ├── logging_utils.py    # Logger factory
    │   └── text.py             # Text parsing primitives
    ├── data/
    │   ├── __init__.py
    │   ├── filtering.py        # Filter Metaculus questions/history
    │   ├── comments.py         # Fetch & attach comments (Metaculus API)
    │   ├── news.py             # Fetch & attach news (Google CSE)
    │   ├── analyze.py          # SBERT scoring, best news selection, early trend
    │   ├── reformat_simple.py  # Flatten analyzed dataset
    │   ├── accumulate.py       # Accumulate news timeline per question
    │   └── reformat_history.py # Attach simplified human-forecast history
    ├── inference/
    │   ├── __init__.py
    │   ├── common.py           # HF model load + text generation with logits
    │   ├── parsing.py          # Answer parsing, weighting, confidence, trends
    │   └── runners.py          # Parameterized inference runners
    ├── evaluation/
    │   ├── __init__.py
    │   ├── metrics.py          # Accuracy/PRF/confusion matrix
    │   ├── logits.py           # Evaluate multi-sample logits outputs
    │   ├── logits_single.py    # Evaluate first-sample-only using token tails
    │   └── verbalized.py       # Evaluate verbalized outputs
    └── cli/
        ├── __init__.py
        ├── data_cli.py         # Data pipeline entrypoints
        ├── inference_cli.py    # Inference entrypoints
        └── eval_cli.py         # Evaluation entrypoints
```

---

## Anonymity and Security

- This is a **cleaned, anonymous** version intended for ICLR 2026 review.
- There are **no hard-coded secrets** or absolute paths.
- Configure all credentials through environment variables (see below) or pass them as CLI arguments.

---

## Setup

1. Create a virtual environment and install dependencies:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r iclr/requirements.txt
```

2. Export environment variables as needed:

```bash
# Optional directories
export ICLR_INPUT_PATH="/path/to/input"
export ICLR_OUTPUT_PATH="/path/to/output"
export ICLR_CACHE_PATH="/path/to/cache"
export ICLR_LOG_PATH="/path/to/logs/pipeline.log"

# APIs (provide your own keys)
export ICLR_METACULUS_API_KEY="<your_metaculus_token>"  # Token format recommended by Metaculus
export ICLR_GOOGLE_API_KEY="<your_google_cse_key>"
export ICLR_GOOGLE_SEARCH_ENGINE_ID="<your_cse_id>"

# Optional: device override (e.g., "cpu" to disable GPU)
export ICLR_DEVICE="cuda"
```

The library reads these variables through `load_config_from_env()` in `iclr/src/utils/config.py`.
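
For orientation, here is a minimal sketch of how an environment-based loader of this kind could look. It is not the repository's implementation: `SketchConfig`, `load_sketch_config`, and all field names are hypothetical, and only the `ICLR_*` variable names come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SketchConfig:
    # Hypothetical field names; the real PipelineConfig may differ.
    input_path: Optional[str] = None
    output_path: Optional[str] = None
    cache_path: Optional[str] = None
    log_path: Optional[str] = None
    metaculus_api_key: Optional[str] = None
    google_api_key: Optional[str] = None
    google_search_engine_id: Optional[str] = None
    device: Optional[str] = None


def load_sketch_config(env: Optional[Mapping[str, str]] = None) -> SketchConfig:
    """Build a config from ICLR_* variables; unset variables become None."""
    env = os.environ if env is None else env
    return SketchConfig(
        input_path=env.get("ICLR_INPUT_PATH"),
        output_path=env.get("ICLR_OUTPUT_PATH"),
        cache_path=env.get("ICLR_CACHE_PATH"),
        log_path=env.get("ICLR_LOG_PATH"),
        metaculus_api_key=env.get("ICLR_METACULUS_API_KEY"),
        google_api_key=env.get("ICLR_GOOGLE_API_KEY"),
        google_search_engine_id=env.get("ICLR_GOOGLE_SEARCH_ENGINE_ID"),
        device=env.get("ICLR_DEVICE"),
    )


if __name__ == "__main__":
    cfg = load_sketch_config()
    # Report only whether a key is set, never its value.
    print("Google key configured:", cfg.google_api_key is not None)
```

Accepting an explicit mapping makes the loader easy to exercise in tests without touching the real process environment.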

---

## Data Pipeline (CLI)

- **Filter Metaculus questions**

```bash
python -m iclr.src.cli.data_cli filter \
  /path/to/raw_questions.json \
  /path/to/filtered.json \
  --min-forecasters 50 \
  --min-forecaster-count 25
```

- **Fetch and attach comments** (requires `ICLR_METACULUS_API_KEY`)

```bash
python -m iclr.src.cli.data_cli comments \
  /path/to/filtered.json \
  /path/to/comments_raw.json \
  /path/to/with_comments.json \
  --cache /path/to/comments_cache.json
```

- **Fetch and attach news** (requires `ICLR_GOOGLE_API_KEY`, `ICLR_GOOGLE_SEARCH_ENGINE_ID`)

```bash
python -m iclr.src.cli.data_cli news \
  /path/to/with_comments.json \
  /path/to/with_news.json \
  --cache /path/to/google_cache.json \
  --link-cache /path/to/link_cache.json
```

- **Analyze news** (SBERT; caches embeddings to speed up repeated runs)

```bash
python -m iclr.src.cli.data_cli analyze \
  /path/to/with_news.json \
  /path/to/analyzed.json \
  --cache /path/to/encoding_cache.json \
  --model all-MiniLM-L6-v2 \
  --lookahead-days 3 \
  --threshold 0.05
```
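
The scoring step can be pictured as ranking news items by embedding similarity to the question. The sketch below assumes cosine similarity over precomputed vectors and uses toy two-dimensional embeddings so that it runs without any model download; in the pipeline the vectors would come from the SBERT model named by `--model`. The function names and the data layout are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_best_news(
    question_vec: Sequence[float],
    news_vecs: Dict[str, Sequence[float]],
) -> Tuple[str, float]:
    """Return the (title, score) of the news item most similar to the question."""
    if not news_vecs:
        raise ValueError("no news items to score")
    scored = [
        (cosine_similarity(question_vec, vec), title)
        for title, vec in news_vecs.items()
    ]
    best_score, best_title = max(scored)
    return best_title, best_score


if __name__ == "__main__":
    # Toy embeddings standing in for real sentence embeddings.
    question = [1.0, 0.0]
    news = {"relevant article": [0.9, 0.1], "unrelated article": [0.0, 1.0]}
    print(select_best_news(question, news))
```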

- **Reformat** (flatten the analyzed dataset per history entry)

```bash
python -m iclr.src.cli.data_cli reformat \
  /path/to/analyzed.json \
  /path/to/reformatted.json
```

- **Accumulate news timeline**

```bash
python -m iclr.src.cli.data_cli accumulate \
  /path/to/reformatted.json \
  /path/to/accumulated.json
```
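
One way to think about accumulation: each dated entry of a question carries all news seen up to and including its own date. The sketch below illustrates that idea for a single question; the field names `date`, `news`, and `accumulated_news` are assumptions made for this example and may not match the repository's schema.

```python
from typing import Dict, List


def accumulate_news(entries: List[Dict]) -> List[Dict]:
    """Return copies of the entries, sorted by date, each with a running news list.

    Dates are assumed to be ISO-8601 strings (YYYY-MM-DD), which sort
    chronologically when compared as plain strings.
    """
    seen: List[str] = []
    accumulated = []
    for entry in sorted(entries, key=lambda e: e["date"]):
        seen.extend(entry.get("news", []))
        # Copy the running list so later entries do not mutate earlier ones.
        accumulated.append({**entry, "accumulated_news": list(seen)})
    return accumulated


if __name__ == "__main__":
    timeline = accumulate_news(
        [
            {"date": "2024-01-03", "news": ["second headline"]},
            {"date": "2024-01-01", "news": ["first headline"]},
        ]
    )
    for row in timeline:
        print(row["date"], row["accumulated_news"])
```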

- **Attach simplified human-forecast history from the binary-questions dump**

```bash
python -m iclr.src.cli.data_cli reformat-history \
  /path/to/accumulated.json \
  /path/to/binary_questions.json \
  /path/to/history_attached.json
```

---

## Inference (CLI)

- **Single-sample verbalized** (records raw outputs; trend can be computed downstream):

```bash
python -m iclr.src.cli.inference_cli verbalized \
  <model_name> \
  /path/to/input.json \
  /path/to/out.json \
  --cache /path/to/cached.jsonl
```

- **Multi-sample with logits** (stores per-token probabilities to support weighting):

```bash
python -m iclr.src.cli.inference_cli sampling \
  <model_name> \
  /path/to/input.json \
  /path/to/out.json \
  --cache /path/to/cached.jsonl \
  --n-samples <N>
```
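
To show how stored per-token probabilities can support weighting, the sketch below combines several sampled answers into one forecast. It assumes each sample is weighted by the geometric mean of its token probabilities; this weighting scheme and the function names are illustrative assumptions, not necessarily what `parsing.py` does.

```python
import math
from typing import List, Sequence, Tuple


def sequence_confidence(token_probs: Sequence[float]) -> float:
    """Geometric mean of token probabilities (a length-normalized likelihood)."""
    if not token_probs or any(p <= 0.0 for p in token_probs):
        return 0.0
    mean_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_log_prob)


def weighted_forecast(samples: List[Tuple[float, Sequence[float]]]) -> float:
    """Average the sampled answers, weighting each by its sequence confidence.

    Each sample is (parsed_answer, token_probabilities).
    """
    weights = [sequence_confidence(token_probs) for _, token_probs in samples]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no sample has positive confidence")
    weighted_sum = sum(w * answer for w, (answer, _) in zip(weights, samples))
    return weighted_sum / total


if __name__ == "__main__":
    samples = [(0.8, [0.9, 0.9]), (0.2, [0.1, 0.1])]
    print(round(weighted_forecast(samples), 4))
```

Length normalization keeps longer generations from being penalized simply for containing more tokens.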

- **Verbalized with accumulated history context**:

```bash
python -m iclr.src.cli.inference_cli verbalized-history \
  <model_name> \
  /path/to/input_with_history.json \
  /path/to/out.json \
  --cache /path/to/cached.jsonl
```

---

## Evaluation (CLI)

- **Evaluate multi-sample logits**

```bash
python -m iclr.src.cli.eval_cli logits "<glob_pattern_for_jsonl>"
```

- **Evaluate first-sample-only**

```bash
python -m iclr.src.cli.eval_cli logits-first "<glob_pattern_for_jsonl>"
```

- **Evaluate verbalized outputs**

```bash
python -m iclr.src.cli.eval_cli verbalized "<glob_pattern_for_jsonl>"
```
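
The metrics reported by these commands (accuracy, precision/recall/F1, confusion matrix) can be sketched over the three direction labels as follows. This is a standalone illustration with hypothetical function names; it does not reproduce `metrics.py` or the JSONL loading done by the CLI.

```python
from collections import Counter
from typing import Sequence, Tuple

LABELS = ("Up", "Down", "Still")


def confusion_counts(gold: Sequence[str], pred: Sequence[str]) -> Counter:
    """Count (gold, predicted) label pairs."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    return Counter(zip(gold, pred))


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of exact label matches; 0.0 for empty input."""
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def precision_recall_f1(
    gold: Sequence[str], pred: Sequence[str], label: str
) -> Tuple[float, float, float]:
    """One-vs-rest precision, recall, and F1 for a single label."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = ["Up", "Up", "Down", "Still"]
    pred = ["Up", "Down", "Down", "Still"]
    print("accuracy:", accuracy(gold, pred))
    for label in LABELS:
        print(label, precision_recall_f1(gold, pred, label))
```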

---

## Notes on Reproducibility

- The pipeline writes intermediate artifacts at each stage; you can re-run any stage independently as long as inputs are provided.
- GPU usage is optional; set `ICLR_DEVICE=cpu` to force CPU mode where supported.
- Google CSE results are rate-limited; caching is enabled to avoid redundant requests.
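
The caching idea in the notes above can be sketched as a small JSON-backed cache that only calls the remote service on a miss and writes the cache file atomically. The helper names and the one-file-per-cache layout are assumptions for illustration; `fetch` stands in for whatever function actually issues the request.

```python
import json
import os
import tempfile
from typing import Callable, Dict


def load_cache(path: str) -> Dict[str, object]:
    """Read the cache file, or return an empty cache if it does not exist."""
    if not os.path.exists(path):
        return {}
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


def save_cache(path: str, cache: Dict[str, object]) -> None:
    """Write to a temp file and rename it, so a crash cannot leave a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(cache, fh)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def cached_fetch(query: str, path: str, fetch: Callable[[str], object]) -> object:
    """Return the cached result for `query`, calling `fetch` only on a miss."""
    cache = load_cache(path)
    if query not in cache:
        cache[query] = fetch(query)
        save_cache(path, cache)
    return cache[query]
```

With a cache like this, re-running a stage after an interruption repeats only the queries that never completed.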

---

## Contact

This is an anonymized submission package for ICLR 2026 review. Please use it solely for reproducibility and artifact evaluation during the review process.
