<div align="center">

# CKM-HypoGen

**Continuous Knowledge Metabolism for predictive scientific hypothesis generation**

CKM turns a continuously growing scientific literature into an evolving knowledge state, then proposes structured hypotheses that can be checked against papers published later.

[中文](./README.zh.md) · [Benchmark](./benchmark/) · [Quick Start](#quick-start) · [Workflow](#workflow) · [Documentation](#documentation)

</div>

<p align="center">
  <a href="figures/ckm_framework.pdf">
    <img src="figures/ckm_framework.png" alt="CKM workflow" width="94%">
  </a>
  <br>
  <sub><a href="figures/ckm_framework.pdf">Open the framework figure as PDF</a></sub>
</p>

| 50 topics | ~9k frozen papers | 5.8% future hit rate | 72% topic coverage | 460-day lead |
|---:|---:|---:|---:|---:|
| ML benchmark | arXiv IDs across temporal phases | CKM-Lite | topics with at least one validated hypothesis | RLHF stability case |

## Overview

Most LLM-based discovery workflows are judged immediately, often by another model. CKM is built around a stricter loop: generate hypotheses from literature available up to a time window, then evaluate them against papers that appear after that window.
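
The temporal split behind this loop can be illustrated with a minimal Python sketch. It is not the benchmark's implementation: the `Paper` class, its field names, and `split_by_cutoff` are hypothetical, and the IDs and dates are toy placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    # Hypothetical record; field names are assumptions for illustration.
    paper_id: str
    published: date


def split_by_cutoff(papers: list[Paper], cutoff: date) -> tuple[list[Paper], list[Paper]]:
    """Separate papers visible at generation time from those held back for validation."""
    visible = [p for p in papers if p.published <= cutoff]
    future = [p for p in papers if p.published > cutoff]
    return visible, future


# Toy data: placeholder IDs, not real arXiv entries.
papers = [
    Paper("a", date(2023, 1, 10)),
    Paper("b", date(2023, 6, 1)),
    Paper("c", date(2024, 2, 15)),
]
visible, future = split_by_cutoff(papers, cutoff=date(2023, 6, 1))
print([p.paper_id for p in visible])  # ['a', 'b']
print([p.paper_id for p in future])   # ['c']
```

Hypotheses would be generated from `visible` only, and `future` would serve as the pool that a judge checks them against.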

This repository contains both the runnable workflow and the benchmark used to test it.

| Component | What it provides | Where |
|---|---|---|
| **Agent Runtime plugin** | Skills for collecting papers, updating a knowledge state, generating hypotheses, planning follow-up work, and scheduling recurring research tracking | [`src/`](./src/), [`skills/`](./skills/), [`index.ts`](./index.ts) |
| **CKM Benchmark** | A reproducible predictive-hypothesis benchmark with frozen topics, frozen paper pools, cross-provider judging, and reference summaries | [`benchmark/`](./benchmark/) |
| **Evaluation harness** | Research code for running CKM-Lite / Full / Batch variants and storing detailed per-run reports | [`ckm-eval/`](./ckm-eval/) |

## Quick Start

Use the workflow as an Agent Runtime plugin:

```bash
agent-runtime plugins install ckm-hypogen
agent-runtime gateway
```

Then invoke the research skills inside the chat UI:

```text
/research-collect "long-context modeling"        # build the baseline knowledge state
/metabolism                                       # ingest a new window and emit hypotheses
/research-plan                                    # turn a selected hypothesis into a plan
```

Develop from source:

```bash
git clone https://github.com/<your-fork>/ckm-hypogen.git
cd ckm-hypogen
pnpm install
pnpm build
agent-runtime plugins install -l ./
agent-runtime gateway
```

## Benchmark

Reproduce the published leaderboard summaries without API calls:

```bash
cd benchmark
pip install -r requirements.txt
python -m ckm_benchmark.recompute --summary results/lite_summary.json results/batch_summary.json results/full_summary.json
```

Re-judge a new system submission with the released validation pool:

```bash
cp .env.example .env  # then set OPENAI_API_KEY in .env
python -m ckm_benchmark.rejudge --hypotheses path/to/yours/ --validation-pool data/validation_pool.json --output results/yours.json
```

See [`benchmark/README.md`](./benchmark/README.md) for the problem statement, submission format, evaluation protocol, and leaderboard.
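
As a rough guide to reading the headline numbers, the sketch below computes a hit rate and a topic coverage from toy judgments. The topic-coverage definition follows the summary table at the top of this README (topics with at least one validated hypothesis). The hit-rate definition (validated hypotheses divided by all hypotheses) is an assumption, as are the function name `summarize` and the input shape; the authoritative definitions are in the benchmark documentation.

```python
from collections import defaultdict


def summarize(judgments: list[tuple[str, bool]]) -> dict[str, float]:
    """Aggregate (topic, validated) pairs, one pair per judged hypothesis."""
    if not judgments:
        raise ValueError("no judgments to summarize")
    by_topic: dict[str, list[bool]] = defaultdict(list)
    for topic, validated in judgments:
        by_topic[topic].append(validated)
    hits = sum(1 for _, validated in judgments if validated)
    covered = sum(1 for flags in by_topic.values() if any(flags))
    return {
        # Assumed definition: share of hypotheses validated by later papers.
        "hit_rate": hits / len(judgments),
        # Share of topics with at least one validated hypothesis.
        "topic_coverage": covered / len(by_topic),
    }


# Toy judgments, not benchmark data.
toy = [("t1", True), ("t1", False), ("t2", False), ("t2", False), ("t3", True)]
result = summarize(toy)
print(f"{result['hit_rate']:.2f} {result['topic_coverage']:.2f}")  # 0.40 0.67
```

The two numbers can diverge widely: a low per-hypothesis hit rate is compatible with high topic coverage when validated hypotheses are spread across many topics.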

## Workflow

CKM runs as a sliding-window research workflow. Each window adds new papers, updates the current knowledge state, and generates hypotheses grounded in the accumulated state.

| Stage | Skill | Output |
|---|---|---|
| **1. Initialize** | `/research-collect` | A Markdown knowledge state $\mathcal{K}_0$ built from up to 48 historical papers |
| **2. Update** | `/metabolism` | A revised $\mathcal{K}_t$ with new claims, supporting evidence, contradictions, and cross-window changes |
| **3. Generate** | `/metabolism` | Structured hypotheses with source citations, novelty checks, feasibility notes, and self-assessment |

The workflow is tool-augmented: the model consumes deterministic tool outputs from arXiv, OpenAlex, and Unpaywall rather than browsing the web freely.
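
The control flow of the three stages can be sketched in Python as follows. This shows only the loop structure under stated assumptions: `run_windows`, `toy_update`, and `toy_generate` are hypothetical names, and the toy functions stand in for the LLM-backed `/metabolism` skill rather than reproducing it.

```python
from typing import Callable, Iterable

State = str  # the knowledge state is a Markdown document


def run_windows(
    k0: State,
    windows: Iterable[list[str]],
    update: Callable[[State, list[str]], State],
    generate: Callable[[State], list[str]],
) -> list[tuple[State, list[str]]]:
    """For each window, revise the state, then generate hypotheses from it."""
    history = []
    state = k0
    for papers in windows:
        state = update(state, papers)                # K_{t-1} -> K_t
        history.append((state, generate(state)))     # hypotheses grounded in K_t
    return history


def toy_update(state: State, papers: list[str]) -> State:
    # Stand-in for the LLM update step: append one claim line per paper.
    return state + "".join(f"- claim from {p}\n" for p in papers)


def toy_generate(state: State) -> list[str]:
    # Stand-in for hypothesis generation: report how much state has accumulated.
    n = state.count("- claim from")
    return [f"hypothesis drawing on {n} accumulated claims"]


history = run_windows("# K_0\n", [["p1", "p2"], ["p3"]], toy_update, toy_generate)
for t, (_, hypotheses) in enumerate(history, start=1):
    print(t, hypotheses)
# 1 ['hypothesis drawing on 2 accumulated claims']
# 2 ['hypothesis drawing on 3 accumulated claims']
```

The point of the structure is that each window's hypotheses see the whole accumulated state, not only the papers added in that window.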

<details>
<summary><strong>Available skills and commands</strong></summary>

| Skill | Purpose |
|---|---|
| `/research-collect` | Build the baseline knowledge state |
| `/metabolism-init` | Initialize a fresh metabolism workspace for a topic |
| `/metabolism` | Ingest new papers and emit hypotheses |
| `/idea-generation` | Generate research ideas grounded in recent papers |
| `/research-survey` | Produce a multi-paper survey |
| `/research-plan` | Turn a promising hypothesis into an implementation plan |
| `/research-implement` | Implement a plan and run a short validation |
| `/research-review` | Review and revise a research artefact |
| `/research-experiment` | Run full training and ablation experiments |
| `/research-pipeline` | Chain skills via `sessions_spawn` |
| `/research-subscription` | Schedule daily or weekly research-tracking jobs |

| Command | Purpose |
|---|---|
| `/research-status` | Show the active research workspace |
| `/papers` | List collected papers for the current project |
| `/ideas` | List generated research ideas |
| `/projects` | List research projects |
| `/project-switch <id>` | Switch active project |
| `/research-subscriptions` | Show scheduled research jobs |
| `/research-unsubscribe [job-id]` | Remove a scheduled job |

</details>

## Repository Layout

```text
.
├── benchmark/                # CKM Benchmark v0.1
│   ├── ckm_benchmark/        # Python package: judge, rejudge, recompute
│   ├── data/                 # Frozen topics, arXiv IDs, validation pool
│   ├── docs/                 # Problem statement, evaluation, leaderboard
│   ├── results/              # Reference summaries
│   └── README.md             # Benchmark documentation
├── ckm-eval/                 # Research evaluation harness
│   ├── core/                 # Engines, judge, store
│   ├── tools/                # Search, full text, metrics, visualization
│   ├── scripts/              # Lite / Full / Batch / ablation runners
│   └── results/              # Per-run logs, gitignored
├── skills/                   # Agent Runtime skill prompts
├── src/                      # TypeScript plugin source
├── docs/                     # Architecture and operation notes
├── figures/                  # Framework diagram
├── index.ts                  # Plugin entrypoint
├── agent-runtime.plugin.json # Plugin manifest
└── package.json              # npm package metadata
```

Local-only directories such as `analysis/`, `experiments/`, `paper/`, `reference/`, `ckm-eval/results/`, and caches are gitignored.

## Documentation

| Area | Links |
|---|---|
| Benchmark | [`benchmark/README.md`](./benchmark/README.md), [`PROBLEM_STATEMENT.md`](./benchmark/docs/PROBLEM_STATEMENT.md), [`EVALUATION.md`](./benchmark/docs/EVALUATION.md), [`LEADERBOARD.md`](./benchmark/docs/LEADERBOARD.md) |
| Plugin | [`ARCHITECTURE.md`](./docs/ARCHITECTURE.md), [`AGENT_RUNTIME_CONFIG.md`](./docs/AGENT_RUNTIME_CONFIG.md), [`PLUGIN_CAPABILITIES.md`](./docs/PLUGIN_CAPABILITIES.md) |
| Evaluation harness | [`EXPERIMENTS.md`](./ckm-eval/scripts/EXPERIMENTS.md) |

## Citation

```bibtex
@inproceedings{ckm2026,
  title     = {Continuous Knowledge Metabolism: Generating Scientific Hypotheses from Evolving Literature},
  author    = {[Anonymous, AI4Research workshop submission]},
  booktitle = {AI4Research Workshop at ICML 2026},
  year      = {2026}
}
```

## License

MIT. The benchmark code is covered by [`benchmark/LICENSE`](./benchmark/LICENSE); the Agent Runtime plugin source is also MIT-licensed.
