# Failure-Driven Workflow Refinement (CE-Graph)

**Anonymous ICML submission — supplementary code artifact.**

This artifact implements **CE-Graph**, a failure-driven workflow refinement algorithm for tool-using LLM agents.
Global search over workflow graphs scored by a single scalar metric collapses rich execution traces into one number.
CE-Graph instead maintains a **counterexample pool**, models failures in a **Failure Signature Space**, identifies **dominant failure modes**,
and applies **operator-constrained graph edits** through a **propose-and-verify** loop.

## What is included
- `verl/ce_graph/`: CE-Graph implementation (workflow DAG, counterexample pool, signature extraction, mode targeting, operators, refinement loop).
- `examples/ce_graph_refine/`: a minimal runnable demo and template runner you can adapt to your benchmark.

## Quickstart (local, no external services)
```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python examples/ce_graph_refine/toy_demo.py
```

The toy demo runs CE-Graph on a tiny synthetic task to verify the control flow:
(1) collect failures → (2) cluster into modes → (3) propose edits from an operator library → (4) verify and commit.
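
The four steps can be sketched end to end on a toy problem. Everything in this sketch is hypothetical: the workflow is a plain list of step names, the `run` function fails any task whose required step is missing, and each operator simply appends one step. It does not reproduce `toy_demo.py` or the code in `verl/ce_graph/`; it only illustrates the collect, cluster, propose, and verify-and-commit control flow.

```python
from collections import Counter

# Toy tasks: each task needs one step to be present in the workflow (assumption).
TASKS = {"t1": "retrieve", "t2": "validate", "t3": "validate", "t4": "plan"}

# Hypothetical operator library: failure mode -> edit that returns a new workflow.
OPERATORS = {
    "missing:validate": lambda wf: wf + ["validate"],
    "missing:retrieve": lambda wf: wf + ["retrieve"],
}


def run(workflow):
    """Return (score, failures); each failure is a 'missing:<step>' signature string."""
    failures = [f"missing:{need}" for need in TASKS.values() if need not in workflow]
    score = 1.0 - len(failures) / len(TASKS)
    return score, failures


def refine(workflow, max_iters=5):
    score, failures = run(workflow)                      # (1) collect failures
    for _ in range(max_iters):
        if not failures:
            break
        modes = Counter(failures).most_common()          # (2) cluster into modes
        committed = False
        for mode, _count in modes:                       # most frequent mode first
            op = OPERATORS.get(mode)                     # (3) propose an edit
            if op is None:
                continue
            candidate = op(workflow)
            cand_score, cand_failures = run(candidate)   # (4) verify the edit ...
            if cand_score > score:                       # ... and commit only on improvement
                workflow, score, failures = candidate, cand_score, cand_failures
                committed = True
                break
        if not committed:
            break  # no operator addresses the remaining modes
    return workflow, score


final_wf, final_score = refine(["plan"])
print(final_wf, final_score)
```

Rejecting any candidate that does not improve the score is what keeps the loop from drifting: an edit is only kept after it has been checked against the same evaluation that produced the failures.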

## How to plug into your evaluation
Implement a `runner(workflow)` function that:
1) executes `workflow` on an evaluation set, and
2) returns `(score, traces)`, where `score` is higher-is-better and `traces` are `ExecutionTrace` records (success/failure plus distilled error info).

Then pass the runner, an initial workflow, and an operator library to `refine_workflow`:
```python
from verl.ce_graph.refine import refine_workflow

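best_wf, reports = refine_workflow(
    workflow=wf0,               # initial workflow to refine
    runner=runner,              # the evaluation function described above
    operator_lib=operator_lib,  # library of allowed graph-edit operators
    max_iters=20,               # iteration budget
)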
```
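
A runner that satisfies the contract above could be shaped like the following sketch. The `Trace` dataclass is a hypothetical stand-in for the artifact's `ExecutionTrace` (whose actual fields are not documented here), the evaluation set is made up, and the workflow is assumed to be a plain callable from input to output. Adapt all three to the real types in `verl/ce_graph/`.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Trace:
    """Hypothetical stand-in for ExecutionTrace: outcome plus distilled error info."""
    task_id: str
    success: bool
    error: Optional[str] = None


# Tiny evaluation set of (task_id, input, expected_output); purely illustrative.
EVAL_SET = [("a", 2, 4), ("b", 3, 6), ("c", "x", "xx")]


def runner(workflow: Callable):
    """Execute `workflow` on EVAL_SET and return (score, traces); higher score is better."""
    traces = []
    for task_id, x, expected in EVAL_SET:
        try:
            out = workflow(x)
            ok = out == expected
            err = None if ok else f"wrong_output: got {out!r}, expected {expected!r}"
        except Exception as exc:
            # Keep only the exception type as the distilled error info.
            ok, err = False, type(exc).__name__
        traces.append(Trace(task_id, ok, err))
    score = sum(t.success for t in traces) / len(traces)
    return score, traces


score, traces = runner(lambda x: x + x)
print(score, [t.error for t in traces])
```

Catching exceptions inside the runner, rather than letting them propagate, means a crashing workflow still yields a failed trace that can enter the counterexample pool.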

## Notes on anonymity
Top-level documentation is written for anonymous review and contains **no external links** and **no identifying information**.

