# CoT-Evo: Evolutionary Distillation of Chain-of-Thought for Scientific Reasoning

CoT-Evo is an evolutionary framework for Chain-of-Thought (CoT) distillation in scientific domains, particularly chemistry and biology. The system evolves better reasoning chains through crossover and mutation operations guided by fitness evaluation.

## Overview

This project implements a genetic algorithm that:
- **Evolves** Chain-of-Thought reasoning through multiple generations
- **Optimizes** both accuracy and reasoning quality
- **Supports** chemistry (ChemCoTDataset) and biology (BioProBench) tasks
- **Uses** multiple language models for different operations

## Requirements

### Python Dependencies
```bash
pip install -r requirements.txt
```

### API Access
The system requires API access to multiple language models:
- **DeepSeek API** (for reasoning generation)
- **Qwen API** (for crossover operations)  
- **OpenAI API** (for prefix/breakpoint analysis)

### Hardware Requirements
- **GPU**: A CUDA-compatible GPU, used to run the embedding model with vLLM

## Input Data Format

### Data Structure
Your input data should be in JSON format with the following structure:

```json
[
    {
        "id": "unique-identifier",
        "query": "Your scientific question here",
        "raw_cot": "",
        "struct_cot": "structured correct answer",
        "task": "reaction",
        "subtask": "fs",
        "meta": "{\"rxn_cls\": \"reaction type\", ...}",
        "seed_pairs": [
            [
                "<think>Initial reasoning chain</think>\n\nFinal prediction",
                "model-name"
            ]
        ]
    }
]
```
Sample data for both datasets, ChemCoTDataset and BioProBench, is provided in the `data` folder and can be used directly.
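
Before launching a run, it can help to check that a custom input file matches the structure above. The following is a minimal sketch, not part of the CoT-Evo code base: the helper names `validate_record` and `validate_file` are hypothetical, and treating every field from the example as required is an assumption.

```python
import json

# Field names taken from the example record above; treating all of them as
# required is an assumption of this sketch.
REQUIRED_FIELDS = (
    "id", "query", "raw_cot", "struct_cot",
    "task", "subtask", "meta", "seed_pairs",
)


def validate_record(record):
    """Return a list of problems found in one input record (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    # "meta" is shown as a JSON-encoded string, so it should parse on its own.
    meta = record.get("meta")
    if isinstance(meta, str):
        try:
            json.loads(meta)
        except json.JSONDecodeError:
            problems.append("meta is not valid JSON")
    # Each seed pair is a two-element list: [reasoning + prediction, model name].
    for i, pair in enumerate(record.get("seed_pairs", [])):
        if not (isinstance(pair, (list, tuple)) and len(pair) == 2):
            problems.append(f"seed_pairs[{i}] is not a [text, model] pair")
    return problems


def validate_file(path):
    """Load a JSON file holding a list of records and map record ids to problems."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    report = {}
    for index, record in enumerate(records):
        problems = validate_record(record)
        if problems:
            report[record.get("id", f"#{index}")] = problems
    return report
```

An empty dictionary returned by `validate_file` means no structural problems were detected; it does not say anything about the scientific content of the records.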

### Task Types Supported
- **ChemCoTDataset**: Chemistry reaction prediction, molecule editing, optimization
- **BioProBench**: Biological protocol reasoning

### Seed Pairs Format
Each seed pair contains:
- **Thought chain**: Enclosed in `<think>...</think>` tags
- **Prediction**: Final answer after the thought chain
- **Model name**: Source model identifier (e.g., "qwen3-32b")
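
To make the seed pair layout concrete, the sketch below splits one pair into its three parts. It assumes the text always starts with a single `<think>...</think>` block followed by the prediction, as in the example record; `split_seed_pair` is a hypothetical helper, not a function from this repository.

```python
import re

# Matches a leading <think>...</think> block followed by the prediction text.
THINK_PATTERN = re.compile(r"^\s*<think>(.*?)</think>\s*(.*)$", re.DOTALL)


def split_seed_pair(seed_pair):
    """Split one [text, model_name] seed pair into thought, prediction, and model."""
    text, model_name = seed_pair
    match = THINK_PATTERN.match(text)
    if match is None:
        raise ValueError("seed text has no <think>...</think> block")
    return {
        "thought": match.group(1).strip(),
        "prediction": match.group(2).strip(),
        "model": model_name,
    }
```

For example, `split_seed_pair(["<think>Initial reasoning chain</think>\n\nFinal prediction", "qwen3-32b"])` yields the thought `"Initial reasoning chain"`, the prediction `"Final prediction"`, and the model `"qwen3-32b"`.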

## Quick Start

### 1. Prepare Environment Variables
Set up your API credentials:
```bash
export DEEPSEEK_API_BASE="your-deepseek-api-base"
export DEEPSEEK_API_KEY="your-deepseek-api-key"
export QWEN_API_BASE="your-qwen-api-base" 
export QWEN_API_KEY="your-qwen-api-key"
export OPENAI_API_BASE="your-openai-api-base"
export OPENAI_API_KEY="your-openai-api-key"
```
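
A quick way to confirm that all six variables are set before starting a long run is a small check like the one below. This is an optional sketch; `missing_credentials` is a hypothetical helper and only the variable names come from the commands above.

```python
import os

# Variable names from the export commands above.
REQUIRED_VARS = [
    "DEEPSEEK_API_BASE", "DEEPSEEK_API_KEY",
    "QWEN_API_BASE", "QWEN_API_KEY",
    "OPENAI_API_BASE", "OPENAI_API_KEY",
]


def missing_credentials(environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_credentials()` with no arguments inspects the current process environment and returns an empty list when everything is configured.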

### 2. Prepare Your Data
Place your data in the `/data` directory. Example structure:
```
/data/
├── ChemCoTDataset/
│   └── your-data.json
└── BioProBench/
    └── your-data.json
```

### 3. Run the Genetic Algorithm
```bash
cd ./CoT-Evo

python GA.py \
    --task_name ChemCoTDataset \
    --data_path /data/ChemCoTDataset/your-data.json \
    --save_path ./results/output.jsonl \
    --embedder_model_path /data/path-to-embedding-model \
    --deepseek_tokenizer_path /data/path-to-deepseek-tokenizer \
    --qwen_tokenizer_path /data/path-to-qwen-tokenizer \
    --batch_size 1 \
    --deepseek_api_base "$DEEPSEEK_API_BASE" \
    --deepseek_api_key "$DEEPSEEK_API_KEY" \
    --qwen_api_base "$QWEN_API_BASE" \
    --qwen_api_key "$QWEN_API_KEY" \
    --openai_api_base "$OPENAI_API_BASE" \
    --openai_api_key "$OPENAI_API_KEY"
```
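
If you prefer to launch runs from Python, for example to loop over several input files, the same command can be assembled programmatically. The sketch below only builds the argument list from the flags shown above and the environment variables from step 1; `build_ga_command` is a hypothetical helper, not part of the repository.

```python
import os


def build_ga_command(task_name, data_path, save_path, embedder_model_path,
                     deepseek_tokenizer_path, qwen_tokenizer_path,
                     batch_size=1, environ=None):
    """Assemble the argument list for GA.py using the flags shown above."""
    env = os.environ if environ is None else environ
    command = [
        "python", "GA.py",
        "--task_name", task_name,
        "--data_path", data_path,
        "--save_path", save_path,
        "--embedder_model_path", embedder_model_path,
        "--deepseek_tokenizer_path", deepseek_tokenizer_path,
        "--qwen_tokenizer_path", qwen_tokenizer_path,
        "--batch_size", str(batch_size),
    ]
    # API endpoints and keys come from the environment variables set in step 1;
    # a missing variable raises KeyError here rather than failing inside the run.
    for provider in ("deepseek", "qwen", "openai"):
        for suffix in ("api_base", "api_key"):
            flag = f"{provider}_{suffix}"
            command += [f"--{flag}", env[flag.upper()]]
    return command
```

The resulting list can be passed to `subprocess.run(command, cwd="./CoT-Evo", check=True)`, which mirrors the `cd ./CoT-Evo` step of the shell version.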

### 4. Parameters Explained
- **task_name**: Choose "ChemCoTDataset" or "BioProBench".
- **data_path**: Path to input JSON file.
- **save_path**: Output path for results (JSONL format).
- **embedder_model_path**: Path to vLLM embedding model.
- **deepseek_tokenizer_path**: Path to DeepSeek tokenizer.
- **qwen_tokenizer_path**: Path to Qwen tokenizer.
- **batch_size**: Number of samples processed at once (default: 1).
- **deepseek_api_base**: DeepSeek API base URL.
- **deepseek_api_key**: DeepSeek API key.
- **qwen_api_base**: Qwen API base URL.
- **qwen_api_key**: Qwen API key.
- **openai_api_base**: OpenAI API base URL.
- **openai_api_key**: OpenAI API key.
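
Since `save_path` is written in JSONL format, results can be read back one line at a time. The sketch below assumes only that each non-empty line holds one JSON object; the fields inside each object are not documented here, so inspect a record before relying on specific keys. `read_jsonl` is a hypothetical helper.

```python
import json


def read_jsonl(path):
    """Yield one parsed object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
```

For instance, `records = list(read_jsonl("./results/output.jsonl"))` loads every result from the Quick Start run into memory.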