# Creative Reasoning Project

A comprehensive framework for executing and evaluating creative reasoning workflows with intelligent LLM-based assessment and multi-solution support.

## Overview

This project provides an advanced implementation for creative reasoning workflows, featuring:
- **Dynamic task configuration loading** with structured data models
- **LLM-powered solution extraction** from raw text outputs
- **Intelligent evaluation system** using Gemini models for comprehensive scoring
- **Multi-solution support** with individual tracking and assessment
- **Modular algorithm architecture** with pluggable reasoning models
- **LLM API Client** for multiple providers (OpenAI, Gemini, Claude, DeepSeek)
- **Enhanced evaluation metrics** including feasibility, utility, novelty, and creativity
- **Automated results logging** with solution identification and tracking
- **Robust error handling** and graceful failure modes

## Project Structure

```
creative_reasoning/
├── src/
│   ├── main.py                 # Enhanced workflow orchestration
│   ├── data_models/            # Structured data models
│   │   ├── __init__.py
│   │   ├── task_config.py      # TaskConfig with checkpoints and solutions
│   │   └── evaluation_result.py # EvaluationResult with scoring data
│   ├── algorithms/
│   │   ├── commercialized_reasoning_model/  # Commercialized reasoning model with dynamic LLM support
│   │   │   ├── __init__.py
│   │   │   └── main.py         # reasoning_model class with dynamic LLM integration
│   │   ├── chain_of_thoughts/  # Chain of thoughts algorithm
│   │   │   ├── __init__.py
│   │   │   └── main.py         # Chain of thoughts reasoning model
│   │   ├── tot/                # Tree of thoughts algorithm
│   │   │   ├── __init__.py
│   │   │   └── main.py         # Tree of thoughts reasoning model
│   │   ├── egot/               # Enhanced Graph of Thoughts algorithm
│   │   │   ├── __init__.py
│   │   │   └── main.py         # EGoT reasoning model with nested graph traversal
│   │   ├── combinational_creative_reasoning/  # Combinational reasoning algorithm
│   │   │   ├── __init__.py
│   │   │   └── main.py         # Combinational reasoning model
│   │   ├── exploratory_creative_reasoning/    # Exploratory reasoning algorithm
│   │   │   ├── __init__.py
│   │   │   └── main.py         # Exploratory reasoning model
│   │   └── transformative_creative_reasoning/ # Transformative reasoning algorithm
│   │       ├── __init__.py
│   │       └── main.py         # Transformative reasoning model
│   ├── evaluators/
│   │   ├── __init__.py
│   │   └── run_evaluation.py   # Enhanced evaluation with LLM scoring
│   ├── feasibility_check_points/ # Task-specific feasibility criteria
│   │   ├── bridge.txt          # Feasibility check points for bridge task
│   │   ├── electricity.txt     # Feasibility check points for electricity task
│   │   └── society.txt         # Feasibility check points for society task
│   ├── known_solutions/        # Known solutions for novelty comparison
│   │   ├── bridge.txt          # Known solutions for bridge task
│   │   ├── electricity.txt     # Known solutions for electricity task
│   │   └── society.txt         # Known solutions for society task
│   ├── known_solutions_concept/ # Known solution concepts for evaluation
│   │   ├── bridge.txt          # Known solution concepts for bridge task
│   │   ├── electricity.txt     # Known solution concepts for electricity task
│   │   └── society.txt         # Known solution concepts for society task
│   ├── calibration_anchors/    # Calibration anchors for evaluation
│   │   ├── bridge.txt          # Calibration anchors for bridge task
│   │   ├── electricity.txt     # Calibration anchors for electricity task
│   │   └── society.txt         # Calibration anchors for society task
│   ├── optimal_solutions/      # Optimal solutions for evaluation
│   │   ├── bridge.csv          # Optimal solutions for bridge task
│   │   ├── electricity.csv     # Optimal solutions for electricity task
│   │   └── society.csv         # Optimal solutions for society task
│   ├── utils/
│   │   ├── __init__.py
│   │   ├── llm_api_client.py   # Enhanced LLM client with embedding support
│   │   ├── llm_response_parser.py # LLM response parsing utilities
│   │   ├── mock_llm_client.py  # Mock LLM client for testing
│   │   └── results_logger.py   # Enhanced CSV logging with solution IDs
│   └── tasks/
│       ├── bridge.txt          # Bridge traffic policy task
│       ├── electricity.txt     # Electricity tariff and DR program task
│       └── society.txt         # Community social cohesion intervention task
├── tests/                      # Comprehensive test suite
│   ├── unit/                   # Unit tests for all components
│   │   ├── test_data_models.py
│   │   ├── test_llm_api_client.py
│   │   ├── test_run_evaluation.py
│   │   ├── test_commercialized_reasoning_model_algorithm.py
│   │   ├── test_combinational_creative_reasoning.py
│   │   ├── test_exploratory_creative_reasoning.py
│   │   ├── test_egot_algorithm.py
│   │   ├── test_tot_algorithm.py
│   │   ├── test_chain_of_thoughts_algorithm.py
│   │   ├── test_main.py
│   │   ├── test_main_args.py
│   │   ├── test_main_evaluation_only.py
│   │   ├── test_main_loading_utils.py
│   │   └── test_results_logger.py
│   └── integration/            # Integration tests
│       ├── test_main_integration.py
│       ├── test_commercialized_reasoning_model_integration.py
│       ├── test_egot_workflow.py
│       └── test_evaluation_only_integration.py
├── results/                     # Generated results (auto-created)
│   ├── results.csv             # Enhanced results log with solution IDs
│   └── final_results/          # Final analysis results
│       ├── results_sensitivity_analysis_num_analogous_problem.csv
│       └── results_sensitivity_analysis_num_solution.csv
├── requirements.txt             # Python dependencies
├── pytest.ini                  # Pytest configuration
├── env.template                # Environment variables template
├── run_all_algorithms.sh       # Run all algorithms script
├── run_all_algorithms_bridge_claude.sh # Bridge task with Claude script
├── run_all_algorithms_electricity_gpt4o.sh # Electricity task with GPT-4o script
├── run_all_algorithms_society_gpt4o.sh # Society task with GPT-4o script
├── run_all_algorithms_sensitivity_analysis.sh # Sensitivity analysis script
└── LICENSE                     # Project license
```

## Quick Start

### Prerequisites

- Python 3.7+
- **Required dependencies: `openai`, `google-generativeai`, `anthropic`, `requests`, `python-dotenv`, `numpy`, `pandas`, `pydantic`**
- **API keys for desired LLM providers (GEMINI_API_KEY required for evaluation)**

### Environment Setup

1. Copy `env.template` to `.env`
2. Fill in your API keys:
   ```bash
   # Required for evaluation system
   GEMINI_API_KEY=your_gemini_api_key_here
   
   # Required when the backbone LLM is an OpenAI model (e.g., gpt-4o, o1)
   OPENAI_API_KEY=your_openai_api_key_here
   
   # Optional for other providers
   CLAUDE_API_KEY=your_claude_api_key_here
   DEEPSEEK_API_KEY=your_deepseek_api_key_here
   ```
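
Before a long run it can help to confirm which keys are actually visible to the process. The following is a minimal sketch, not part of the project: the helper name `missing_keys` and the two key groups are assumptions, and it only inspects the process environment, so the `.env` values must already be loaded (for example by `python-dotenv`).

```python
import os

# Key names are taken from the .env example above.
# The grouping into two lists is an assumption made for this sketch.
EVALUATION_KEYS = ["GEMINI_API_KEY"]
PROVIDER_KEYS = ["OPENAI_API_KEY", "CLAUDE_API_KEY", "DEEPSEEK_API_KEY"]


def missing_keys(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Report what is missing without failing, since provider keys are optional.
print("Missing evaluation keys:", missing_keys(EVALUATION_KEYS))
print("Missing provider keys:", missing_keys(PROVIDER_KEYS))
```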

### Running a Workflow

#### Basic Usage
```bash
python -m src.main --task-name bridge --algorithm-name commercialized_reasoning_model --backbone-llm-name deepseek-reasoner
```

#### Advanced Algorithms
```bash
# Chain of Thoughts
python -m src.main --task-name bridge --algorithm-name chain_of_thoughts --backbone-llm-name gpt-4

# Tree of Thoughts
python -m src.main --task-name bridge --algorithm-name tot --backbone-llm-name gpt-4 --num-thoughts-per-step 5 --search-depth 3

# Enhanced Graph of Thoughts (EGoT) - NEW!
python -m src.main --task-name bridge --algorithm-name egot --backbone-llm-name gpt-4 --num-thoughts-per-step 5 --search-depth 3

# Combinational Creative Reasoning
python -m src.main --task-name bridge --algorithm-name combinational_creative_reasoning --backbone-llm-name gpt-4 --num-analogous-problems 10 --num-solutions-per-problem 5 --num-final-solutions 3 --num-solutions-combinational 20

# Exploratory Creative Reasoning
python -m src.main --task-name bridge --algorithm-name exploratory_creative_reasoning --backbone-llm-name gpt-4 --num-exploratory-ideas 50 --num-analogous-problems 10 --num-solutions-per-problem 5 --num-final-solutions 3 --num-solutions-combinational 20

# Transformative Creative Reasoning
python -m src.main --task-name bridge --algorithm-name transformative_creative_reasoning --backbone-llm-name gpt-4 --num-new-rule-sets 3 --num-analogous-problems 10 --num-solutions-per-problem 5 --num-exploratory-ideas 50 --num-final-solutions 3 --num-solutions-combinational 20
```

This will:
1. Load task configuration from multiple structured files
2. Execute the algorithm to generate solutions
3. **Extract individual solutions using LLM**
4. **Score each solution for feasibility, utility, and novelty**
5. **Calculate overall creativity scores**
6. **Log results with unique solution identifiers**

#### Re-evaluation Mode

Re-evaluate existing solutions in `results.csv` without running the full workflow:

```bash
python -m src.main --rerun-evaluation
```

This mode will:
- Create a timestamped backup of `results.csv` (e.g., `results_before_evaluation_2025-01-15_14-30-45.csv`)
- Load existing solutions from `results.csv`
- Dynamically load `TaskConfig` components for each solution's task
- Re-run evaluation using the existing `run_id` from intermediate log filenames
- Update all evaluation fields (scores, reasoning, themes) with new results
- Save updated results back to `results.csv`
- Generate new evaluation-specific intermediate log files
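
The backup step can be pictured with the short sketch below. It reproduces the filename pattern from the example above (`results_before_evaluation_2025-01-15_14-30-45.csv`); the function names `backup_path` and `backup_results` are hypothetical and may not match the project's own code.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_path(results_csv: Path, now: datetime) -> Path:
    """Build the timestamped backup name next to the original file."""
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    name = f"{results_csv.stem}_before_evaluation_{stamp}{results_csv.suffix}"
    return results_csv.with_name(name)


def backup_results(results_csv: Path) -> Path:
    """Copy results.csv to its backup location and return the new path."""
    target = backup_path(results_csv, datetime.now())
    shutil.copy2(results_csv, target)  # copy2 also preserves file timestamps
    return target
```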

**Use cases:**
- Re-evaluate solutions with updated evaluation criteria
- Re-run evaluation after fixing bugs in the evaluation system
- Update scores when new calibration anchors or optimal solutions are added
- Refresh evaluation data without regenerating solutions

### Command Line Arguments

- `--task-name`: Name of the task (loads from `src/tasks/`, `src/feasibility_check_points/`, and `src/known_solutions/`)
- `--algorithm-name`: Name of the algorithm to use (must exist in `src/algorithms/`)
- `--backbone-llm-name`: Name of the backbone LLM (for algorithm context)
- `--num-analogous-problems`: Number of analogous problems to find (default: 10)
- `--num-solutions-per-problem`: Number of solutions per analogous problem (default: 5)
- `--num-exploratory-ideas`: Number of exploratory ideas to generate for exploratory creative reasoning (default: 50)
- `--num-new-rule-sets`: Number of new rule sets to generate for transformative creative reasoning (default: 3)
- `--num-final-solutions`: Number of final solutions to generate or select (default: 3)
- `--num-solutions-combinational`: Number of new solutions to synthesize by the Combinational Creative Reasoning algorithm (default: 20)
- `--num-thoughts-per-step`: Number of thoughts per step (for tot and egot algorithms) (default: 5)
- `--search-depth`: Search depth (for tot and egot algorithms) (default: 3)
- `--rerun-evaluation`: Re-evaluate existing solutions in results.csv without running the full workflow
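
For reference, the flags and defaults listed above correspond to an `argparse` declaration along these lines. This is a sketch reconstructed from the list, not a copy of `src/main.py`; the function name `build_parser` and the help-free layout are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags with their documented defaults."""
    parser = argparse.ArgumentParser(prog="src.main")
    parser.add_argument("--task-name")
    parser.add_argument("--algorithm-name")
    parser.add_argument("--backbone-llm-name")
    parser.add_argument("--num-analogous-problems", type=int, default=10)
    parser.add_argument("--num-solutions-per-problem", type=int, default=5)
    parser.add_argument("--num-exploratory-ideas", type=int, default=50)
    parser.add_argument("--num-new-rule-sets", type=int, default=3)
    parser.add_argument("--num-final-solutions", type=int, default=3)
    parser.add_argument("--num-solutions-combinational", type=int, default=20)
    parser.add_argument("--num-thoughts-per-step", type=int, default=5)
    parser.add_argument("--search-depth", type=int, default=3)
    # Boolean switch: present means re-evaluate, absent means full workflow.
    parser.add_argument("--rerun-evaluation", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--task-name", "bridge", "--algorithm-name", "tot", "--backbone-llm-name", "gpt-4"]
)
print(args.num_thoughts_per_step, args.search_depth)
```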

## Available Tasks

The framework includes three comprehensive tasks for creative reasoning evaluation:

### 1. Bridge Task (`bridge`)
**One-Lane Bridge Traffic Policy** - Design a mobility policy that minimizes average delay for vehicles crossing a single-lane bridge with fixed physical parameters and safety constraints.

### 2. Electricity Task (`electricity`) 
**Residential Electricity Tariff and DR Program** - Design a comprehensive electricity tariff and Demand Response program that reduces peak load on residential feeder infrastructure while maintaining critical medical load safety.

### 3. Society Task (`society`)
**Community Social Cohesion Intervention** - Design a multi-faceted intervention package that strengthens cross-group cohesion through measurable improvements in social integration metrics.

Each task includes:
- Detailed context and objectives
- Hard constraints that must be respected
- Specific output format requirements
- Associated feasibility check points, known solutions, calibration anchors, and optimal solutions

## Available Algorithms

The framework supports multiple creative reasoning algorithms, each with different approaches to problem-solving:

### 1. Commercialized Reasoning Model (`commercialized_reasoning_model`)
**Dynamic LLM-based reasoning** that supports multiple backbone models for flexible problem solving.

```bash
python -m src.main --task-name bridge --algorithm-name commercialized_reasoning_model --backbone-llm-name gpt-4o
```

**Features:**
- **Dynamic LLM selection** - supports any backbone LLM (GPT-4o, Claude, DeepSeek, o1, etc.)
- **Configurable temperature** - uses 0.7 for generative tasks (automatically omitted for o1 and GPT-5 models)
- **Detailed logging** - captures comprehensive LLM call information
- **Flexible deployment** - can adapt to different LLM providers and models
- Good for baseline comparisons and production deployments

**Supported Backbone LLMs:**
```bash
# OpenAI models
python -m src.main --task-name bridge --algorithm-name commercialized_reasoning_model --backbone-llm-name gpt-4o
python -m src.main --task-name bridge --algorithm-name commercialized_reasoning_model --backbone-llm-name gpt-5
python -m src.main --task-name bridge --algorithm-name commercialized_reasoning_model --backbone-llm-name o1

# Claude models
python -m src.main --task-name bridge --algorithm-name commercialized_reasoning_model --backbone-llm-name claude-3.5-sonnet

# DeepSeek models
python -m src.main --task-name bridge --algorithm-name commercialized_reasoning_model --backbone-llm-name deepseek-reasoner
```

**Temperature Handling:**
- **Standard models** (GPT-4o, Claude, DeepSeek): Uses `temperature=0.7` for creative generation
- **o1 and GPT-5 models**: Automatically omits temperature parameter (not supported by these models)
- **Dynamic adaptation**: The system automatically handles model-specific requirements
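
The temperature rule above can be sketched as a small helper that builds the request parameters. The prefix-based matching and the name `generation_kwargs` are assumptions for illustration; the project may detect these models differently.

```python
# Hypothetical prefixes for models that do not accept a temperature parameter.
NO_TEMPERATURE_PREFIXES = ("o1", "gpt-5")


def generation_kwargs(model_name: str, temperature: float = 0.7) -> dict:
    """Return request kwargs, leaving out temperature where unsupported."""
    if model_name.lower().startswith(NO_TEMPERATURE_PREFIXES):
        return {}
    return {"temperature": temperature}


print(generation_kwargs("gpt-4o"))
print(generation_kwargs("o1"))
```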

### 2. Chain of Thoughts (`chain_of_thoughts`)
**Sequential reasoning** that breaks down problems into step-by-step solutions.

```bash
python -m src.main --task-name bridge --algorithm-name chain_of_thoughts --backbone-llm-name gpt-4
```

**Features:**
- Step-by-step problem decomposition
- Sequential reasoning chains
- Good for structured problem solving

### 3. Tree of Thoughts (`tot`)
**Branching reasoning** that explores multiple solution paths simultaneously.

```bash
python -m src.main --task-name bridge --algorithm-name tot --backbone-llm-name gpt-4 --num-thoughts-per-step 5 --search-depth 3
```

**Features:**
- Multiple parallel reasoning branches
- Tree-based exploration of solution space
- Backtracking and pruning capabilities
- Configurable search depth and thoughts per step

### 4. Enhanced Graph of Thoughts (`egot`) ⭐ **NEW!**
**Advanced graph-based reasoning** with nested traversal and dynamic temperature control.

```bash
python -m src.main --task-name bridge --algorithm-name egot --backbone-llm-name gpt-4 --num-thoughts-per-step 5 --search-depth 3
```

**Features:**
- **Graph-based reasoning structure** with multiple node types (Method, Answering, Evaluation, Aggregate)
- **Nested graph traversal** where each node generates multiple child nodes
- **Dynamic temperature control** using cosine annealing based on confidence scores
- **Exponential solution generation** through nested loops (fixed depth of 3 levels)
- **Multi-root exploration** with 3 independent graph traversals
- **Confidence-based solution ranking** using `s * Pr(s)` scoring
- **No intermediate logging** (optimized for performance)

**Algorithm Structure:**
1. **Method Node**: Analyzes the problem and generates method analysis
2. **Answering Node**: Generates creative solutions with dynamic temperature
3. **Evaluation Node**: Scores solutions for quality and confidence
4. **Aggregate Rationale Node**: Combines rationales for next iteration

**Key Parameters:**
- `graph_depth`: Fixed at 3 levels for controlled exploration
- `num_root_nodes`: 3 independent starting points
- `num_final_solutions`: Number of solutions generated per node
- `tmax`: Maximum temperature for dynamic control (0.7)
- `e`: Temperature scaling factor (0.1)
- `threshold_extreme`: High confidence threshold (70)
- `threshold_normal`: Normal confidence threshold (50)

**Temperature Calculation:**
- Uses cosine annealing: `tu = tmin + 0.5 * (tmax - tmin) * (1 + cos(Nc / Nt))`
- `c = s * (Pr(s) ** (1/e))` for confidence-based scaling
- `tmin = 1 - sqrt(1 - (c - 1)^2)` for bounded temperature range
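
The three formulas above translate directly into the sketch below. Several points are assumptions, since this README does not define them: `Nc` and `Nt` are treated as the current and total step counts, `s` and `Pr(s)` are taken to lie in `[0, 1]`, and the cosine argument is used exactly as written (textbook cosine annealing scales the ratio by π, which may or may not apply here).

```python
import math


def confidence(score: float, prob: float, e: float = 0.1) -> float:
    """c = s * Pr(s) ** (1 / e); score and prob are assumed to be in [0, 1]."""
    return score * prob ** (1.0 / e)


def min_temperature(c: float) -> float:
    """tmin = 1 - sqrt(1 - (c - 1)^2), which is only defined for c in [0, 2]."""
    inner = 1.0 - (c - 1.0) ** 2
    if inner < 0:
        raise ValueError("c must lie in [0, 2]")
    return 1.0 - math.sqrt(inner)


def node_temperature(c: float, n_current: int, n_total: int, tmax: float = 0.7) -> float:
    """tu = tmin + 0.5 * (tmax - tmin) * (1 + cos(Nc / Nt)), as written above."""
    tmin = min_temperature(c)
    return tmin + 0.5 * (tmax - tmin) * (1.0 + math.cos(n_current / n_total))


c = confidence(score=1.0, prob=1.0)
print(node_temperature(c, n_current=0, n_total=3))
```

With full confidence (`c = 1`) the lower bound `tmin` is 0, so the first step starts at `tmax`; with `c = 0` the lower bound rises to 1.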

### 5. Combinational Creative Reasoning (`combinational_creative_reasoning`)
**Analogical reasoning** that finds similar problems and combines their solutions.

```bash
python -m src.main --task-name bridge --algorithm-name combinational_creative_reasoning --backbone-llm-name gpt-4 --num-analogous-problems 10 --num-solutions-per-problem 5 --num-final-solutions 3 --num-solutions-combinational 20
```

**Features:**
- Finds analogous problems from different domains
- Extracts solutions from analogous problems
- Combines ideas through creative recombination
- Configurable number of analogous problems and solutions
- Configurable number of final solutions and combinational solutions

### 6. Exploratory Creative Reasoning (`exploratory_creative_reasoning`)
**Extended combinational reasoning** with exploratory idea expansion.

```bash
python -m src.main --task-name bridge --algorithm-name exploratory_creative_reasoning --backbone-llm-name gpt-4 --num-exploratory-ideas 50 --num-analogous-problems 10 --num-solutions-per-problem 5 --num-final-solutions 3 --num-solutions-combinational 20
```

**Features:**
- All features of combinational reasoning
- **Exploratory idea expansion** from diverse domains
- Generates ideas that are "far in surface level but similar in role/functionality"
- Enhanced creativity through cross-domain inspiration
- Configurable number of exploratory ideas and other parameters

### 7. Transformative Creative Reasoning (`transformative_creative_reasoning`) ⭐ **NEW!**
**Rule mutation approach** that transforms problem constraints and generates diverse solutions.

```bash
python -m src.main --task-name bridge --algorithm-name transformative_creative_reasoning --backbone-llm-name gpt-4 --num-new-rule-sets 3 --num-analogous-problems 10 --num-solutions-per-problem 5 --num-exploratory-ideas 50 --num-final-solutions 3 --num-solutions-combinational 20
```

**Features:**
- **Rule exposure**: Identifies explicit and hidden problem constraints
- **Analogous rule discovery**: Finds rules from diverse domains (biology, computer science, architecture, etc.)
- **Rule mutation**: Generates new rule sets through:
  - **DROP**: Remove original rules
  - **VARY**: Replace with analogous rules from different domains
  - **ADD**: Incorporate new rules from analogous domains
- **Iterative exploration**: Runs exploratory reasoning on each new rule set
- **Solution aggregation**: Ranks and selects top solutions across all rule sets
- **Comprehensive logging**: Tracks all intermediate steps and LLM interactions

**Workflow:**
1. **Expose Rules**: Identify all explicit and hidden problem constraints
2. **Find Analogous Rules**: Discover similar rules from diverse domains
3. **Generate New Rule Sets**: Create mutated rule sets through DROP/VARY/ADD operations
4. **Exploratory Reasoning**: Run exploratory creative reasoning on each new rule set
5. **Aggregate Solutions**: Rank and select the top k solutions across all rule sets
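
The DROP/VARY/ADD operations in step 3 can be illustrated with plain list manipulation. In the project these choices are presumably made by the LLM; the sketch below substitutes random selection as a stand-in, and the names `mutate_rules` and `generate_rule_sets` are hypothetical.

```python
import random
from typing import List


def mutate_rules(original: List[str], analogous: List[str], rng: random.Random) -> List[str]:
    """Apply one DROP, one VARY, and one ADD to a copy of the rule set."""
    rules = list(original)
    # DROP: remove one of the original rules.
    if rules:
        rules.pop(rng.randrange(len(rules)))
    # VARY: replace a remaining rule with an analogous rule.
    if rules and analogous:
        rules[rng.randrange(len(rules))] = rng.choice(analogous)
    # ADD: append a further analogous rule.
    if analogous:
        rules.append(rng.choice(analogous))
    return rules


def generate_rule_sets(original: List[str], analogous: List[str], count: int = 3, seed: int = 0):
    """Produce `count` mutated rule sets (compare --num-new-rule-sets)."""
    rng = random.Random(seed)
    return [mutate_rules(original, analogous, rng) for _ in range(count)]


original_rules = ["no new infrastructure", "one direction at a time", "fixed headway"]
analogous_rules = ["token passing (networking)", "tidal scheduling (ports)"]
for rule_set in generate_rule_sets(original_rules, analogous_rules):
    print(rule_set)
```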

**Advanced Parameters:**
- `--num-new-rule-sets`: Number of mutated rule sets to generate (default: 3)
- `--num-analogous-problems`: Number of analogous problems per rule set (default: 10)
- `--num-solutions-per-problem`: Number of solutions per analogous problem (default: 5)
- `--num-exploratory-ideas`: Number of exploratory ideas per rule set (default: 50)

## Enhanced Evaluation System

### Multi-Solution Support

The system now automatically:
- **Extracts multiple solutions** from raw algorithm output using `gemini-2.5-pro`
- **Assigns unique IDs** to each solution for tracking
- **Evaluates each solution independently** across all metrics
- **Logs results separately** with solution identification

### Intelligent Scoring

#### Feasibility Scoring
- **LLM-based evaluation** against task-specific check points
- Uses `gemini-2.5-pro` with `temperature=0` for consistency
- Scores from 0.0 (no check points met) to 1.0 (all check points met)

#### Utility Scoring
- **Task relevance assessment** using LLM understanding
- Evaluates how well solutions address the core problem
- Scores from 0.0 (completely useless) to 1.0 (perfect solution)

#### Novelty Scoring
- **Semantic similarity analysis** using `gemini-embedding-001`
- Compares against known solutions using cosine similarity
- Novelty = 1.0 - max_similarity (higher = more novel)
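
A minimal sketch of the novelty formula follows. It uses only the standard library so it runs without extra packages, whereas the project relies on NumPy for vector operations. The clamp to `[0, 1]` and the score of 1.0 when no known solutions exist are assumptions, and the vectors shown are toy values rather than real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def novelty_score(solution_vec: Sequence[float], known_vecs: Sequence[Sequence[float]]) -> float:
    """Novelty = 1.0 - max similarity to any known solution, clamped to [0, 1]."""
    if not known_vecs:
        return 1.0  # assumption: nothing to compare against means fully novel
    max_sim = max(cosine_similarity(solution_vec, k) for k in known_vecs)
    return min(1.0, max(0.0, 1.0 - max_sim))


print(novelty_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
print(novelty_score([1.0, 0.0], [[0.0, 1.0]]))
```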

#### Creativity Scoring
- **Weighted combination** of all metrics:
  - Feasibility: 30%
  - Utility: 30%
  - Novelty: 40%
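
The weighted combination amounts to a few lines of arithmetic. The sketch below uses the weights listed above; the function name `creativity_score` is illustrative and may not match the evaluator's own code.

```python
# Weights as documented above; they sum to 1.0.
WEIGHTS = {"feasibility": 0.3, "utility": 0.3, "novelty": 0.4}


def creativity_score(feasibility: float, utility: float, novelty: float) -> float:
    """Weighted sum of the three component scores."""
    return (
        WEIGHTS["feasibility"] * feasibility
        + WEIGHTS["utility"] * utility
        + WEIGHTS["novelty"] * novelty
    )


# 0.3 * 0.8 + 0.3 * 0.9 + 0.4 * 0.7 = 0.24 + 0.27 + 0.28
print(round(creativity_score(0.8, 0.9, 0.7), 2))
```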

### Configuration Files

#### Task Descriptions

**Bridge Task (Task 1: One-Lane Bridge Traffic Policy)**
```txt
# src/tasks/bridge.txt
### Task 1: One-Lane Bridge Traffic Policy

* **Context:** A single-lane bridge connects two points, allowing traffic in only one direction at a time. A known number of vehicles make daily round trips for work and grocery purposes across five distinct time windows. The system has fixed physical parameters: a minimum vehicle headway, a constant per-vehicle crossing time, and a penalty delay for switching traffic direction.
* **Objective:** Design a mobility policy that minimizes the overall average delay for all vehicles.
* **Constraints:** The policy must operate within two hard constraints: 1) No new physical infrastructure (e.g., lanes, bridges) can be built. 2) The safety protocol dictating that only one direction can use the bridge at any given moment must never be violated.

Present your final solution as a single, self-contained paragraph that functions as an operational blueprint. Focus on describing the core functional mechanism by identifying the essential components, their inputs and outputs, and how they interact to produce the final result. The level of detail must be sufficient to make the process unambiguous and reproducible in principle, but should not be a granular, step-by-step implementation guide. Omit all introductory framing, meta-commentary, or discussion of advantages and disadvantages.
IMPORTANT: In order for fair comparison later, your final solution should have one main core component/mechanism and should not add multiple components to achieve incremental improvement over multiple components.
```

**Electricity Task (Task 2: Residential Electricity Tariff and DR Program)**
```txt
# src/tasks/electricity.txt
### Task 2: Residential Electricity Tariff and DR Program

* **Context:** A residential neighborhood electrical feeder experiences significant strain on its infrastructure due to high consumption during predictable peak hours (late afternoon and early evening).
* **Objective:** Design a comprehensive electricity tariff and Demand Response (DR) program that reduces the daily peak load on the feeder's infrastructure by either shifting usage to off-peak hours or encouraging conservation.
* **Constraints:** The program must operate within two hard constraints: 1) No new physical capacity (e.g., larger transformers, upgraded power lines) can be added. 2) The program must never curtail power to designated critical medical loads, ensuring the safety of vulnerable residents.

**Present your final solution as a concise, self-contained paragraph. Focus on describing the core strategic concept and its mechanics, omitting any introductory framing, meta-commentary, or discussion of its supposed advantages and disadvantages.**
IMPORTANT: In order for fair comparison later, your final solution should have one main core component/mechanism and should not add multiple components to achieve incremental improvement over multiple components.
```

**Society Task (Task 3: Community Social Cohesion Intervention)**
```txt
# src/tasks/society.txt
### Task 3: Community Social Cohesion Intervention

* **Context:** A community is composed of several distinct social groups. An intervention will be deployed over a fixed period of T weeks to improve social integration.
* **Objective:** Design a multi-faceted intervention package that strengthens cross-group cohesion. Success is measured by tangible improvements in three metrics: the number of new cross-group ties, the frequency of cross-group mixing, and the volume of cross-group mutual aid.
* **Constraints:** The program must operate within three hard ethical constraints: 1) Participant privacy must be protected via opt-in consent with no PII exposure. 2) The design must be inclusive, ensuring no group is systematically excluded. 3) A safe environment for all interactions must be maintained.

**Present your final solution as a concise, self-contained paragraph. Focus on describing the core strategic concept and its mechanics, omitting any introductory framing, meta-commentary, or discussion of its supposed advantages and disadvantages.**
IMPORTANT: In order for fair comparison later, your final solution should have one main core component/mechanism and should not add multiple components to achieve incremental improvement over multiple components.
```

#### Feasibility Check Points
```txt
# src/feasibility_check_points/bridge.txt
The solution must be physically possible to construct
The solution must use available materials
The solution must be stable and safe
The solution must span the required distance
The solution must support the required weight
```

#### Known Solutions
```txt
# src/known_solutions/bridge.txt
A simple wooden plank bridge
A rope bridge with wooden planks
A stone arch bridge
A suspension bridge with cables
A truss bridge made of metal beams
```

## Architecture

### Main Orchestrator (`src/main.py`)

The enhanced workflow controller that:
1. **Loads comprehensive task configuration** from multiple structured files
2. **Creates TaskConfig objects** with all necessary data
3. **Executes algorithms** to generate solutions
4. **Runs enhanced evaluation** with LLM-based scoring
5. **Logs multiple results** with solution identification

### Data Models

#### TaskConfig
```python
from src.data_models.task_config import TaskConfig

config = TaskConfig(
    feasibility_check_points=["Check 1", "Check 2"],
    task_description="Task description",
    known_solutions=["Solution 1", "Solution 2"]
)
```

#### EvaluationResult
```python
from src.data_models.evaluation_result import EvaluationResult

result = EvaluationResult(
    original_solution_id="sol_1_abc123",
    feasibility_score=0.8,
    utility_score=0.9,
    novelty_score=0.7,
    creativity_score=0.79
)
```

### Enhanced LLM API Client (`src/utils/llm_api_client.py`)

A unified client with **embedding support**:

```python
from src.utils.llm_api_client import LLMAPIClient

client = LLMAPIClient()

# Text generation
solution = client.call_gemini("Your prompt", "gemini-2.5-pro", temperature=0)

# Text embeddings for novelty scoring
embeddings = client.embed_content("gemini-embedding-001", ["text1", "text2"])
```

**New Features:**
- **Embedding generation** using `gemini-embedding-001`
- **Batch processing** for multiple texts
- **Semantic similarity** calculations
- **Enhanced error handling** for embedding operations

### Enhanced Evaluation Engine (`src/evaluators/run_evaluation.py`)

The core evaluation system that:
1. **Extracts solutions** using LLM parsing
2. **Calculates feasibility scores** against check points
3. **Assesses utility** based on task relevance
4. **Measures novelty** using semantic embeddings
5. **Computes creativity** as weighted combination
6. **Returns structured results** for each solution

## Results and Logging

### Enhanced CSV Output

Results are logged to `results/results.csv` with columns:
- `datetime`: ISO format timestamp
- `algorithm_name`: Name of the algorithm used
- `task_name`: Name of the task solved
- `solution`: Generated solution text
- `feasibility_score`: Feasibility evaluation score (0.0-1.0)
- `utility_score`: Utility evaluation score (0.0-1.0)
- `novelty_score`: Novelty evaluation score (0.0-1.0)
- `creativity_score`: Overall creativity score (0.0-1.0)
- **`original_solution_id`**: Unique identifier for each solution
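
Because every solution is one row, per-algorithm summaries can be computed with the standard `csv` module. The sketch below averages `creativity_score` by `algorithm_name`; the sample rows are invented for illustration, and the column order is assumed to follow the list above.

```python
import csv
import io
from collections import defaultdict


def mean_creativity_by_algorithm(csv_file) -> dict:
    """Average the creativity_score column for each algorithm_name."""
    totals = defaultdict(lambda: [0.0, 0])  # name -> [running sum, row count]
    for row in csv.DictReader(csv_file):
        entry = totals[row["algorithm_name"]]
        entry[0] += float(row["creativity_score"])
        entry[1] += 1
    return {name: total / count for name, (total, count) in totals.items()}


# Made-up rows, used only to show the expected shape of the file.
SAMPLE = """datetime,algorithm_name,task_name,solution,feasibility_score,utility_score,novelty_score,creativity_score,original_solution_id
2025-01-15T14:30:45,tot,bridge,Example A,0.5,0.5,0.5,0.5,sol_1_aaa111
2025-01-15T14:31:10,tot,bridge,Example B,0.7,0.7,0.7,0.7,sol_2_bbb222
2025-01-15T14:32:00,egot,bridge,Example C,0.8,0.9,0.7,0.79,sol_1_ccc333
"""

print(mean_creativity_by_algorithm(io.StringIO(SAMPLE)))
```

To run it on real output, pass an open file instead, for example `open("results/results.csv", newline="")`.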

### Multiple Solution Handling

Each extracted solution is logged as a separate row, enabling:
- **Individual solution tracking** across evaluation runs
- **Performance analysis** of different solution approaches
- **Comparative assessment** of algorithm outputs
- **Detailed reporting** for research and analysis

## Adding New Tasks

1. **Create task description** in `src/tasks/{task_name}.txt`
2. **Define feasibility check points** in `src/feasibility_check_points/{task_name}.txt`
3. **List known solutions** in `src/known_solutions/{task_name}.txt`
4. **Use the task** with: `--task-name {task_name}`
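
A quick way to confirm that steps 1 to 3 are complete is to check for the three files before launching a run. This is a hypothetical helper, not part of the project; it covers only the three directories named in the steps above.

```python
from pathlib import Path

# Directories named in steps 1-3 above, relative to the src/ folder.
TASK_DIRS = ("tasks", "feasibility_check_points", "known_solutions")


def missing_task_files(task_name: str, src_root: Path = Path("src")) -> list:
    """Return the expected {task_name}.txt paths that do not exist yet."""
    expected = [src_root / directory / f"{task_name}.txt" for directory in TASK_DIRS]
    return [path for path in expected if not path.is_file()]


for path in missing_task_files("energy"):
    print("Missing:", path)
```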

### Example Task Setup

```bash
python -m venv .venv
# Linux/macOS:
source .venv/bin/activate
# Windows (PowerShell):
# .\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
```

```bash
# Create task files
echo "Design a sustainable energy solution" > src/tasks/energy.txt
echo "Must use renewable resources" > src/feasibility_check_points/energy.txt
echo "Solar panels" > src/known_solutions/energy.txt
echo "Wind turbines" >> src/known_solutions/energy.txt

# Run the new task with different algorithms
python -m src.main --task-name energy --algorithm-name commercialized_reasoning_model --backbone-llm-name gpt-4o

python -m src.main --task-name energy --algorithm-name combinational_creative_reasoning --backbone-llm-name gpt-4o --num-analogous-problems 10 --num-solutions-per-problem 5 --num-final-solutions 3 --num-solutions-combinational 20

python -m src.main --task-name energy --algorithm-name exploratory_creative_reasoning --backbone-llm-name gpt-4o --num-analogous-problems 10 --num-solutions-per-problem 5 --num-exploratory-ideas 50 --num-final-solutions 3 --num-solutions-combinational 20
```

### Automated Scripts

The project includes several shell scripts for automated execution:

#### Run All Algorithms Script
```bash
# Run all algorithms with default parameters
./run_all_algorithms.sh
```

This script executes all available algorithms sequentially with optimized parameters for comprehensive evaluation.

#### Task-Specific Scripts
```bash
# Run all algorithms on bridge task with Claude
./run_all_algorithms_bridge_claude.sh

# Run all algorithms on electricity task with GPT-4o
./run_all_algorithms_electricity_gpt4o.sh

# Run all algorithms on society task with GPT-4o
./run_all_algorithms_society_gpt4o.sh
```

#### Sensitivity Analysis Script
```bash
# Run sensitivity analysis on algorithm parameters
./run_all_algorithms_sensitivity_analysis.sh
```

This script performs systematic parameter sensitivity analysis to understand how different parameter values affect algorithm performance.

## Testing

### Comprehensive Test Suite

```bash
# Run all tests
pytest

# Run specific test categories
pytest tests/unit/           # Unit tests for all components
pytest tests/integration/    # Integration tests for workflows

# Run with verbose output
pytest -v

# Run specific test files
pytest tests/unit/test_run_evaluation.py -v
```

### Test Coverage

**Unit Tests (50+ tests):**
- Data model validation and boundary testing
- LLM API client functionality and error handling
- Evaluation function logic and edge cases
- Algorithm integration and error propagation
- Commercialized reasoning model algorithm testing
- Chain of thoughts algorithm testing
- Tree of thoughts algorithm testing
- Enhanced Graph of Thoughts (EGoT) algorithm testing
- Combinational creative reasoning algorithm testing
- Exploratory creative reasoning algorithm testing
- Transformative creative reasoning algorithm testing

**Integration Tests (14+ tests):**
- Complete workflow execution
- Configuration file loading
- Cross-component functionality
- Error handling scenarios
- EGoT workflow integration testing

**All tests pass with 100% success rate**

## Dependencies

### Core Dependencies
- `openai>=1.0.0` - OpenAI API client
- `google-generativeai>=0.3.0` - Google Gemini API client (required for evaluation)
- `anthropic>=0.7.0` - Anthropic Claude API client
- `requests>=2.31.0` - HTTP client for DeepSeek API
- `python-dotenv>=0.19.0` - Environment variable management

### Data Processing Dependencies
- `numpy>=1.21.0` - Efficient vector operations and cosine similarity
- `pandas>=1.3.0` - Data manipulation and analysis
- `pydantic>=2.0.0` - Data validation and serialization

### Testing Dependencies
- `pytest>=7.0.0` - Testing framework
- `pytest-mock>=3.10.0` - Mocking utilities

## Performance and Scalability

### LLM Integration
- **Near-deterministic evaluation outputs** using `temperature=0` for consistent scoring across runs
- **Batch embedding processing** for efficient novelty scoring
- **Graceful fallbacks** to prevent system failures
- **Error recovery** with default scoring when APIs fail

### Memory and Processing
- **Efficient vector operations** using NumPy
- **Streaming evaluation** for large solution sets
- **Configurable scoring weights** for different evaluation priorities

## Future Enhancements

- **✅ Enhanced evaluation system (COMPLETED)**
- **✅ Multi-solution support (COMPLETED)**
- **✅ LLM-based scoring (COMPLETED)**
- **✅ Structured data models (COMPLETED)**
- **✅ Comprehensive testing (COMPLETED)**
- **✅ Chain of Thoughts algorithm (COMPLETED)**
- **✅ Tree of Thoughts algorithm (COMPLETED)**
- **✅ Enhanced Graph of Thoughts (EGoT) algorithm (COMPLETED)**
- **✅ Combinational Creative Reasoning (COMPLETED)**
- **✅ Exploratory Creative Reasoning (COMPLETED)**
- **✅ Transformative Creative Reasoning (COMPLETED)**
- **Advanced prompt engineering** for better LLM responses
- **Embedding caching** for improved performance
- **Configurable scoring weights** and thresholds
- **Real-time evaluation** during algorithm execution
- **Cost tracking** and optimization for LLM usage
- **Web interface** for workflow management
- **Performance benchmarking** and optimization
- **Additional rule mutation strategies** for transformative reasoning
- **Cross-algorithm comparison tools** for performance analysis

## Troubleshooting

### Common Issues

1. **Missing `GEMINI_API_KEY`**: The evaluation system requires this key; set it in `.env`
2. **LLM API failures**: The system falls back to default scores instead of aborting
3. **File not found errors**: Ensure the task has a file in each of `src/tasks/`, `src/feasibility_check_points/`, and `src/known_solutions/`
4. **Import errors**: Install the required dependencies with `pip install -r requirements.txt`
5. **Algorithm name changes**: The `o1` algorithm has been renamed to `commercialized_reasoning_model` for better clarity and dynamic LLM support

### Error Recovery

The system includes comprehensive error handling:
- **LLM failures** → Default scoring with fallback values
- **File errors** → Descriptive error messages and graceful exit
- **Validation errors** → Clear feedback on configuration issues
- **API rate limits** → Automatic retry and fallback mechanisms
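
The retry-then-fallback pattern described above can be sketched as follows. The retry count, the exponential backoff, the default score of 0.0, and the name `score_with_fallback` are all assumptions for illustration; the project's actual values and exception handling may differ.

```python
import time
from typing import Callable


def score_with_fallback(
    score_fn: Callable[[], float],
    retries: int = 3,
    delay_s: float = 0.0,
    default: float = 0.0,
) -> float:
    """Call score_fn up to `retries` times, then return the default score."""
    for attempt in range(retries):
        try:
            return score_fn()
        except Exception:  # broad on purpose: any API failure triggers a retry
            if attempt < retries - 1:
                time.sleep(delay_s * (2 ** attempt))  # exponential backoff
    return default


def always_fails() -> float:
    raise RuntimeError("simulated API failure")


print(score_with_fallback(always_fails, retries=2))
```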

## License

See LICENSE file for details.

## Contributing

1. Follow the established architecture patterns
2. Add comprehensive tests for new functionality
3. Update documentation for any API changes
4. Ensure all tests pass before submitting changes
5. Follow the error handling and validation patterns

## Support

For issues and questions:
1. Check the comprehensive test suite for usage examples
2. Review the `IMPLEMENTATION_SUMMARY.md` for technical details
3. Examine the error messages for troubleshooting guidance
4. Verify all configuration files are properly set up
