# Trip Score

A comprehensive evaluation system for AI-powered trip planning agents, supporting multiple agent types and evaluation methodologies.

## Table of Contents

- [Overview](#overview)
- [Quick Start](#quick-start)
- [LLM Configuration](#llm-configuration)
- [Flexible Experiment Runner](#flexible-experiment-runner-run_exp_flexiblepy)
- [Data Format Description](#data-format-description)
- [FAQ](#faq)
- [Important Notes](#important-notes)

## Overview

This system provides a flexible framework for evaluating different AI agents in trip planning tasks. It supports various agent types including Direct, LLMNeSy, RuleNeSy, and LLM-modulo approaches, with comprehensive evaluation metrics covering format validation, commonsense reasoning, soft constraints, user preferences, and more.

### Key Features

- **Multiple Agent Support**: Direct, LLMNeSy, RuleNeSy, LLM-modulo agents
- **Flexible Evaluation**: Separate generation and evaluation modes
- **Comprehensive Metrics**: Format, commonsense, soft constraints, preferences, user requests
- **LLM Integration**: Support for multiple language models (GPT-4o, Gemini, DeepSeek, Qwen)
- **Batch Processing**: Efficient handling of large datasets
- **Detailed Logging**: Comprehensive logging and debugging information

### Project Structure

```
Trip Score/
├── agent/                    # Agent implementations
│   ├── base.py              # Base agent class
│   ├── load_model.py        # Model loading utilities
│   ├── direct/              # Direct agent
│   ├── cot/                 # Chain-of-Thought agent
│   ├── nesy_agent/          # NeSy agent with logging
│   ├── llm_modulo/          # LLM-modulo agent
│   ├── hypertree/           # HyperTree agent
│   └── ttg_agent/           # TTG agent
├── evaluators/              # Evaluation modules
│   ├── main_evaluator.py    # Main evaluation coordinator
│   ├── format_evaluator.py  # Format validation
│   ├── commonsense_evaluator.py # Commonsense reasoning
│   ├── soft_constraint_evaluator.py # Soft constraints
│   ├── preference_evaluator.py # User preferences
│   ├── user_request_evaluator.py # User requests
│   └── ...                  # Other evaluators
├── utils/                   # Utility functions
│   ├── llms.py              # LLM utilities
│   ├── poi_analyzer.py      # POI analysis
│   ├── time_utils.py        # Time utilities
│   └── ...                  # Other utilities
├── config/                  # Configuration files
│   └── llms_config.json     # LLM configurations
├── data/                    # Data loading utilities
├── data_base/               # Base datasets
├── requirements.txt         # Dependencies
└── run_exp_flexible.py      # Main experiment runner
```

## Quick Start

1. **Install Dependencies**:
   ```bash
   pip install -r requirements.txt
   ```

2. **Configure LLM APIs**:
   Edit `config/llms_config.json` with your API keys

3. **Run Basic Experiment**:

   - For synthetic data:
     ```bash
     python run_exp_flexible.py --mode generate_only --agent Direct --llm gemini --splits synthesis
     ```
   - For real-world data:
     ```bash
     python run_exp_flexible.py --mode generate_only --agent Direct --llm gemini --splits generalized
     ```

4. **Evaluate Results**:

   - For synthetic data:
     ```bash
     python run_exp_flexible.py --mode evaluate_only --agent Direct --llm gemini --enable_LLM --splits synthesis
     ```
   - For real-world data:
     ```bash
     python run_exp_flexible.py --mode evaluate_only --agent Direct --llm gemini --enable_LLM --enable_user_request_eval --splits generalized
     ```

## LLM Configuration

### Configuration File Description

The system uses the `config/llms_config.json` file to manage configuration information for all large language models. This file contains API endpoints and key configurations for different LLMs.

#### Supported LLM Models

| Model Name | Config Key | API Type | Description |
|------------|------------|----------|-------------|
| Qwen3-14B | `Qwen3-14B` | Internal Proxy | 14B parameter Qwen model |
| Qwen3-8B | `Qwen3-8B` | Internal Proxy | 8B parameter Qwen model |
| Qwen3-32B | `Qwen3-32B` | Internal Proxy | 32B parameter Qwen model |
| DeepSeek | `deepseek-v3` | DeepSeek API | DeepSeek-V3 model |
| GPT-4o | `796-gpt-4o__2024-11-20` | OpenAI API | GPT-4o model |
| Gemini | `gemini-2.5-flash-latest` | Google API | Gemini 2.5 Flash model |
| Custom Model | `customized` | Custom API | User-defined LLM endpoint |

#### Configuration Format

```json
{
    "model_name": {
        "url": "API_endpoint_URL",
        "key": "API_key"
    }
}
```

#### Configuration Example

```json
{
    "Qwen3-14B": {
        "url": "your-api-url-here",
        "key": "your-api-key-here"
    },
    "deepseek-v3": {
        "url": "your-url",
        "key": "your-deepseek-key-here"
    },
    "gemini-2.5-flash-latest": {
        "url": "your-url",
        "key": "your-gemini-key-here"
    }
}
```

#### Configuration Steps

1. **Edit Configuration File**: Open the `config/llms_config.json` file
2. **Add API Keys**: Fill in your API keys in the `key` field for the corresponding models
3. **Verify Configuration**: Test the configuration by running experiments with the configured models
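
Before running a full experiment, a quick check of the file can catch unfinished entries. The sketch below assumes only the `{"model_name": {"url": ..., "key": ...}}` layout described in this section. The helper names `find_incomplete_entries` and `check_config_file`, and the idea that placeholders start with `your-`, are illustrative assumptions and not part of the project.

```python
import json
from pathlib import Path


def find_incomplete_entries(config: dict) -> dict:
    """Return {model_name: [problem, ...]} for entries that look unfinished."""
    problems = {}
    for name, entry in config.items():
        issues = []
        for field in ("url", "key"):
            value = entry.get(field) if isinstance(entry, dict) else None
            if not isinstance(value, str) or not value:
                issues.append(f"missing {field}")
            elif value.startswith("your-"):
                # Assumption: template placeholders start with "your-".
                issues.append(f"placeholder {field}")
        if issues:
            problems[name] = issues
    return problems


def check_config_file(path: str = "config/llms_config.json") -> dict:
    """Load the JSON config from disk and validate every model entry."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    return find_incomplete_entries(config)
```

An empty result from `check_config_file()` means every entry has a non-placeholder `url` and `key`; it does not confirm that the keys are accepted by the provider.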


#### Adding New Models

To add new LLM models, follow these steps:

1. **Update Configuration File**: Add new model configuration in `config/llms_config.json`
2. **Update Code**: Add corresponding LLM class implementation in `utils/llms.py`
3. **Register Model**: Register the new model in the `init_llm` function in `agent/load_model.py`
4. **Test and Verify**: Run tests to ensure the new model works properly

### Environment Variable Configuration (Optional)

In addition to configuration files, you can also set API keys through environment variables:

```bash
# Set environment variables
export DEEPSEEK_API_KEY="your-deepseek-key"
export OPENAI_API_KEY="your-openai-key"
export GEMINI_API_KEY="your-gemini-key"
```

```python
# Use environment variables in code
import os

api_key = os.getenv("DEEPSEEK_API_KEY", "default-key")
```
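
One way to combine the two sources is to let an environment variable override the `key` stored in the configuration file. The sketch below assumes that precedence; how the project itself reads these variables is not stated here, and `resolve_api_key` is a hypothetical helper.

```python
import os


def resolve_api_key(entry: dict, env_var: str) -> str:
    """Prefer the environment variable; fall back to the config entry's key."""
    value = os.getenv(env_var)
    if value:
        return value
    key = entry.get("key", "")
    if not key:
        raise ValueError(f"No API key found in {env_var} or the config entry")
    return key
```

Raising an error when neither source provides a key tends to surface misconfiguration earlier than silently falling back to a default string.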

## Flexible Experiment Runner (run_exp_flexible.py)

This is an enhanced experiment runner script that supports flexible separation of generation and evaluation modes.

#### Features
- **Three Execution Modes**: Generate only, evaluate only, generate and evaluate
- **Evaluation Control**: Independent control of user request evaluation and LLM evaluation
- **Directory Flexibility**: Support for custom input and output directories
- **Batch Evaluation**: Batch evaluation of existing result files
- **Force Re-evaluation**: Support for forcing re-evaluation of already evaluated results

#### Basic Usage

```bash
# Generate plans only 
python run_exp_flexible.py --mode generate_only

# Evaluate existing results only
python run_exp_flexible.py --mode evaluate_only --input_dir results/Direct_gemini

# Generate and evaluate
python run_exp_flexible.py --mode both --enable_LLM --enable_user_request_eval

# Generate with specific agent and model
python run_exp_flexible.py --mode generate_only --agent LLMNeSy --llm gpt-4o

# Evaluate with LLM evaluation enabled
python run_exp_flexible.py --mode evaluate_only --enable_LLM --input_dir results/Direct_gemini

# Force re-evaluation of all results
python run_exp_flexible.py --mode evaluate_only --force_reeval --enable_LLM
```

#### Parameters

| Parameter | Short | Default | Options | Description |
|-----------|-------|---------|---------|-------------|
| `--mode` | | `both` | `generate_only`, `evaluate_only`, `both` | Execution mode |
| `--input_dir` | | `None` | Path | Input directory for evaluation mode |
| `--output_dir` | | `None` | Path | Custom output directory |
| `--force_reeval` | | `False` | Flag | Force re-evaluation of already evaluated results |
| `--enable_user_request_eval` | | `False` | Flag | Enable user request evaluation |
| `--enable_LLM` | | `False` | Flag | Enable LLM evaluation |
| `--splits` | `-s` | | `synthesis`, `generalized` | Dataset type |
| `--agent` | `-a` | `Direct` | `Direct`, `LLMNeSy`, `HyperTree`, `TTG`, `LLM-modulo`, `CoT` | Agent type |
| `--llm` | `-l` | `gpt-4o` | `gemini`, `gpt-4o`, `deepseek` | Language model |
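
For readers who want to script against these options, the parameter table can be mirrored with `argparse` roughly as follows. This is a reconstruction from the table only, not the parser in `run_exp_flexible.py`; the function name `build_parser` is an assumption, and `--splits` defaults to `None` here because the table does not state its default.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Reconstruct the documented CLI surface; not the script's actual parser."""
    parser = argparse.ArgumentParser(description="Flexible experiment runner (sketch)")
    parser.add_argument("--mode", default="both",
                        choices=["generate_only", "evaluate_only", "both"])
    parser.add_argument("--input_dir", default=None)
    parser.add_argument("--output_dir", default=None)
    parser.add_argument("--force_reeval", action="store_true")
    parser.add_argument("--enable_user_request_eval", action="store_true")
    parser.add_argument("--enable_LLM", action="store_true")
    # The table does not state a default for --splits, so None is used here.
    parser.add_argument("--splits", "-s", default=None,
                        choices=["synthesis", "generalized"])
    parser.add_argument("--agent", "-a", default="Direct",
                        choices=["Direct", "LLMNeSy", "HyperTree", "TTG",
                                 "LLM-modulo", "CoT"])
    parser.add_argument("--llm", "-l", default="gpt-4o",
                        choices=["gemini", "gpt-4o", "deepseek"])
    return parser


args = build_parser().parse_args(["--mode", "evaluate_only", "-a", "CoT", "--enable_LLM"])
```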

#### Use Cases

1. **Batch Re-evaluation**
   ```bash
   # Enable new evaluation features for all existing results
   python run_exp_flexible.py --mode evaluate_only --enable_LLM --force_reeval
   ```

2. **Cross-directory Evaluation**
   ```bash
   # Evaluate results from other methods
   python run_exp_flexible.py --mode evaluate_only --input_dir results/LLMNeSy_gpt-4o --output_dir results/LLMNeSy_gpt-4o_reeval --enable_LLM
   ```


## Data Format Description

### Input Data Format

The system accepts input data in JSON format with the following fields:

#### Query Fields

| Field            | Type    | Description                                                        |
|------------------|---------|--------------------------------------------------------------------|
| `message_id`     | Integer | Unique identifier for the query                                    |
| `context_id`     | Integer | Context identifier for grouping related queries                    |
| `day`            | String  | Number of travel days                                              |
| `departure`      | String  | Departure city name                                                |
| `arrive`         | String  | Destination city name                                              |
| `locale`         | String  | Language/locale setting (e.g., "en-US", "zh-CN")                   |
| `transportation` | String  | Transportation requirement flag                                    |
| `userQuery`      | String  | Natural language travel request                                    |
| `preference`     | String  | JSON string containing user preferences                            |
| `transport_pool` | String  | JSON string containing available transportation options            |
| `poi_pool`       | String  | JSON string containing available points of interest (POIs)         |
| `hotel_pool`     | String  | JSON string containing available hotel options                     |
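
Because several fields are JSON strings nested inside the record, they need a second decoding step before use. The sketch below shows one way to do that. The helper `parse_query` is hypothetical, every value in the example record is made up, and the pool contents are empty placeholders because their inner structure is not described here.

```python
import json

# Fields documented as JSON-encoded strings.
JSON_STRING_FIELDS = ("preference", "transport_pool", "poi_pool", "hotel_pool")


def parse_query(record: dict) -> dict:
    """Return a copy of the record with nested JSON strings decoded."""
    parsed = dict(record)
    for field in JSON_STRING_FIELDS:
        raw = record.get(field)
        if isinstance(raw, str) and raw:
            parsed[field] = json.loads(raw)
    # `day` is documented as a string holding the number of travel days.
    parsed["day"] = int(record["day"])
    return parsed


# Illustrative record; all values are invented for this example.
example = {
    "message_id": 1,
    "context_id": 1,
    "day": "3",
    "departure": "Shanghai",
    "arrive": "Beijing",
    "locale": "en-US",
    "transportation": "1",
    "userQuery": "Plan a relaxed three-day trip.",
    "preference": "{}",
    "transport_pool": "[]",
    "poi_pool": "[]",
    "hotel_pool": "[]",
}
query = parse_query(example)
```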


### Output Files

- **Result Files**: Same as original version, but with additional evaluation configuration information
- **Evaluation Summary**: Contains detailed statistics for both generation and evaluation
- **Mode Markers**: Results include records of execution mode and evaluation configuration

### Evaluation Metrics

The system evaluates plans across multiple dimensions:

| Metric | Description |
|--------|-------------|
| **Format Score** | JSON structure and required fields validation |
| **Commonsense Score** | Logical consistency and feasibility | 
| **Soft Constraint Score** | Schedule density, hotel consistency, etc. |
| **Preference Score** | Alignment with user preferences |
| **User Request Score** | Fulfillment of specific user requirements |
| **Total Score** | Weighted combination of all metrics |
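
To make "weighted combination" concrete, the sketch below computes a weighted average of the per-metric scores. It assumes each metric is a number on a common scale and that the total is normalised by the sum of the weights; the metric keys and the equal weights are hypothetical, and the actual formula and weights used by the evaluators may differ.

```python
def total_score(scores: dict, weights: dict) -> float:
    """Weighted average of per-metric scores, normalised by the weights used."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"Missing scores for: {sorted(missing)}")
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("Weights must sum to a positive value")
    return sum(scores[name] * w for name, w in weights.items()) / weight_sum


# Hypothetical equal weights, chosen only for illustration.
WEIGHTS = {
    "format": 1.0,
    "commonsense": 1.0,
    "soft_constraint": 1.0,
    "preference": 1.0,
    "user_request": 1.0,
}

# Made-up per-metric scores for a single plan.
example_scores = {name: 0.5 for name in WEIGHTS}
example_total = total_score(example_scores, WEIGHTS)
```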

## FAQ

### Q: How do I add a new agent type?
**A**: 
1. Create a new agent class in the `agent/` directory that inherits from `BaseAgent`
2. Implement the required methods (`run`, `solve`, etc.)
3. Register the agent in `agent/load_model.py`
4. Add the agent to the command-line options in `run_exp_flexible.py`
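
The steps above can be sketched as follows. The `BaseAgent` class here is a stand-in defined only so the example runs on its own; the project's real base class, its required methods, the plan format, and the registration mechanism in `agent/load_model.py` may all differ. `EchoAgent`, `AGENT_REGISTRY`, and `init_agent` are hypothetical names.

```python
from abc import ABC, abstractmethod


class BaseAgent(ABC):
    """Stand-in for the project's base class; the real interface may differ."""

    def __init__(self, llm=None):
        self.llm = llm

    @abstractmethod
    def run(self, query: dict) -> dict:
        """Produce a plan for one query."""


class EchoAgent(BaseAgent):
    """Toy agent that returns an empty itinerary with one entry per travel day."""

    def run(self, query: dict) -> dict:
        days = int(query["day"])
        # The "itinerary" layout is invented for this example.
        return {"itinerary": [{"day": d, "activities": []} for d in range(1, days + 1)]}


# Hypothetical registry mapping --agent names to classes.
AGENT_REGISTRY = {"Echo": EchoAgent}


def init_agent(name: str, **kwargs) -> BaseAgent:
    """Look up an agent class by its command-line name and instantiate it."""
    if name not in AGENT_REGISTRY:
        raise ValueError(f"Unknown agent: {name}")
    return AGENT_REGISTRY[name](**kwargs)


plan = init_agent("Echo").run({"day": "2"})
```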

### Q: How do I customize evaluation criteria?
**A**: 
1. Modify the corresponding evaluator classes in the `evaluators/` directory
2. Each evaluator implements specific evaluation logic
3. Update scoring weights in `main_evaluator.py` if needed
4. Test changes by running experiments with the modified evaluators

### Q: How do I separate the generation and evaluation processes?
**A**: 
1. **Generation Phase**: `python run_exp_flexible.py --mode generate_only --agent Direct --llm gemini`
2. **Evaluation Phase**: `python run_exp_flexible.py --mode evaluate_only --enable_LLM`
3. This approach is useful for:
   - Large-scale experiments
   - Debugging evaluation issues
   - Comparing different evaluation methods
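
The two phases can also be driven from a small Python script, which helps when looping over several agents or models. The sketch below only uses flags documented in this README; `build_command` and `run_two_phase` are hypothetical helpers, and the script is assumed to be run from the repository root.

```python
import subprocess
import sys


def build_command(mode: str, agent: str, llm: str, splits: str,
                  enable_llm: bool = False) -> list:
    """Assemble the argument list for one run_exp_flexible.py invocation."""
    cmd = [sys.executable, "run_exp_flexible.py", "--mode", mode,
           "--agent", agent, "--llm", llm, "--splits", splits]
    if enable_llm:
        cmd.append("--enable_LLM")
    return cmd


def run_two_phase(agent: str, llm: str, splits: str, dry_run: bool = True) -> list:
    """Generate plans first, then evaluate them in a separate process."""
    commands = [
        build_command("generate_only", agent, llm, splits),
        build_command("evaluate_only", agent, llm, splits, enable_llm=True),
    ]
    if not dry_run:
        for cmd in commands:
            # check=True stops before evaluation if generation fails.
            subprocess.run(cmd, check=True)
    return commands


commands = run_two_phase("Direct", "gemini", "synthesis")
```

With `dry_run=True` (the default here) nothing is executed, so the returned lists can be inspected or logged before committing to a long run.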

### Q: How do I batch re-evaluate existing results?
**A**: 
```bash
# Re-evaluate all results with LLM evaluation
python run_exp_flexible.py --mode evaluate_only --force_reeval --enable_LLM

# Re-evaluate specific directory
python run_exp_flexible.py --mode evaluate_only --input_dir results/Direct_gemini --force_reeval
```

### Q: How do I handle different datasets?
**A**: 
- Use `--splits` parameter to specify dataset type:
  - `synthesis`: Synthetic dataset
  - `generalized`: Generalized dataset

### Q: How do I optimize performance?
**A**: 
1. Use `--skip 1` for large datasets to process every other case
2. Enable caching by setting appropriate `cache_dir`
3. Use `--mode generate_only` first, then evaluate separately
4. Consider using faster models for initial testing
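
As an illustration of the first tip, the sketch below reads `--skip N` as "skip N cases after each processed one", which matches the "every other case" description for `N = 1`. This reading is an assumption; the runner's actual handling of `--skip` may differ, and `select_cases` is a hypothetical helper.

```python
def select_cases(cases: list, skip: int = 0) -> list:
    """Keep one case, then skip `skip` cases, and repeat."""
    if skip < 0:
        raise ValueError("skip must be non-negative")
    return cases[:: skip + 1]


# Six placeholder case ids; with skip=1 every other one is kept.
subset = select_cases(list(range(6)), skip=1)
```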

## Important Notes

1. **API Key Configuration**: Ensure all required API keys are properly configured in `config/llms_config.json`
2. **Configuration File Security**: Do not commit configuration files with real API keys to version control
3. **Model Availability**: Some models may require a specific network environment or access permissions
4. **Large Dataset Processing**: Large datasets take considerable time to process; consider using the `--skip 1` parameter
5. **Cache Management**: Cache files can be safely deleted to rerun experiments
6. **Debug Information**: Result files contain complete debugging information for troubleshooting
7. **Evaluation Mode**: When using `run_exp_flexible.py`, evaluation mode automatically saves updated result files
8. **LLM Evaluation**: LLM evaluation features take more time; consider generating plans first and evaluating them afterwards
9. **Force Re-evaluation**: `--force_reeval` overwrites existing evaluation results, so use it with caution
