# SimBench: A Large-Scale Benchmark for Simulating Human Behavior

[![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC%20BY--NC--SA%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)

## Overview

Simulations of human behavior using Large Language Models (LLMs) offer exciting prospects for the social and behavioral sciences. However, their utility hinges on their faithfulness to real human behaviors. SimBench addresses this challenge by providing the first large-scale benchmark designed to evaluate how well LLMs can simulate group-level human behaviors across diverse settings and tasks.

SimBench compiles 20 datasets in a unified format, covering different types of behavior (e.g., decision-making vs. self-assessment) from hundreds of thousands of participants around the world. The benchmark is designed to help answer fundamental questions about when, how, and why LLM simulations succeed or fail.

### Key Features:
* **20 Diverse Datasets:** Covering a wide range of human behaviors including decision-making, self-assessment, judgment, and problem-solving.
* **Global Participant Diversity:** Data from participants across at least 130 countries, representing various cultural and socioeconomic backgrounds.
* **Unified Format:** All datasets are processed into a consistent structure, facilitating easy use and comparison.
* **Group-Level Focus:** Evaluates simulation of aggregated human response distributions.
* **Openly Licensed Framework:** Released under CC BY-NC-SA 4.0 to enable broad non-commercial access and use.

## Dataset Description

### Dataset Splits

SimBench provides two main splits for evaluation:

1. **`SimBenchPop` (Population-level Simulation):**
   * **Content:** Covers questions from all 20 datasets (7,167 test cases).
   * **Grouping:** Persona prompts are based on the general population of each source dataset.
   * **Purpose:** Measures the ability of LLMs to simulate responses of broad and diverse human populations.

2. **`SimBenchGrouped` (Demographically-Grouped Simulation):**
   * **Content:** Focuses on 5 large-scale survey datasets (AfroBarometer, ESS, ISSP, LatinoBarometro, OpinionQA), with questions selected for significant variation across demographic groups (6,343 test cases).
   * **Grouping:** Persona prompts specify particular participant sociodemographics (e.g., age, gender, ideology).
   * **Purpose:** Measures the ability of LLMs to simulate responses from specific participant groups.

### Data Fields

Each instance in the dataset contains the following primary fields:

* `dataset_name` (string): The name of the original dataset (e.g., "OpinionQA", "WisdomOfCrowds").
* `group_prompt_template` (string): A template string for constructing the persona/grouping prompt. This template may contain placeholders (e.g., `{age_group}`).
  * For `SimBenchPop`, this template often represents a default population.
  * For `SimBenchGrouped`, this template is designed to incorporate specific demographic attributes.
* `group_prompt_variable_map` (dict): A dictionary mapping placeholder variables in `group_prompt_template` to their specific values for the instance.
  * For `SimBenchPop`, this is often an empty dictionary (`{}`) if the template is self-contained.
  * For `SimBenchGrouped`, this contains the demographic attributes and their values (e.g., `{"age_group": "30-49", "country": "Kenya"}`).
  * The final persona prompt is constructed by formatting `group_prompt_template` with `group_prompt_variable_map`.
* `input_template` (string): The text of the question presented to participants/LLMs. This is typically the question stem.
* `human_answer` (dict): A dictionary representing the aggregated human response distribution for the given question and group.
  * Keys are the option labels (e.g., "A", "B", "1", "2").
  * Values are the proportions of human respondents who chose that option (e.g., `{"A": 0.25, "B": 0.50, ...}`).
* `group_size` (int): The number of human respondents contributing to the `human_answer` distribution for this specific instance.
* `auxiliary` (dict): A dictionary containing additional metadata from the original dataset. Contents vary by dataset but may include:
  * `task_id` or `question_id_original`: Original identifier for the question/task.
  * `correct_answer`: The correct option label, if applicable (e.g., for problem-solving tasks).
  * Other dataset-specific information.
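
The prompt construction described above can be sketched as follows. This is a minimal illustration that assumes the placeholders follow Python `str.format` syntax. The instance values, the helper names `build_persona_prompt` and `is_valid_distribution`, and the way the persona and question are joined are all made up for the example and are not taken from the dataset or the repository code.

```python
# Hypothetical instance mirroring the fields above (values are illustrative only).
instance = {
    "dataset_name": "ExampleSurvey",
    "group_prompt_template": "You are a respondent aged {age_group} living in {country}.",
    "group_prompt_variable_map": {"age_group": "30-49", "country": "Kenya"},
    "input_template": "How much do you trust the national government?\n"
                      "A. A lot\nB. Somewhat\nC. Not at all",
    "human_answer": {"A": 0.25, "B": 0.50, "C": 0.25},
    "group_size": 400,
}


def build_persona_prompt(template: str, variable_map: dict) -> str:
    """Fill the template's placeholders with the instance-specific values.

    With an empty map, a template without placeholders is returned unchanged.
    """
    return template.format(**variable_map)


def is_valid_distribution(answer: dict, tol: float = 1e-6) -> bool:
    """Check that proportions are non-negative and sum to 1 within a tolerance."""
    return all(p >= 0 for p in answer.values()) and abs(sum(answer.values()) - 1.0) <= tol


persona = build_persona_prompt(
    instance["group_prompt_template"], instance["group_prompt_variable_map"]
)
# Assumption: the persona is simply placed before the question text.
full_prompt = persona + "\n\n" + instance["input_template"]
print(full_prompt)
```

Checking `human_answer` with `is_valid_distribution` before evaluation is a cheap way to catch rows that were parsed incorrectly.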

## Getting Started

### Prerequisites
* Python 3.8+
* PyTorch
* Transformers
* (Other dependencies as listed in `requirements.txt`)

### Installation
1. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

2. **API Keys (Optional, for API-based models):**
   The `generate_answers.py` script can use API keys for OpenAI, Google, and OpenRouter. Either create a JSON file named `api_keys` in the root of this repository or set environment variables:
   * **File Method (`api_keys`):**
     ```json
     {
         "openai": "YOUR_OPENAI_API_KEY",
         "google": "YOUR_GOOGLE_API_KEY",
         "openrouter": "YOUR_OPENROUTER_API_KEY"
     }
     ```
   * **Environment Variables:**
     ```bash
     export OPENAI_API_KEY="YOUR_OPENAI_API_KEY"
     export GOOGLE_API_KEY="YOUR_GOOGLE_API_KEY"
     export OPENROUTER_API_KEY="YOUR_OPENROUTER_API_KEY"
     # (Optional) export API_KEYS_PATH="/path/to/your/api_keys_file.json"
     ```

3. **Data Access:**
   The dataset files `SimBenchPop.csv` and `SimBenchGrouped.csv` are included in this repository.
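
The two API-key options in step 2 could be resolved along the following lines. This is only a sketch: the helper name `load_api_keys` and the precedence rule (environment variables override the file) are assumptions, and `generate_answers.py` may handle this differently.

```python
import json
import os
from pathlib import Path

# Provider names as used in the JSON file, mapped to their environment variables.
ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def load_api_keys(default_path: str = "api_keys") -> dict:
    """Hypothetical helper: merge keys from the JSON file and the environment."""
    keys = {}
    # API_KEYS_PATH, if set, points at an alternative key file.
    path = Path(os.environ.get("API_KEYS_PATH", default_path))
    if path.is_file():
        with path.open(encoding="utf-8") as fh:
            keys.update(json.load(fh))
    # Assumption: environment variables take precedence over the file.
    for provider, env_var in ENV_VARS.items():
        value = os.environ.get(env_var)
        if value:
            keys[provider] = value
    return keys
```

Providers with no key in either place are simply absent from the returned dictionary, so a caller can check `"openai" in keys` before attempting an API request.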

## Running Simulations

The core script is `generate_answers.py`. It takes a SimBench `.pkl` file as input, prompts a specified LLM, and saves the results (including LLM response distributions) to an output `.pkl` file.

### Command-Line Arguments for `generate_answers.py`:

* `--input_file (str)`: Path to the input `.pkl` file (e.g., `SimBenchPop.pkl`).
* `--output_file (str)`: Path to save the output `.pkl` file with LLM responses.
* `--model_name (str)`: Name of the LLM to use.
  * For local Hugging Face models: e.g., `mistralai/Mistral-7B-Instruct-v0.1`
  * For OpenAI/Google API models: the model name as used by the provider's API
  * For OpenRouter API: e.g., `gpt-4o` (use `--openrouter` flag)
* `--method (str)`: Prompting method.
  * `token_prob`: Gets the probability of the next token being one of the option labels (e.g., 'A', 'B').
  * `verbalized`: Asks the LLM to output a JSON with estimated percentages for each option.
* `--debug (bool, optional)`: If set, runs on a small random sample (50 instances) of the dataset for quick testing.
* `--openrouter (bool, optional)`: If set, uses the OpenRouter API for models specified in `--model_name`.
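
To make the two prompting methods concrete, the sketch below shows how each kind of raw model output might be turned into a response distribution over option labels. The function names and input shapes (a mapping from label to log-probability, and a JSON string of percentages) are assumptions for illustration; the actual handling in `generate_answers.py` may differ.

```python
import json
import math


def distribution_from_logprobs(option_logprobs: dict) -> dict:
    """token_prob-style: renormalize next-token log-probabilities over the option labels."""
    # Subtract the maximum for numerical stability before exponentiating.
    max_lp = max(option_logprobs.values())
    weights = {label: math.exp(lp - max_lp) for label, lp in option_logprobs.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


def distribution_from_verbalized(raw_json: str, labels: list) -> dict:
    """verbalized-style: parse a JSON object of percentages and rescale it to sum to 1."""
    parsed = json.loads(raw_json)
    # Missing labels count as 0; negative values are clipped to 0.
    values = {label: max(float(parsed.get(label, 0.0)), 0.0) for label in labels}
    total = sum(values.values())
    if total == 0:
        # Assumption: fall back to a uniform distribution when nothing usable was returned.
        return {label: 1.0 / len(labels) for label in labels}
    return {label: v / total for label, v in values.items()}


# Illustrative inputs, not real model outputs.
token_dist = distribution_from_logprobs({"A": -0.7, "B": -1.2, "C": -2.3})
verbal_dist = distribution_from_verbalized('{"A": 50, "B": 30, "C": 20}', ["A", "B", "C"])
print(token_dist)
print(verbal_dist)
```

Both functions return a dictionary with the same shape as `human_answer`, which keeps the later comparison step independent of the prompting method.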

### Example Usage:

**Running a local Hugging Face model (e.g., Mistral-7B-Instruct) on `SimBenchPop.pkl` using `token_prob`:**
```bash
python generate_answers.py \
    --input_file SimBenchPop.pkl \
    --output_file results/mistral_7b_instruct_token_prob_pop.pkl \
    --model_name mistralai/Mistral-7B-Instruct-v0.1 \
    --method token_prob
```

## Evaluation

After generating LLM responses, use `calculate_simbench_score.py` to compute evaluation metrics comparing the LLM response distributions to human response distributions.
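
As a generic illustration of comparing an LLM response distribution with a human one, the sketch below computes total variation distance. This metric is chosen here only as an example and is not necessarily the one implemented in `calculate_simbench_score.py`; the distributions are made-up values.

```python
def total_variation_distance(human: dict, model: dict) -> float:
    """Half the L1 distance between two distributions: 0 means identical, 1 means disjoint."""
    # Use the union of labels so an option missing from one side counts as probability 0.
    labels = set(human) | set(model)
    return 0.5 * sum(abs(human.get(k, 0.0) - model.get(k, 0.0)) for k in labels)


# Illustrative distributions, not real benchmark data.
human_answer = {"A": 0.25, "B": 0.50, "C": 0.25}
llm_answer = {"A": 0.40, "B": 0.40, "C": 0.20}
print(total_variation_distance(human_answer, llm_answer))
```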

## License

This benchmark framework is released under the CC BY-NC-SA 4.0 license. See the `LICENSE.md` file for the licensing terms of both the framework and the constituent datasets.

## Citation

If you use SimBench in your research, please cite:
```
[Citation information will be provided upon publication]
```