# GeoBias-20Q: Evaluating Geographic Bias in LLMs via the 20 Questions Game

This repository contains the codebase for our COLM 2025 submission:

> **“The World According to LLMs: How Geographic Origin Influences LLMs’ Entity Deduction Capabilities”**

We build an interactive deduction framework based on the 20 Questions game, where an LLM plays the role of the **guesser** to deduce a hidden **notable person** or **culturally salient object**. The focus is on assessing **geographic disparities** in reasoning, language robustness, and knowledge representation.

## Repository Structure

```
.
├── 20Q_notable_people.py        # Play 20Q with notable people
├── 20Q_things.py                # Play 20Q with culturally significant objects
├── eval_result.py               # Evaluate results after gameplay
├── game.py                      # Game engine logic
├── utils.py                     # Data loading, parsing utilities
├── prompts.json                 # Language-specific prompts
├── requirements.txt             # Python dependencies
└── data/
    ├── processed/
    │   ├── country-wise/
    │   │   ├── people.txt       # Contains 10 most popular notable people from each country
    │   │   └── things.txt       # Contains 10 most popular things from each country
    │   └── continent-wise/
    │       ├── people.txt       # Contains 100 most popular notable people from each continent
    │       └── things.txt       # Contains 100 most popular things from each continent
    └── raw/
        ├── people.csv           # Contains all notable people data
        └── things.csv           # Contains all things data
```

## Setup

### 1. Install Dependencies
```bash
pip install -r requirements.txt
```

### 2. Set API Keys (as environment variables)
Export the key for the provider whose model you plan to use:
```bash
export OPENAI_API_KEY=your_openai_key_here
export GEMINI_API_KEY=your_gemini_key_here
```
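
Before starting a long run, it can help to confirm that the keys are visible to Python. The following is a minimal sketch, not part of the repository; the helper name `missing_api_keys` is hypothetical, and only the two variable names come from the commands above.

```python
import os

# Variable names taken from the export commands above.
KEY_NAMES = ("OPENAI_API_KEY", "GEMINI_API_KEY")


def missing_api_keys(environ=None):
    """Return the names of API-key variables that are unset or empty.

    `environ` defaults to the real process environment; passing a dict
    makes the check easy to try out in isolation.
    """
    env = os.environ if environ is None else environ
    return [name for name in KEY_NAMES if not env.get(name)]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All API keys are set.")
```

Only the key for the provider you actually use needs to be present, so a missing entry is not necessarily an error.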

## Playing the 20 Questions Game

You can choose to run the game for **Notable People** or **Things**, and specify the desired **language** and **model**.

### Required Arguments:
- `--input`: Path to `.txt` file containing entity names
- `--guesser_model`: Name of the LLM to be used as a guesser (e.g., `gpt-4o-mini-2024-07-18`, `gemini-2.0-flash`, `meta-llama/Llama-3.3-70B-Instruct`)
- `--turns`: Maximum number of turns to be given to the guesser
- `--temp`: Sampling temperature for the model
- `--language`: One of the supported languages:  
  `english`, `hindi`, `mandarin`, `japanese`, `french`, `spanish`, `turkish`
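
For readers who want to see how these flags fit together, here is a rough `argparse` sketch. It is an illustration only and may differ from the parser in the actual scripts: the function name `build_parser`, the `int` and `float` types, and the choice to mark every flag as required are assumptions based on the list above.

```python
import argparse

# Language names copied from the list above.
LANGUAGES = [
    "english", "hindi", "mandarin", "japanese",
    "french", "spanish", "turkish",
]


def build_parser():
    """Build a parser mirroring the documented flags (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Play 20 Questions with an LLM guesser.")
    parser.add_argument("--input", required=True,
                        help="Path to a .txt file containing entity names")
    parser.add_argument("--guesser_model", required=True,
                        help="Name of the LLM used as the guesser")
    # The int/float types are assumptions; the README does not state them.
    parser.add_argument("--turns", type=int, required=True,
                        help="Maximum number of turns given to the guesser")
    parser.add_argument("--temp", type=float, required=True,
                        help="Sampling temperature")
    parser.add_argument("--language", choices=LANGUAGES, required=True,
                        help="Language in which the game is played")
    return parser


# Illustrative values only, not recommended settings.
args = build_parser().parse_args([
    "--input", "data/processed/country-wise/people.txt",
    "--guesser_model", "gemini-2.0-flash",
    "--turns", "20",
    "--temp", "0.0",
    "--language", "english",
])
print(args.guesser_model, args.turns, args.language)
```

Restricting `--language` with `choices` makes a misspelled language fail immediately instead of partway through a run.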

### Example: Notable People, country-wise, in English with Gemini
```bash
python 20Q_notable_people.py   --input data/processed/country-wise/people.txt   --guesser_model gemini-2.0-flash   --language english
```

### Example: Things, continent-wise, in Spanish with GPT
```bash
python 20Q_things.py   --input data/processed/continent-wise/things.txt   --guesser_model gpt-4o-mini-2024-07-18   --language spanish
```

### Output Structure

After running either script:

- A folder named **after the model** (e.g., `gemini-2.0-flash`) will be created inside the directory that contained the input `.txt` file.
- This folder will contain:
  - One `.txt` file for **each entity**.
  - Each file logs the complete **dialogue history** of that specific game session.
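
The layout described above can be located programmatically. The sketch below assumes the folder name is exactly the `--guesser_model` string. How the scripts handle model names containing a slash (such as `meta-llama/Llama-3.3-70B-Instruct`) is not documented here, so that case is left out. The helper names are hypothetical.

```python
from pathlib import Path


def transcript_dir(input_path, guesser_model):
    """Expected output folder: <directory of the input file>/<model name>."""
    return Path(input_path).parent / guesser_model


def list_transcripts(input_path, guesser_model):
    """Return the per-entity .txt transcripts, sorted by file name.

    Returns an empty list if the folder does not exist yet.
    """
    folder = transcript_dir(input_path, guesser_model)
    if not folder.is_dir():
        return []
    return sorted(folder.glob("*.txt"))


if __name__ == "__main__":
    files = list_transcripts("data/processed/country-wise/people.txt", "gemini-2.0-flash")
    print(f"Found {len(files)} transcript(s)")
```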

## Evaluating Results

After games are played, use `eval_result.py` to compute performance metrics.

### Example: Evaluating country-wise games played by gemini-2.0-flash in English
```bash
python eval_result.py   --dir data/processed/country-wise/gemini-2.0-flash   --language english
```

### What It Does:
- Reads all game transcripts in the specified `--dir`
- Computes metrics per entity
- Saves a CSV report (`metrics.csv`), containing the metrics for each entity, in the same directory
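
Once `metrics.csv` exists, an overall success rate can be derived from it. The sketch below is a guess at how that might look: the column name `success` and the accepted truthy spellings are assumptions, since the exact CSV header is not documented here, so check the file and adjust `column` accordingly.

```python
import csv

# Spellings treated as "success"; an assumption about how the CSV encodes it.
TRUTHY = {"1", "true", "yes"}


def load_rows(csv_path):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def success_rate(rows, column="success"):
    """Fraction of rows whose `column` value is truthy (0.0 for no rows)."""
    rows = list(rows)
    if not rows:
        return 0.0
    hits = sum(1 for row in rows if str(row[column]).strip().lower() in TRUTHY)
    return hits / len(rows)


# Usage (path follows the evaluation example above):
# rows = load_rows("data/processed/country-wise/gemini-2.0-flash/metrics.csv")
# print(f"Success rate: {success_rate(rows):.1%}")
```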

## Evaluation Metrics

Each entity is evaluated on the following:

- Success: Was the entity correctly guessed?
- Turns to Answer: Number of turns taken to guess successfully.
- Turns to Give Up: Number of turns before the model gives up (if it never guesses).

These are the same metrics reported in the paper.
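
To make the relationship between the three metrics concrete, here is a small illustrative sketch. It is not taken from `eval_result.py`: the `EntityMetrics` record, the `score_game` helper, and its inputs are all hypothetical, and it assumes a game ends in exactly one of three ways (a correct guess, an explicit give-up, or running out of turns).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntityMetrics:
    success: bool                     # Was the entity correctly guessed?
    turns_to_answer: Optional[int]    # Set only when the guess succeeded
    turns_to_give_up: Optional[int]   # Set only when the model gave up


def score_game(guessed_correctly: bool, gave_up: bool, turns_used: int) -> EntityMetrics:
    """Map the outcome of one game onto the three metrics (hypothetical)."""
    if guessed_correctly:
        return EntityMetrics(True, turns_used, None)
    if gave_up:
        return EntityMetrics(False, None, turns_used)
    # Neither guessed nor gave up: the turn budget ran out.
    return EntityMetrics(False, None, None)


print(score_game(guessed_correctly=True, gave_up=False, turns_used=7))
```

Keeping the two turn counts as separate optional fields avoids mixing successful and abandoned games when averaging.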

## Supported Languages

- English
- Hindi
- Mandarin
- Japanese
- French
- Spanish
- Turkish

