# Persona-Primed Language Model Evaluation

This repository contains the implementation for evaluating large language models using persona-priming across multiple domains. The project investigates how different expert personas (generic, historical, and modern) affect model performance on domain-specific tasks.

## Overview

The evaluation pipeline tests language models across four domains:
- **Mathematics**: Problem-solving tasks (GSM8K dataset)
- **Psychology**: Professional psychology questions (MMLU Psychology dataset)
- **Legal**: Legal reasoning tasks (BarExam QA dataset)

## Key Components

### Data Loading and Processing
- `load_data.py`: Loads and preprocesses datasets from HuggingFace for math,  psychology, and legal domains. Converts all datasets to a unified JSONL format with standardized multiple-choice structure.

### Prompt Construction
- `build_prompt.py`: Creates different prompt variants for each dataset item:
  - **Baseline prompts**: Domain-agnostic prompts
  - **Primed prompts**: Domain-specific instruction prompts
  - **Persona prompts**: Expert persona-based prompts (generic experts, historical figures, modern experts)

### Model Inference
- `infer_gemini.py`: Handles inference using OpenRouter API to access various language models. Supports both chain-of-thought and direct answer modes.
- `infer_cross_domain.py`: Conducts cross-domain evaluation experiments
- `infer_negation.py`: Tests model performance on negation tasks

### Evaluation
- `evaluate_accuracy.py`: Evaluates model accuracy for multiple-choice tasks
- `evaluate_math_accuracy.py`: Specialized evaluation for mathematical problem-solving tasks (handles numeric answers)

### Additional Tools
- `build_negation_prompt.py`: Creates prompts for negation experiments
- `build_own_persona.py`: Utility for creating custom persona prompts

## Usage

### 1. Load and Process Dataset
```bash
# Load mathematics dataset
python load_data.py --dataset gsm8k --out data/math/dataset/gsm8k.jsonl

# Load psychology dataset
python load_data.py --dataset mmlu_psychology --out data/psychology/dataset/mmlu_psychology.jsonl

# Load legal dataset
python load_data.py --dataset barexam_qa --out data/legal/dataset/barexam_qa.jsonl
```

### 2. Build Prompts
```bash
# Create prompt variants for each dataset
python build_prompt.py --in data/math/dataset/gsm8k.jsonl --out data/math/prompts/prompts.jsonl
```

### 3. Run Inference
```bash
# Run inference with chain-of-thought reasoning
python infer_gemini.py --in data/math/prompts/prompts.jsonl --out results/math --cot

# Run inference without chain-of-thought
python infer_gemini.py --in data/math/prompts/prompts.jsonl --out results/math

# Run inference on specific domain
python infer_gemini.py --domain math --cot
```

### 4. Evaluate Results
```bash
# Evaluate multiple-choice tasks
python evaluate_accuracy.py --results results/legal/generations.jsonl

# Evaluate mathematical problem-solving
python evaluate_math_accuracy.py --results results/math/generations.jsonl
```

## Requirements

- Python 3.7+
- Required packages: `datasets`, `openai`, `tenacity`, `tqdm`, `pathlib`
- OpenRouter API key for model access

## Dataset Availability

Due to size constraints, the processed datasets are not included in this repository. The datasets will be made publicly available upon publication of the associated research paper. Raw datasets are available through HuggingFace:
- GSM8K: `gsm8k`
- MMLU Psychology: `cais/mmlu` (professional_psychology subset)
- BarExam QA: `reglab/barexam_qa`

