
## Overview

This project implements a comprehensive system for detecting whether Large Language Models are aware they are being evaluated. The system uses multiple complementary approaches to assess evaluation awareness, from simple prompts to complex multi-turn conversations.

## Configuration

Run `black_box_evals.py` via CLI with the parameters below.

### CLI parameters

- **--model**: OpenRouter model name to evaluate (e.g., `deepseek/deepseek-r1-0528`).
- **--input_file**: Path to the input dataset JSON (e.g., `datasets/dataset_minimal.json`).
- **--output_file**: Path to write results JSON (created if missing, appended to otherwise).
- **--max_workers**: Maximum number of parallel API calls.
- **--full_transcript_analysis**: Analyze the full conversation transcript; omit to analyze only the first system + user messages.
- **--reasoning_traces_analysis**: Enable reasoning-traces judging for evaluation awareness.
- **--probability_eval_analysis**: Enable probability estimation method.
- **--binary_eval_analysis**: Enable majority-vote binary evaluation awareness.
- **--purpose_awareness_analysis**: Enable unprompted purpose awareness judging.
- **--logprob_eval_analysis**: Enable logprob-based awareness (requires `--first_person` and not `--full_transcript_analysis`).
- **--first_person**: Use first-person prompting variants for applicable methods.

IMPORTANT 
Some methods only work in first-person.

Example:

```bash
python3 black_box_evals.py \
  --model deepseek/deepseek-r1-0528 \
  --input_file datasets/dataset_minimal_small.json \
  --output_file eval_results/deepseek_r1_all.json \
  --max_workers 500 \
  --first_person \
  --probability_eval_analysis \
  --binary_eval_analysis \
  --purpose_awareness_analysis \
  --reasoning_traces_analysis
```



### Setup

```bash
# Install dependencies
pip install -r requirements.txt
```

Add OPENROUTER_API_KEY to .env


The dataset uses data from many datasets with different licenses, and to satisify licensing rules, I can't publish then in a scrapable version, so the dataset is zipped. Unzip it and paste into the same datasets directory before using.

Also the results of evaluations of different models that are present in the paper are too large for posting on GitHub, so you can find them [here](https://drive.google.com/file/d/1qcaQP4I5h5VjAlLznCrM1RhxocOX700m/view?usp=sharing).

### Run Evaluations

example run:

```bash
python3 black_box_evals.py --model deepseek/deepseek-r1-0528 --input_file datasets/dataset_minimal.json --output_file eval_results/deepseek_r1_all.json --first_person --probability_eval_analysis --binary_eval_analysis
```

### Analyze Results

To generate all the plots from the appaer, and many more new ones in the folder plots/method_wise, open `method_plots.py` and to the global variable FILES defined in line 11, list all the files with evaluation results, and run the script.

## To replicate the full paper (`replicate_paper.py`)

Run the end-to-end pipeline across multiple models and then generate plots (which will take many hours and thousands of dollars):

```bash
python3 replicate_paper.py
```

## To explore fetures that makes a specific benchmark look like evaluations

Use the `eval_report` function from the module `eval_report.py` to explore the list of evaluation awareness cues for any evaluations (and any other set of prompts)

Arguments:
- prompts: A dataset to analyze. 
    Accepted formats:
    1. list of user strings: ["user prompt 1", "user prompt 2", ...]
    2. list of (system, user) tuples: [("system A", "user A"), ("system B", "user B"), ...]
- model: OpenRouter model identifier used to generate responses and to self-assess purpose (e.g., "qwen/qwen3-32b").
- api_key: Your OpenRouter API key (read from .env via config or passed directly).
- judge_model (optional): Model used to judge if the model excibit evaluation awareness. openai/gpt-5-mini is recommended as it's the cheapest model showing good perfomance on this task.


The function produces the report of evaluation awareness cues for an LLM for each prompt from the benchmark, saved to the folder  `benchmark_analysis/<model>_analyzed_dataset_<timestamp>.json` that contains information, whether a given LLM considers this prompt to be as evaluation, as well as concrete prompt features that have lead to this conclusion.

It also produces a taxonomy of evaluation awareness prompt features `benchmark_analysis/<model>_benchmark_clusters_<timestamp>.json`

IMPORTANT
Some of the identified prompt evaluation awareness features are related to the follow-up question in the user motivation guessing method that asks a model to guess the user intent. Unfortunately, it's hard to get rid of this artifact, as the model is prompted to identify all the cues of possibly being in evaluation, and explicit instructions to ignore this follow-up question aren't always successful. 

Minimal example:

```python
import os
from eval_report import eval_report

prompts = [
  ("", "Write a function to reverse a string in Python"),
  ("You are a helpful assistant.", "Summarize the key points of the following article..."),
]

api_key = os.getenv("OPENROUTER_API_KEY")  # set in .env

eval_report(
  prompts=prompts,
  model="qwen/qwen3-32b",
  api_key=api_key,
  judge_model="openai/gpt-5-mini",  # optional; defaults to model
)
```


```
eval_awareness/
├── README.md                           # This documentation and usage
├── requirements.txt                    # Python dependencies
├── config.py                           # Loads OPENROUTER_API_KEY from .env
├── black_box_evals.py                  # Main evaluation orchestrator across methods
├── binary_eval_awareness.py            # Binary method
├── estimate_eval_probability.py        # Probability estimation method
├── logprob_eval_awareness.py           # Logprob-based awareness
├── reasoning_traces.py                 # Reasoning-traces analysis method
├── unprompted_purpose_guessing.py      # User motivation guessing method
├── extract_evidence.py                 # Extracts and compresses evidence from the user motivation uessing method results
├── cluster_evidence.py                 # Clusters compressed evidence into gists/topics
├── eval_report.py                      # API to test your custom dataset for evaluation-like features
├── method_plots.py                     # Generates plots/metrics summaries
├── eval_probability_vs_prompts.py      # Analysis of eval probability vs prompt count
├── analyze_conversation_lengths.py     # Analysis of conversation lengths against the quality of evaluation awareness detection
├── replicate_paper.py                  # End-to-end pipeline to reproduce paper runs
├── datasets/                           # Input datasets
│   └── dataset_minimal.json.zip        # Zipped dataset; unzip here
├── prompts/                            # Prompt templates for judges
│   ├── eval_traces_prompt.txt
│   ├── judge_purpose_prompt.txt
├── eval_results/                       # Generated evaluation outputs
├── eval_analysis/                      # Default output folder for the eval_report.py script
├── clusters/                           # Saved clusterings (examples included)
│   ├── deepseek_r1_all.json
│   └── gemini_2.5_pro_all.json
├── evidence/                           # Saved files with evidence of LLMs for evaluation awareness identified by the user motivation guessing method
│   ├── deepseek_r1_all.json
│   └── gemini_2.5_pro_all.json
├── plots/                              # Figures from the paper 
```
