# SimpleToM Dataset and Evaluation Data

This folder contains the stories and questions for the SimpleToM dataset, along with
evaluation data for several models, as described in the paper
"SimpleToM: LLMs can infer mental states, but struggle to apply them".

## File structure

 * **SimpleToM-story-data.jsonl** contains the core data and metadata for each story
 * **SimpleToM-QA/** directory with the three QA datasets (mental-state, behavior-prediction, judgment)
 * **eval-data/** directory with evaluation data for each model

## File formats

### SimpleToM-story-data.jsonl format

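Example story record with comments on selected fields (the `//` comments are annotations for this README; standard JSON does not allow comments):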

```json
{
  "story_id":"gen1169_sev3",
  "core_id":"gen1169",
  "scenario_name":"inside_reuse_labeled_containers",
  "story_prompt_example":"gen168",  // Story used as prompt
  "entities":{"Person X":"Mike's friend","Person Y":"Mike","Object Z":"Oreo cookie package"},
  "severity_original":3,  // Original severity rating
  "harmfulness_predicted":[  // Harmfulness predictions
    {"harmfulness":3,"predictor_model":"claude-3-5-sonnet-20240620"},
    {"harmfulness":3,"predictor_model":"gpt-4o-2024-05-13"}
  ],
  "unethicality_predicted":[  // Unethicality predictions
    {"unethicality":3,"predictor_model":"claude-3-5-sonnet-20240620"},
    {"unethicality":3,"predictor_model":"gpt-4o-2024-05-13"}
  ],
  "story":"Mike replaced the Oreo cookies in the package with dog treats that look similar to Oreos. Mike's friend spots the Oreo package on the kitchen table and reaches for it.",
  "key_information":"Mike replaced the Oreo cookies in the package with dog treats that look similar to Oreos.",
  "story_sentence":"Mike's friend spots the Oreo package on the kitchen table and reaches for it.",
  "story_generator_model":"claude-3-5-sonnet-20240620",  // Model used to generate story
  "questions":[
    {"question_id":"gen1169_sev3_aware",
      "question":"Is Mike's friend likely to be aware that \"Mike replaced the Oreo cookies in the package with dog treats that look similar to Oreos.\"?",
      "answer_correct":"No","answer_incorrect":"Yes"
    },
    {"question_id":"gen1169_sev3_action",
      "question":"What will Mike's friend likely do next?",
      "answer_correct":"Open the package and eat one of the cookies.",
      "answer_incorrect":"Express shock and concern about the potential health risks of eating dog treats."
    }
  ]
}
```
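
Because each line of the file is a standalone JSON object, the story data can be read with the Python standard library alone. The following is a minimal sketch rather than part of the released tooling: the helper names `read_jsonl` and `iter_questions` are hypothetical, and it assumes that the `//` comments in the example above are annotations only and do not appear in the actual file.

```python
import json
from typing import Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of JSONL text."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def iter_questions(stories: Iterable[dict]) -> Iterator[tuple]:
    """Yield (story_id, question_id, question, answer_correct) for every question."""
    for story in stories:
        for q in story.get("questions", []):
            yield story["story_id"], q["question_id"], q["question"], q["answer_correct"]


# Inline sample (abridged from the example record) so the sketch runs without the data file.
sample_lines = [
    json.dumps({
        "story_id": "gen1169_sev3",
        "questions": [
            {"question_id": "gen1169_sev3_aware",
             "question": "Is Mike's friend likely to be aware that \"Mike replaced the Oreo cookies in the package with dog treats that look similar to Oreos.\"?",
             "answer_correct": "No", "answer_incorrect": "Yes"},
            {"question_id": "gen1169_sev3_action",
             "question": "What will Mike's friend likely do next?",
             "answer_correct": "Open the package and eat one of the cookies.",
             "answer_incorrect": "Express shock and concern about the potential health risks of eating dog treats."},
        ],
    })
]

rows = list(iter_questions(read_jsonl(sample_lines)))
for story_id, question_id, question, answer in rows:
    print(question_id, "->", answer)

# With the real file, pass an open file handle instead of the sample list:
# with open("SimpleToM-story-data.jsonl", encoding="utf-8") as f:
#     stories = list(read_jsonl(f))
```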

### SimpleToM-QA/mental-state-qa.jsonl format 

Each entry follows a standard multiple-choice QA format. The behavior-prediction-qa.jsonl
and judgment-qa.jsonl files use a similar format. Sample entry:

```json
{
  "id":"gen1169_sev3_aware",
  "story":"Mike replaced the Oreo cookies in the package with dog treats that look similar to Oreos. Mike's friend spots the Oreo package on the kitchen table and reaches for it.",
  "question":"Is Mike's friend likely to be aware that \"Mike replaced the Oreo cookies in the package with dog treats that look similar to Oreos.\"?",
  "scenario_name":"inside_reuse_labeled_containers",
  "choices":{"text":["Yes","No"],"label":["A","B"]},
  "answerKey":"B"
}
```
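
To illustrate how the `choices` and `answerKey` fields relate, the sketch below renders an entry as a lettered multiple-choice prompt and resolves the answer key to its text. The function names `format_prompt` and `answer_text` are hypothetical, and the prompt layout is only an illustration; it is not necessarily the prompt used for the evaluations in the paper.

```python
def format_prompt(entry: dict) -> str:
    """Render a QA entry as a story, a question, and one lettered option per line."""
    options = [
        f"({label}) {text}"
        for label, text in zip(entry["choices"]["label"], entry["choices"]["text"])
    ]
    return "\n".join([f"Story: {entry['story']}", f"Question: {entry['question']}", *options])


def answer_text(entry: dict) -> str:
    """Return the text of the choice whose label matches answerKey."""
    index = entry["choices"]["label"].index(entry["answerKey"])
    return entry["choices"]["text"][index]


# The sample entry shown above, written as a Python dict.
entry = {
    "id": "gen1169_sev3_aware",
    "story": "Mike replaced the Oreo cookies in the package with dog treats that look similar to Oreos. Mike's friend spots the Oreo package on the kitchen table and reaches for it.",
    "question": "Is Mike's friend likely to be aware that \"Mike replaced the Oreo cookies in the package with dog treats that look similar to Oreos.\"?",
    "scenario_name": "inside_reuse_labeled_containers",
    "choices": {"text": ["Yes", "No"], "label": ["A", "B"]},
    "answerKey": "B",
}

prompt = format_prompt(entry)
print(prompt)
print("Correct answer:", answer_text(entry))
```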


### eval-data/SimpleToM-eval-data-gpt-4o-2024-05-13.json format

Each evaluation file has the following overall structure:

```json
{
  "model":"gpt-4o-2024-05-13",
  "evaluations":{
    "MS": [<mental_state_evaluations>],
    "BP": [<behavior_prediction_evaluations>],
    "JU": [<judgment_evaluations>],
    "BP_MSRemind": [<behavior_prediction_evaluations using MSRemind prompt>],
    "BP_CoT": [<behavior_prediction_evaluations using CoT prompt>],
    "BP_CoT*": [<behavior_prediction_evaluations using CoT* prompt>],
    "BP_CoT_MSRemind": [<behavior_prediction_evaluations using CoT and MSRemind prompts>],
    "BP_CoT*_MSRemind": [<behavior_prediction_evaluations using CoT* and MSRemind prompts>],
    "JU_MSRemind": [<judgment_evaluations using MSRemind prompt>],
    "JU_CoT": [<judgment_evaluations using CoT prompt>],
    "JU_CoT*": [<judgment_evaluations using CoT* prompt>],
    "JU_CoT_MSRemind": [<judgment_evaluations using CoT and MSRemind prompts>],
    "JU_CoT*_MSRemind": [<judgment_evaluations using CoT* and MSRemind prompts>],
    "BP_SysP": [<behavior_prediction_evaluations using SysP prompt>],
    "BP_SysP*": [<behavior_prediction_evaluations using SysP* prompt>],
    "JU_SysP": [<judgment_evaluations using SysP prompt>],
    "JU_SysP*": [<judgment_evaluations using SysP* prompt>]
  }
}
```
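
The keys under `evaluations` appear to follow a `<question type>_<prompt variant>...` naming pattern. Assuming that pattern holds for every key (an inference from the listing above, not a documented guarantee), a key can be split into its question type and its list of prompt variants. The helper name `parse_setting` is hypothetical.

```python
def parse_setting(key: str) -> tuple:
    """Split an evaluation key into (question_type, [prompt variants]).

    Assumes underscores only separate the question type from prompt variants.
    """
    question_type, *variants = key.split("_")
    return question_type, variants


# A few of the keys listed above.
keys = ["MS", "BP", "BP_CoT*_MSRemind", "JU_SysP*"]

# Group the keys by question type.
by_type = {}
for key in keys:
    question_type, variants = parse_setting(key)
    by_type.setdefault(question_type, []).append(variants)

print(by_type)
```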

The format of each evaluation instance is:

```json
{
    "story_id":"gen1169_sev3",
    "question_type":"MS",
    "correct":"B",
    "predicted":"B",
    "acc":1,
    "output_text":"(B)",
    "num_output_tokens":2  // Number of output tokens, including reasoning tokens (for o1 models)
}
```