Thousand Voices of Trauma: A Large-Scale Synthetic Dataset for Modeling Prolonged Exposure Therapy Conversations

Published: 18 Sept 2025, Last Modified: 30 Oct 2025NeurIPS 2025 Datasets and Benchmarks Track spotlightEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Synthetic Clinical Dialogues, Mental Health Data Generation
TL;DR: Our paper introduces "Thousand Voices of Trauma," a large-scale synthetic dataset of simulated PTSD therapy conversations designed to overcome data limitations and accelerate AI research in mental healthcare.
Abstract: The advancement of AI systems for mental health support is hindered by limited access to therapeutic conversation data, particularly for trauma treatment. We present Thousand Voices of Trauma, a synthetic benchmark dataset of 3,000 therapy conversations based on Prolonged Exposure therapy protocols for Post-traumatic Stress Disorder (PTSD). The dataset comprises 500 unique cases, each explored through six conversational perspectives that mirror the progression of therapy from initial anxiety to peak distress to emotional processing. We incorporated diverse demographic profiles (ages 18-80, M=49.3, 49.4\% male, 44.4\% female, 6.2\% non-binary), 20 trauma types, and 10 trauma-related behaviors using deterministic and probabilistic generation methods. Analysis reveals realistic distributions of trauma types (witnessing violence 10.6\%, bullying 10.2\%) and symptoms (nightmares 23.4\%, substance abuse 20.8\%). Clinical experts validated the dataset's therapeutic fidelity, highlighting its emotional depth while suggesting refinements for greater authenticity. We also developed an emotional trajectory benchmark with standardized metrics for evaluating model responses. This privacy-preserving dataset addresses critical gaps in trauma-focused mental health data, offering a valuable resource for advancing both patient-facing applications and clinician training tools.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/yenopoya/thousand-voices-trauma
Code URL: https://huggingface.co/datasets/yenopoya/thousand-voices-trauma
Supplementary Material: zip
Primary Area: AL/ML Datasets & Benchmarks for health sciences (e.g. climate, health, life sciences, physics, social sciences)
Flagged For Ethics Review: true
Submission Number: 2195
Loading