---
license: mit
language:
- en
size_categories:
- n<1K
---

# Dataset Name
Privacy Risk Estimation – Synthetic User Posts

## Description
This dataset contains 50 synthetic English-language user posts intended for evaluating privacy risk estimation models. Each post simulates a plausible user-generated message that might contain varying degrees of sensitive information.

## Format
- 50 files in `.xlsx` format
- Each file contains one user post, annotated across multiple attempts.
- Columns:
  - `Subreddit`: subreddit name of the user post
  - `Post`: synthetic user post text
  - `Disclosure (Ordering)`: ordered list of personal disclosures mentioned in the post
  - `Disclosure Category`: corresponding category for each disclosure
  - `Query`: generated query for each disclosure
  - `ID`: numerical id corresponding to each row in the spreadsheet
  - `Conditionally Independent?`: conditional independence of each disclosure with respect to the prior disclosures (earlier rows) in the spreadsheet, given as a list of row `ID` values
  - `Source Used`: source the human annotator used to answer each row's query, mainly ChatGPT results
  - `Succeeded?`: whether the human annotator succeeded in finding an answer
  - `Reliable?`: whether the found answer is from a reliable source
  - `Annotator Confidence`: the human annotator's confidence in the answer to the query and in its judged reliability
  - `Value`: the answer to each query
  - `Ground Truth`: the actual ground truth answer found from surveys, census data, or other records
  - `Ground Truth Source`: the website or domain that has the ground truth answer
  - `Ground Truth Reliability`: the reliability of the ground truth source
  - `Disclosure Type`: the type of disclosure, determined by the availability of the ground truth (e.g., feasibly answered, important but cannot be answered, unimportant)
  - `ChatGPT Response`: ChatGPT's response to each query
  - `Operation`: the arithmetic operation needed to reconstruct the final privacy risk estimate top-down
  - `Parentheses`: BIO labels indicating whether each row's answer falls inside a parenthesized term, used to preserve the order of operations
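
The `Conditionally Independent?` column stores a list of row IDs, so it usually needs parsing after the spreadsheet is loaded. The sketch below is one possible way to do that and is not part of the dataset tooling. It assumes the cell holds digits separated by commas, spaces, or brackets (the exact serialization is not documented here), and that `ID` values increase with row order. The function names `parse_id_list` and `refers_only_to_prior_rows` are hypothetical.

```python
import re


def parse_id_list(cell):
    """Parse a `Conditionally Independent?` cell into a list of integer row IDs.

    Assumption: IDs appear as digits separated by commas, spaces, or brackets,
    e.g. "[1, 3]" or "1,3". Empty or missing cells are treated as "no IDs".
    """
    if cell is None:
        return []
    if isinstance(cell, float) and cell != cell:
        # NaN is how spreadsheet readers commonly represent an empty cell.
        return []
    if isinstance(cell, (int, float)):
        # A lone ID may be read back from a spreadsheet as a number.
        return [int(cell)]
    return [int(token) for token in re.findall(r"\d+", str(cell))]


def refers_only_to_prior_rows(row_id, ids):
    """Check that every listed ID belongs to an earlier row.

    Assumption: `ID` values increase with row order, so "prior" means "smaller".
    """
    return all(other < row_id for other in ids)


if __name__ == "__main__":
    ids = parse_id_list("[1, 3]")
    print(ids, refers_only_to_prior_rows(4, ids))
```

A check like `refers_only_to_prior_rows` can flag rows whose list points at a later row, which would contradict the column's definition.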

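The `Operation` and `Parentheses` columns together describe how per-row values combine into a final estimate. The following is a minimal sketch of one way such a reconstruction could work, not the authors' procedure. It assumes that each row's `Operation` is one of `+`, `-`, `*`, `/` and joins that row's `Value` to the preceding rows, and that the `Parentheses` tags are `B` (first row of a parenthesized term), `I` (continues the term), and `O` (outside any term). The actual encodings in the files may differ. The function names and the sample rows are hypothetical.

```python
import ast
import operator

# Binary operators the sketch is willing to evaluate.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def build_expression(rows):
    """Turn (value, operation, bio_tag) rows into an arithmetic expression string.

    The operation of the first row is ignored because nothing precedes it.
    """
    parts = []
    group_open = False
    for index, (value, op, tag) in enumerate(rows):
        if tag not in ("B", "I", "O"):
            raise ValueError(f"unknown BIO tag: {tag!r}")
        if tag == "I" and not group_open:
            raise ValueError("'I' tag without a preceding 'B' tag")
        if group_open and tag != "I":
            # The previous parenthesized term ends before this row.
            parts.append(")")
            group_open = False
        if index > 0:
            if op not in ("+", "-", "*", "/"):
                raise ValueError(f"unsupported operation: {op!r}")
            parts.append(f" {op} ")
        if tag == "B":
            parts.append("(")
            group_open = True
        parts.append(repr(value))
    if group_open:
        parts.append(")")
    return "".join(parts)


def evaluate(expression):
    """Evaluate an expression made only of numbers, + - * /, and parentheses."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")

    return walk(ast.parse(expression, mode="eval"))


if __name__ == "__main__":
    # Hypothetical rows: (Value, Operation, Parentheses)
    sample_rows = [(1000, "*", "O"), (0.5, "*", "B"), (0.25, "+", "I"), (0.5, "*", "O")]
    expression = build_expression(sample_rows)
    print(expression, "=", evaluate(expression))
```

Building a string first keeps the grouping visible for inspection, and walking the parsed tree avoids calling `eval` on spreadsheet content.
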
## Language
- English (`en`)

## License
- MIT License (https://opensource.org/licenses/MIT)

## Author
- Anonymous
- Created: 2025-01-31

## Citation
If you use this dataset, please cite: To be provided upon publication.