
# Dataset Card for 🧑‍⚖️SORRY-Bench Human Judgment Dataset

This dataset contains **7K annotations of human safety judgments** for LLM responses to the unsafe instructions in our SORRY-Bench dataset.
Specifically, for each of the 440 unsafe instructions in SORRY-Bench, we annotate 16 diverse model responses (both in-distribution, ID, and out-of-distribution, OOD) as either a "*fulfillment*" (score=1) of, or a "*refusal*" (score=0) to, that unsafe instruction.
We split these 440 * 16 = 7040 records ([human_ratings.jsonl](human_rating.jsonl)) into:

- A train split ([train.jsonl](train.jsonl)): 2640 records, reserved for boosting the accuracy of automated safety evaluators via fine-tuning (e.g., we fine-tune Mistral-7B-Instruct-v0.2 on these data to obtain our 🤖judge LLM) or few-shot prompting;
- A test split ([test.jsonl](test.jsonl)): 4400 records, intended for evaluating the agreement between automated safety evaluators and human annotators.
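
As a quick sanity check on the splits described above, the following is a minimal sketch that parses JSON Lines records and tallies refusals (0) and fulfillments (1). The field name `human_score` is a hypothetical placeholder, not something this card specifies, so inspect the actual files for the real key before relying on it.

```python
import json
from collections import Counter

# Split sizes as stated in this card (2640 + 4400 = 7040 = 440 * 16).
EXPECTED_SIZES = {"train.jsonl": 2640, "test.jsonl": 4400}


def parse_jsonl(lines):
    """Parse an iterable of JSON Lines strings into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def label_counts(records, score_key="human_score"):
    """Count records per label: 0 = refusal, 1 = fulfillment.

    `score_key` is an assumed field name; replace it with the key
    actually used in the dataset files.
    """
    return Counter(int(record[score_key]) for record in records)


def check_split(path, expected_size):
    """Load one split from disk and report whether its size matches the card."""
    with open(path, encoding="utf-8") as f:
        records = parse_jsonl(f)
    return len(records) == expected_size, records
```

For example, `check_split("train.jsonl", EXPECTED_SIZES["train.jsonl"])` would return a flag indicating whether the file holds the 2640 records stated above, together with the parsed records.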

We use this dataset for meta-evaluation to compare different design choices of automated safety evaluators (results shown below).
Refer to our paper for more details.
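
To illustrate what measuring agreement between an automated evaluator and human annotators can look like on binary labels, here is a minimal sketch that computes the raw agreement rate and Cohen's kappa. This is one common way to quantify agreement and is shown only as an assumption; the paper specifies the metrics actually reported. The example labels are made up for illustration.

```python
def agreement_rate(human, judge):
    """Fraction of records on which the two label lists agree."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohen_kappa(human, judge):
    """Cohen's kappa for binary labels (0 = refusal, 1 = fulfillment)."""
    n = len(human)
    p_observed = agreement_rate(human, judge)
    # Chance agreement from each rater's marginal rate of label 1.
    p_human_1 = sum(human) / n
    p_judge_1 = sum(judge) / n
    p_expected = p_human_1 * p_judge_1 + (1 - p_human_1) * (1 - p_judge_1)
    if p_expected == 1.0:
        # Both raters always give the same single label: kappa is undefined.
        return float("nan")
    return (p_observed - p_expected) / (1 - p_expected)


# Made-up labels, for illustration only.
human_labels = [1, 1, 0, 0]
judge_labels = [1, 0, 0, 0]
print(agreement_rate(human_labels, judge_labels))
print(cohen_kappa(human_labels, judge_labels))
```

Kappa corrects the raw agreement rate for agreement expected by chance, which matters when one label (for example, refusal) dominates the data.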


## SORRY-Bench Human Judgment Dataset License Agreement

This Agreement contains the terms and conditions that govern your access and use of the SORRY-Bench-Human-Judgment Dataset (as defined above). You may not use the SORRY-Bench-Human-Judgment Dataset if you do not accept this Agreement. By clicking to accept, accessing the SORRY-Bench-Human-Judgment Dataset, or both, you hereby agree to the terms of the Agreement. If you are agreeing to be bound by the Agreement on behalf of your employer or another entity, you represent and warrant that you have full legal authority to bind your employer or such entity to this Agreement. If you do not have the requisite authority, you may not accept the Agreement or access the SORRY-Bench-Human-Judgment Dataset on behalf of your employer or another entity.

* Safety and Moderation: **This dataset contains unsafe conversations or prompts that may be perceived as offensive or unsettling.** Users may not use this dataset for training machine learning models for any harmful purpose. The dataset may not be used to generate content in violation of any law. These prompts should not be used as inputs to models that can generate modalities outside of text (including, but not limited to, images, audio, video, or 3D models).
* Non-Endorsement: The views and opinions depicted in this dataset **do not reflect** the perspectives of the researchers or affiliated institutions engaged in the data collection process.
* Legal Compliance: You must use this dataset in compliance with all applicable laws and regulations.
* Model Specific Terms: When leveraging direct outputs of a specific model, users must adhere to its **corresponding terms of use and relevant legal standards**.
* Non-Identification: You **must not** attempt to determine the identities of individuals or infer any sensitive personal data contained in this dataset.
* Prohibited Transfers: You **should not** distribute, copy, disclose, assign, sublicense, embed, host, or otherwise transfer the dataset to any third party.
* Right to Request Deletion: At any time, we may require you to delete all copies of the SORRY-Bench-Human-Judgment Dataset (in whole or in part) in your possession and control. You will promptly comply with any and all such requests. Upon our request, you shall provide us with written confirmation of your compliance with such requirement.
* Termination: We may, at any time, for any reason or for no reason, terminate this Agreement, effective immediately upon notice to you. Upon termination, the license granted to you hereunder will immediately terminate, and you will immediately stop using the SORRY-Bench-Human-Judgment Dataset and destroy all copies of the SORRY-Bench-Human-Judgment Dataset and related materials in your possession or control.
* Limitation of Liability: IN NO EVENT WILL WE BE LIABLE FOR ANY CONSEQUENTIAL, INCIDENTAL, EXEMPLARY, PUNITIVE, SPECIAL, OR INDIRECT DAMAGES (INCLUDING DAMAGES FOR LOSS OF PROFITS, BUSINESS INTERRUPTION, OR LOSS OF INFORMATION) ARISING OUT OF OR RELATING TO THIS AGREEMENT OR ITS SUBJECT MATTER, EVEN IF WE HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

Subject to your compliance with the terms and conditions of this Agreement, we grant you a limited, non-exclusive, non-transferable, non-sublicensable license to use the SORRY-Bench-Human-Judgment Dataset, including the conversation data and annotations, to research and evaluate software, algorithms, machine learning models, techniques, and technologies for both research and commercial purposes.

