Abstract: Understanding what constitutes safe text is an important issue in natural language processing, and failures here can prevent the deployment of models deemed harmful and unsafe. One type of safety that has scarcely been studied is commonsense physical safety, i.e., text that is not explicitly violent but requires additional commonsense knowledge to recognize that following it leads to physical harm. We create the first benchmark dataset, SAFETEXT, comprising real-life scenarios paired with safe and physically unsafe pieces of advice. We use SAFETEXT to empirically study commonsense physical safety across various models designed for text generation and commonsense reasoning tasks. We find that state-of-the-art large language models are susceptible to generating unsafe text and have difficulty rejecting unsafe advice. As a result, we argue for further studies of safety and for assessing commonsense physical safety in models before release.