Safe and Sound: Evaluating Language Models for Bias Mitigation and Understanding

Published: 12 Oct 2024, Last Modified: 14 Nov 2024 · SafeGenAi Poster · CC BY 4.0
Keywords: bias, fairness, LLMs
TL;DR: Fine-tuning already-safe LLMs on unsafe content paired with safer alternatives yields safer, less biased outputs without sacrificing language understanding.
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in Natural Language Processing (NLP) tasks, but they often generate text that perpetuates societal biases and produces unsafe content. While existing approaches to mitigate these issues have shown some success, they frequently come at the cost of reduced knowledge retention and language understanding. This study investigates a method to produce safe, unbiased outputs from LLMs without compromising their core capabilities. To address this challenge, we trained already-safe LLMs on a specialized dataset containing examples of unsafe content paired with safer alternatives. Our results demonstrate that this approach enhances the model's ability to generate safe content while maintaining its language understanding capabilities. The findings of this study have significant implications for the development of more responsible and ethical AI systems. To promote transparency and facilitate further research in this area, we have made our code and dataset publicly available on GitHub at https://github.com/llm-work/safe-llm.
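
The sketch below illustrates one plausible reading of the training setup described in the abstract: supervised fine-tuning of a base model on examples where unsafe text is paired with a safer alternative. The base model ("gpt2"), the field names ("unsafe", "safe"), the prompt template, and the use of the Hugging Face Trainer are all assumptions for illustration; the paper's actual dataset format and training recipe are defined in the linked repository.

```python
# Minimal sketch (assumptions noted above): fine-tune a causal LM on pairs of
# unsafe text and safer rewrites so the model learns to prefer the safer form.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_NAME = "gpt2"  # placeholder; the paper starts from already-safe LLMs

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Toy paired examples; the real dataset pairs unsafe content with safer alternatives.
pairs = [
    {"unsafe": "Women are bad at math.",
     "safe": "Mathematical ability is not determined by gender."},
    {"unsafe": "People from that country are lazy.",
     "safe": "Work ethic varies by individual, not by nationality."},
]

def to_training_text(example):
    # Concatenate each unsafe example with its safer alternative so the model
    # is trained to map biased phrasings onto unbiased ones.
    text = (f"Unsafe: {example['unsafe']}\n"
            f"Safer: {example['safe']}{tokenizer.eos_token}")
    return tokenizer(text, truncation=True, max_length=128)

dataset = Dataset.from_list(pairs).map(to_training_text,
                                       remove_columns=["unsafe", "safe"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="safe-llm-sketch",
                           per_device_train_batch_size=2,
                           num_train_epochs=1),
    train_dataset=dataset,
    # mlm=False builds next-token-prediction labels from the input ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Framing the pairs as "unsafe input, safer target" keeps the objective a standard language-modeling loss, which is consistent with the abstract's claim that safety training need not degrade general language understanding.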
Submission Number: 53