Track: Tiny Paper Track (between 2 and 4 pages)
Keywords: rainbow teaming, safety, llms, red teaming
TL;DR: We reproduce and extend the Rainbow Teaming methodology to the Polish language and test it on several LLMs
Abstract: The development of multilingual large language models (LLMs) presents challenges in evaluating their safety across all supported languages. Enhancing safety in one language (e.g., English) may inadvertently introduce vulnerabilities in others. To address this issue, we propose a methodology for the automatic creation of red-teaming datasets for safety evaluation, categorizing them by risk type and attack style. We apply our methodology to the Polish language, highlighting the disparity in safe-output generation between a focus on English and a focus on Polish.
Submission Number: 27