Track: Tiny Paper Track (between 2 and 4 pages)
Keywords: rainbow teaming, safety, llms, red teaming
TL;DR: We reproduce and extend the Rainbow Teaming methodology to the Polish language and test it on several LLMs
Abstract: The development of multilingual large language models (LLMs) presents challenges in evaluating their safety across all supported languages. Enhancing safety in one language (e.g., English) may inadvertently introduce vulnerabilities in others. To address this issue, we propose a methodology for the automatic creation of red-teaming datasets for safety evaluation, categorizing them by risk type and attack style. We apply our methodology to the Polish language, highlighting the disparity in safe-output generation between a focus on English and a focus on Polish.
Submission Number: 27