CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation

CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation

ICLR 2026 Conference Submission16132 Authors

19 Sept 2025 (modified: 08 Oct 2025)ICLR 2026 Conference SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Keywords: AI Safety, Red-Teaming, Safety Alignment, Korean Red-Teaming

TL;DR: Translating safety benchmark to other cultures considering cultural knowledge to capture socio-technical blindspot in safety evaluation.

Abstract: Existing red-teaming benchmarks, when adapted to new languages via direct translation, fail to capture socio-technical vulnerabilities rooted in local culture and law, creating a critical blind spot in LLM safety evaluation.To address this gap, we introduce CAGE (Culturally Adaptive Generation), a framework that systematically adapts the adversarial intent of proven red-teaming prompts to new cultural contexts. At the core of CAGE is the Semantic Mold, a novel approach that disentangles a prompt’s adversarial structure from its cultural content. This approach enables the modeling of realistic, localized threats rather than testing for simple jailbreaks. As a representative example, we demonstrate our framework by creating KoRSET, a Korean benchmark, which proves more effective at revealing vulnerabilities than direct translation baselines. CAGE offers a scalable solution for developing meaningful, context-aware safety benchmarks across diverse cultures.

Supplementary Material: pdf

Primary Area: datasets and benchmarks

Submission Number: 16132

Loading