Abstract: Lightweight Large Language Models (LLMs) have demonstrated notable safety alignment deficits, particularly outside of English. These challenges are especially acute for Traditional Chinese (TC), stemming from distinct linguistic characteristics and a scarcity of dedicated safety resources. To address this, we introduce the Prompt Assortment for Traditional Chinese Hazards (PATCH) dataset, the first large-scale adversarial dataset tailored for TC safety evaluation, aligned with standard threat taxonomies. Using PATCH, we evaluated Llama Guard, RoBERTa, and Longformer architectures with full fine-tuning, Low-Rank Adaptation (LoRA), and Chat-Vector methods. Our findings demonstrate that parameter-efficient LoRA achieves classification performance (F1 > 0.99) comparable to full fine-tuning, providing an effective and efficient method for developing TC safety classifiers. We also find initial evidence suggesting targeted LoRA tuning may offer cross-lingual safety benefits.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation, datasets for low resource languages, evaluation methodology, benchmarking
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: Traditional Chinese, English
Submission Number: 215