Keywords: LLM Safety, Alignment
TL;DR: Bangla LLM safety and alignment benchmarking.
Abstract: We present BanglaGuard, the first comprehensive safety framework for Bengali large language models (LLMs). BanglaGuard introduces a curated dataset of 29,950 safe and unsafe Bangla prompts paired with culturally appropriate refusal responses, and a three-tier defense pipeline combining prompt classification, LoRA-based fine-tuning, and response classification. Across multiple Bangla and multilingual LLMs, fine-tuning improved refusal rates by 25–33 percentage points and sharply reduced unsafe completions. The best-performing model, LLaMA-2-7B-Chat, achieved a refusal rate of 61.0% and reduced unsafe completions to 5.0% with the full framework. These results demonstrate that BanglaGuard provides effective, low-resource safety alignment for Bangla LLMs, offering a scalable blueprint for multilingual safety research.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 24013