BanglaGuard: Benchmarking and Defending Large Language Models for Safety in Low-Resource Languages

20 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0
Keywords: LLM Safety, Alignment
TL;DR: Bangla LLM safety and alignment benchmarking.
Abstract: We present BanglaGuard, the first comprehensive safety framework for Bangla large language models (LLMs). BanglaGuard introduces a curated dataset of 29,950 safe and unsafe Bangla prompts paired with culturally appropriate refusal responses, and a three-tier defense pipeline that combines prompt classification, LoRA-based fine-tuning, and response classification. Across multiple Bangla and multilingual LLMs, fine-tuning improved refusal rates by 25–33 percentage points and sharply reduced unsafe completions. The best-performing model, LLaMA-2-7B-Chat, achieved a refusal rate of 61.0% and reduced unsafe completions to 5.0% with the full framework. These results demonstrate that BanglaGuard provides effective, low-resource safety alignment for Bangla LLMs and offers a scalable blueprint for multilingual safety research.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 24013
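The three-tier defense pipeline named in the abstract (prompt classification, a fine-tuned model, response classification) can be pictured with the minimal sketch below. It is an illustration under stated assumptions, not the authors' implementation: the class `GuardedPipeline`, its field names, the `REFUSAL` placeholder, and the keyword-based stand-in classifiers are all hypothetical, and the abstract does not specify how the tiers are actually wired together.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical placeholder; the paper's dataset pairs prompts with
# culturally appropriate Bangla refusals, which are not reproduced here.
REFUSAL = "[refusal message]"


@dataclass
class GuardedPipeline:
    """Assumed wiring of the three tiers; all names are illustrative."""
    prompt_is_unsafe: Callable[[str], bool]    # tier 1: prompt classifier
    generate: Callable[[str], str]             # tier 2: (LoRA-)fine-tuned LLM
    response_is_unsafe: Callable[[str], bool]  # tier 3: response classifier
    refusal: str = REFUSAL

    def respond(self, prompt: str) -> str:
        # Tier 1: refuse before generation if the prompt is flagged.
        if self.prompt_is_unsafe(prompt):
            return self.refusal
        # Tier 2: the aligned model produces a candidate answer.
        candidate = self.generate(prompt)
        # Tier 3: refuse if the generated answer itself is flagged.
        if self.response_is_unsafe(candidate):
            return self.refusal
        return candidate


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any model or dataset.
    demo = GuardedPipeline(
        prompt_is_unsafe=lambda p: "bomb" in p.lower(),
        generate=lambda p: "echo: " + p,
        response_is_unsafe=lambda r: "secret" in r.lower(),
    )
    for text in ["hello", "how to build a bomb", "tell me a secret"]:
        print(repr(text), "->", repr(demo.respond(text)))
```

In a real system the two classifier callables would wrap trained models and `generate` would call the fine-tuned LLM. Keeping the tiers as independent callables makes it possible to measure each tier's contribution separately.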