Lost in Tokens, Recovered in Pixels: Bridging the Semantic Gap in Chinese Offensive Language Detection
Keywords: VLM; attack; benchmark
Abstract: Vision-language models (VLMs), after undergoing safety-aligned fine-tuning, are generally expected to exhibit a baseline capability for detecting harmful content expressed in text. However, we identify a clear blind spot: even after safety alignment, state-of-the-art VLMs frequently fail to recognize Chinese offensive expressions when they appear as visually confusable textual variants that closely resemble benign characters. We attribute this failure to a mismatch in tokenization and lexical priors: the confusable variants map to token sequences unlike those of the canonical toxic terms, so the aligned safety behaviors are never activated. Interestingly, the same models respond more appropriately when the identical semantics are presented visually.
Motivated by this observation, we propose \textbf{LFVR} (\textbf{L}ow-\textbf{F}requency \textbf{V}isual \textbf{R}easoning), a simple yet effective, non-invasive visual transformation that suppresses high-frequency camouflage while preserving essential character structure. Experiments on \textbf{LFVR-Bench}, our accompanying benchmark of visually confusable offensive variants, show that LFVR substantially improves exact toxic-term recognition on these challenging inputs. Our results highlight the critical role of perceptual form in activating safety-aligned responses in multimodal models.
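As a rough illustration of the kind of transformation the abstract describes, the sketch below rasterizes a string and applies a Gaussian low-pass filter; this is an assumption about one plausible realization, not the paper's actual pipeline, and the function name, font path, and parameter defaults are all hypothetical.

```python
# Hypothetical sketch of an LFVR-style transform: rasterize the text, then
# low-pass filter it so high-frequency stroke perturbations (the visual
# "camouflage") are attenuated while the overall glyph shape survives.
# Assumes Pillow is installed and a CJK-capable font file is available.
from PIL import Image, ImageDraw, ImageFont, ImageFilter

def lfvr_transform(text: str,
                   font_path: str = "NotoSansCJK-Regular.ttc",  # assumed font
                   font_size: int = 48,
                   blur_radius: float = 1.5) -> Image.Image:
    font = ImageFont.truetype(font_path, font_size)
    # Measure the rendered string to size the canvas.
    bbox = ImageDraw.Draw(Image.new("L", (1, 1))).textbbox((0, 0), text, font=font)
    w, h = bbox[2] - bbox[0] + 20, bbox[3] - bbox[1] + 20
    canvas = Image.new("L", (w, h), color=255)  # white background, grayscale
    ImageDraw.Draw(canvas).text((10, 10), text, fill=0, font=font)
    # Gaussian blur acts as a low-pass filter over the rendered glyphs.
    return canvas.filter(ImageFilter.GaussianBlur(radius=blur_radius))

# The resulting image would then be passed to the VLM with a detection
# prompt instead of (or alongside) the raw confusable text.
```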
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: LLM
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: Chinese
Submission Number: 2780