Lost in Tokens, Recovered in Pixels: Bridging the Semantic Gap in Chinese Offensive Language Detection
Keywords: VLM; attack; benchmark
Abstract: Vision-language models (VLMs), after undergoing safety-aligned fine-tuning, are generally expected to exhibit a baseline capability for detecting harmful content expressed in text. However, we identify a clear blind spot: even after safety alignment, state-of-the-art VLMs frequently fail to recognize Chinese offensive expressions when they appear as visually confusable textual variants that closely resemble benign characters. We attribute this failure to a mismatch in tokenization and lexical priors: the confusable variants map to token sequences unlike those of the canonical toxic terms, so the aligned safety behaviors are never activated. Interestingly, the same models respond more appropriately when the identical semantics are presented visually.
Motivated by this observation, we propose \textbf{LFVR} (\textbf{L}ow-\textbf{F}requency \textbf{V}isual \textbf{R}easoning), a simple yet effective, non-invasive visual transformation that suppresses high-frequency camouflage while preserving essential character structure. Experiments on \textbf{LFVR-Bench}, our accompanying benchmark of visually confusable offensive variants, show that LFVR substantially improves exact toxic-term recognition on these challenging inputs. Our results highlight the critical role of perceptual form in activating safety-aligned responses in multimodal models.
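As a rough illustration of the kind of transformation the abstract describes, the sketch below rasterizes a string and applies a Gaussian low-pass filter; this is an assumption about one plausible realization, not the paper's actual pipeline, and the function name, font path, and parameter defaults are all hypothetical.

```python
# Hypothetical sketch of an LFVR-style transform: rasterize the text, then
# low-pass filter it so high-frequency stroke perturbations (the visual
# "camouflage") are attenuated while the overall glyph shape survives.
# Assumes Pillow is installed and a CJK-capable font file is available.
from PIL import Image, ImageDraw, ImageFont, ImageFilter

def lfvr_transform(text: str,
                   font_path: str = "NotoSansCJK-Regular.ttc",  # assumed font
                   font_size: int = 48,
                   blur_radius: float = 1.5) -> Image.Image:
    font = ImageFont.truetype(font_path, font_size)
    # Measure the rendered string to size the canvas.
    bbox = ImageDraw.Draw(Image.new("L", (1, 1))).textbbox((0, 0), text, font=font)
    w, h = bbox[2] - bbox[0] + 20, bbox[3] - bbox[1] + 20
    canvas = Image.new("L", (w, h), color=255)  # white background, grayscale
    ImageDraw.Draw(canvas).text((10, 10), text, fill=0, font=font)
    # Gaussian blur acts as a low-pass filter over the rendered glyphs.
    return canvas.filter(ImageFilter.GaussianBlur(radius=blur_radius))

# The resulting image would then be passed to the VLM with a detection
# prompt instead of (or alongside) the raw confusable text.
```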
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: LLM
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: Chinese
Submission Number: 2780