ToxiTrace: Gradient-Aligned Training for Explainable Chinese Toxicity Detection

ACL ARR 2026 January Submission 10472 Authors

06 Jan 2026 (modified: 20 Mar 2026) | ACL ARR 2026 January Submission | CC BY 4.0

Keywords: Chinese toxic content detection, interpretability of models, attribution method, contrastive learning

Abstract: Existing Chinese toxic content detection methods mainly target sentence-level classification but often fail to provide readable and contiguous toxic evidence spans. We propose ToxiTrace, an explainability-oriented method for BERT-style encoders with three components: (1) CuSA, which refines encoder-derived saliency cues into fine-grained toxic spans with lightweight LLM guidance; (2) GCLoss, a gradient-constrained objective that concentrates token-level saliency on toxic evidence while suppressing irrelevant activations; and (3) ARCL, which constructs sample-specific contrastive reasoning pairs to sharpen the semantic boundary between toxic and non-toxic content. Experiments show that ToxiTrace improves classification accuracy and toxic span extraction while preserving efficient encoder-based inference and producing more coherent, human-readable explanations.
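To make the idea of "contiguous toxic evidence spans" concrete, the following is a minimal sketch of one generic way to turn per-token saliency scores into spans: threshold the scores, then merge salient runs separated by a small gap. It is not the paper's CuSA procedure, which additionally uses LLM guidance that the abstract does not specify. The function name `saliency_to_spans`, the `threshold` and `max_gap` parameters, and the example scores are all assumptions made for illustration.

```python
from typing import List, Tuple


def saliency_to_spans(
    scores: List[float],
    threshold: float = 0.5,
    max_gap: int = 1,
) -> List[Tuple[int, int]]:
    """Group salient token indices into contiguous half-open (start, end) spans.

    Hypothetical helper, not the paper's CuSA: a token is salient when its
    score is >= threshold, and two salient runs separated by at most
    `max_gap` non-salient tokens are merged into a single span.
    """
    spans: List[Tuple[int, int]] = []
    for i, score in enumerate(scores):
        if score < threshold:
            continue
        # spans[-1][1] is the exclusive end of the previous span, so
        # i - spans[-1][1] counts the non-salient tokens in between.
        if spans and i - spans[-1][1] <= max_gap:
            spans[-1] = (spans[-1][0], i + 1)
        else:
            spans.append((i, i + 1))
    return spans


# Illustrative scores for a 7-token input (made-up values).
example_scores = [0.1, 0.9, 0.2, 0.8, 0.1, 0.1, 0.7]
merged = saliency_to_spans(example_scores, threshold=0.5, max_gap=1)
strict = saliency_to_spans(example_scores, threshold=0.5, max_gap=0)
print(merged)  # [(1, 4), (6, 7)]
print(strict)  # [(1, 2), (3, 4), (6, 7)]
```

Allowing a small gap trades a little precision for readability: isolated salient tokens become a single highlighted phrase instead of scattered fragments, which is the kind of fragmentation the abstract says existing methods suffer from.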

Paper Type: Long

Research Area: Interpretability and Analysis of Models for NLP

Research Area Keywords: explanation faithfulness, feature attribution, contrastive explanations

Contribution Types: Model analysis & interpretability, Approaches to low-compute settings (efficiency)

Languages Studied: Chinese

Submission Number: 10472
