Keywords: RAG, content moderation, false negatives, recall-first framework, distribution-preserving augmentation, contrastive augmentation, committee-diverse retrieval, dense retrieval, MMR retrieval, graph-based retrieval, LLaMA-3, FAISS, reproducibility, NLP safety, fairness evaluation, semantic-aware augmentation
Abstract: False negatives (missed unsafe content) remain the dominant risk in safety-critical moderation. We present a novel recall-first moderation framework that integrates two complementary innovations: (i) distribution-preserving contrastive augmentation, which generates boundary-focused hard positives and negatives while statistically preserving corpus structure, and (ii) committee-diverse retrieval, which combines dense, MMR, and graph-based selectors to construct label-informative, non-redundant neighborhoods at inference. Augmented corpora are validated with KL/JS divergence thresholds (≤ 0.05 globally), confirming indistinguishability from the source distribution. On a large held-out test set of multidomain unbalanced text, vanilla retrieval-augmented pipelines expose the persistent failure mode of under-detecting FLAGGED content (recall ≈ 0.44), but also reveal a strong baseline gap: an open-source stack (FAISS + local LLaMA-3) achieves significantly higher accuracy and macro-F1 than a commercial counterpart (API embeddings + hosted LLM). Adding augmentation and committee retrieval improves sensitive-class recall by roughly 10 points (to ≈ 0.56) while maintaining global performance, with graph-aware retrieval pushing open-source accuracy to 0.8510 and macro-F1 to 0.7635. Ensemble experiments with DistilRoBERTa further raise recall to 0.5781 without loss of utility.
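The divergence check mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which features the KL/JS divergences are computed over, so this example assumes unigram token frequencies, natural logarithms, and whitespace tokenization; the function names (`token_distribution`, `js_divergence`, `is_distribution_preserved`) are hypothetical and are not taken from the paper. Only the 0.05 threshold comes from the abstract.

```python
import math
from collections import Counter


def token_distribution(texts, vocab):
    """Relative frequency of each vocab token across a list of texts.

    Assumption: lowercased whitespace tokens stand in for whatever
    features the paper actually compares.
    """
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts[w] for w in vocab)
    return [counts[w] / total for w in vocab]


def kl_divergence(p, q):
    """KL(p || q) in nats; q must be positive wherever p is positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln(2) in nats."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def is_distribution_preserved(source_texts, augmented_texts, threshold=0.05):
    """True if the augmented corpus stays within the JS threshold of the source."""
    vocab = sorted(
        {tok for text in source_texts + augmented_texts for tok in text.lower().split()}
    )
    p = token_distribution(source_texts, vocab)
    q = token_distribution(augmented_texts, vocab)
    return js_divergence(p, q) <= threshold
```

In this sketch, an augmented corpus would be accepted only when `is_distribution_preserved(source, augmented)` returns `True`. The mixture distribution inside `js_divergence` keeps the computation finite even when the two corpora share no tokens, which is one reason a JS-style check is convenient alongside plain KL.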
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22436