Structuring Semantic Embeddings for Principle Evaluation: A Kernel-Guided Contrastive Learning Approach
Abstract: Reliable post-hoc principle evaluation—verifying whether generated text adheres to predefined human values such as safety, fairness, or helpfulness—is a critical bottleneck in AI alignment. While general-purpose text embeddings are widely deployed for this task, they inherently struggle with fine-grained principle distinctions due to severe feature entanglement. Texts sharing similar vocabulary but representing diametrically opposed principles often collapse into the same representation space, blurring critical decision boundaries. To overcome this limitation without the prohibitive costs of full-parameter fine-tuning, we introduce Kernel-Guided Contrastive Learning (KGCL), a framework that shifts the evaluation paradigm from generic semantic approximation to explicit decision boundary sculpting. Operating as a lightweight module atop frozen generalist encoders, KGCL projects entangled embeddings into a structured, principle-aligned subspace. We mathematically prove that our composite objective enforces a well-defined geometric margin and establishes strict bounds on geometric clustering metrics. Extensive experiments validate these theoretical guarantees, demonstrating that KGCL dramatically enhances the linear separability of highly confusable classes and provides a geometric shield against majority collapse. Remarkably, our explicitly optimized embeddings not only achieve absolute F1 improvements of up to 19.4% over task-agnostic contrastive baselines but also consistently outperform the implicit in-context reasoning of massive generative Large Language Models. Ultimately, KGCL establishes that targeted geometric sculpting provides a highly discriminative, computationally efficient paradigm for robust principle alignment.
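To make the setup concrete, the following is a minimal sketch, not the paper's actual method: a lightweight linear projection placed atop frozen encoder embeddings, trained against a pairwise margin-based contrastive loss that pulls same-principle texts together and pushes different-principle texts at least a margin apart. The function names, the margin value, the subspace dimension, and the plain linear projection are all illustrative assumptions; the abstract does not specify KGCL's composite objective or architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def project(W, X):
    """Map frozen encoder embeddings X (n, d) into a lower-dimensional
    subspace via a learnable matrix W (d, k), then L2-normalize so that
    distances are measured on the unit sphere."""
    Z = X @ W
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

def margin_contrastive_loss(Z, labels, margin=0.5):
    """Pairwise margin loss over projected embeddings Z (n, k):
    same-principle pairs are penalized by their cosine distance,
    different-principle pairs are penalized only if they fall
    within `margin` of each other (hinge on cosine distance)."""
    n = len(labels)
    loss, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - float(Z[i] @ Z[j])        # cosine distance on unit sphere
            if labels[i] == labels[j]:
                loss += d                        # positive pair: pull together
            else:
                loss += max(0.0, margin - d)     # negative pair: enforce margin
            pairs += 1
    return loss / pairs

# Toy example: 4 frozen embeddings (d=8) covering two principles,
# projected into a k=3 subspace by a randomly initialized W.
X = rng.normal(size=(4, 8))      # stand-in for frozen encoder outputs
labels = [0, 0, 1, 1]            # two confusable principle classes
W = rng.normal(size=(8, 3))      # the only trainable parameters
Z = project(W, X)
loss = margin_contrastive_loss(Z, labels, margin=0.5)
```

In a real training loop only `W` would receive gradients, which is what keeps the approach cheap relative to full-parameter fine-tuning of the encoder.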
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Martha_Lewis1
Submission Number: 8424