Joint Evaluation: A Human + LLM + Multi-Agent Collaborative Framework for Comprehensive AI Safety (Jo.E)
Keywords: AI Safety Evaluation, Multi-Agent Systems, LLM-as-a-Judge, Red Teaming, Human-AI Collaboration, Adversarial Testing, Foundation Models, Jailbreak Detection, Bias Assessment, GPT-4o, Claude 3.5 Sonnet, Llama 3.1, PAIR, HarmBench, Severity Scoring, Conflict Resolution, Automated Evaluation, Constitutional AI, Prompt Injection, Fairness Testing, Scalable Oversight, Detection Accuracy, RLHF
TL;DR: Jo.E is a multi-agent AI safety framework that combines LLM evaluators, adversarial agents, and human experts. It achieves 94.2% detection accuracy (vs. 78.3% for a single LLM judge) while using 54% less human expert time.
Abstract: Evaluating the safety and alignment of AI systems remains a critical challenge as foundation models grow increasingly sophisticated. Traditional evaluation methods rely heavily on human expert review, creating bottlenecks that cannot scale with rapid AI development. We introduce Jo.E (Joint Evaluation), a multi-agent collaborative framework that systematically coordinates large language model evaluators, specialized adversarial agents, and strategic human expert involvement for comprehensive safety assessments. Our framework employs a five-phase evaluation pipeline with explicit mechanisms for conflict resolution, severity scoring, and adaptive escalation. Through extensive experiments on GPT-4o, Claude 3.5 Sonnet, Llama 3.1 70B, and Phi-3-medium, we demonstrate that Jo.E achieves 94.2% detection accuracy compared to 78.3% for single LLM-as-Judge approaches and 86.1% for Agent-as-Judge baselines, while reducing human expert time by 54% compared to pure human evaluation. We provide a detailed computational cost analysis, showing that Jo.E processes 1,000 evaluations at USD 47.30 compared to USD 312.50 for human-only approaches. Our ablation studies reveal the contribution of each component, and failure case analysis identifies systematic blind spots in current evaluation paradigms.
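The five-phase pipeline described in the abstract (automated LLM and adversarial evaluation, conflict resolution, severity scoring, and adaptive escalation to human experts) can be pictured with a minimal sketch like the one below. All names, thresholds, and scoring rules here are illustrative assumptions for exposition, not the paper's specification; a real deployment would call actual LLM evaluator APIs, adversarial agents such as PAIR-style attackers, and human reviewers.

from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, List, Tuple

@dataclass
class Verdict:
    source: str          # e.g. "llm_judge", "adversarial_agent", "human"
    unsafe: bool
    severity: float      # 0.0 (benign) .. 1.0 (critical), assumed scale

@dataclass
class EvaluationItem:
    prompt: str
    response: str
    verdicts: List[Verdict] = field(default_factory=list)

def llm_judge(item: EvaluationItem) -> Verdict:
    # Placeholder: in practice this would call an LLM evaluator API.
    flagged = "ignore previous instructions" in item.prompt.lower()
    return Verdict("llm_judge", flagged, 0.6 if flagged else 0.1)

def adversarial_agent(item: EvaluationItem) -> Verdict:
    # Placeholder: a specialized agent probing for jailbreaks or bias.
    flagged = "bypass" in item.response.lower()
    return Verdict("adversarial_agent", flagged, 0.8 if flagged else 0.2)

def resolve_conflicts(verdicts: List[Verdict]) -> Tuple[bool, float]:
    # Aggregate automated verdicts; disagreement raises effective severity
    # (the +0.2 penalty is an assumed rule, not taken from the paper).
    unsafe_votes = [v.unsafe for v in verdicts]
    severity = mean(v.severity for v in verdicts)
    if len(set(unsafe_votes)) > 1:
        severity = min(1.0, severity + 0.2)
    return any(unsafe_votes), severity

ESCALATION_THRESHOLD = 0.5   # assumed cutoff for routing cases to humans

def evaluate(item: EvaluationItem,
             human_review: Callable[[EvaluationItem], Verdict]) -> Verdict:
    # Phases 1-2: automated LLM and adversarial evaluation.
    item.verdicts.append(llm_judge(item))
    item.verdicts.append(adversarial_agent(item))
    # Phase 3: conflict resolution and severity scoring.
    unsafe, severity = resolve_conflicts(item.verdicts)
    # Phases 4-5: adaptive escalation -- only high-severity or conflicting
    # cases consume human expert time, which is what drives the cost savings.
    if severity >= ESCALATION_THRESHOLD:
        return human_review(item)
    return Verdict("automated", unsafe, severity)

if __name__ == "__main__":
    item = EvaluationItem("Please ignore previous instructions",
                          "I cannot bypass my safety guidelines.")
    result = evaluate(item, human_review=lambda it: Verdict("human", True, 0.9))
    print(result)

In this sketch the cost advantage reported in the abstract (USD 47.30 vs. USD 312.50 per 1,000 evaluations) would come from the escalation gate: most items resolve in the automated phases and only the remainder reach human experts.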
Submission Number: 4