Jo.E (Joint Evaluation): A Multi-Agent Collaborative Framework for Comprehensive AI Safety Evaluation

08 Nov 2025 (modified: 27 Nov 2025). Submitted to E-SARS. License: CC BY 4.0
Keywords: AI Safety Evaluation, Multi-Agent Systems, Human-AI Collaboration, Foundation Model Assessment, Adversarial Robustness, Automated Testing, LLM Evaluation
TL;DR: Jo.E is a multi-agent AI safety framework that improves vulnerability detection by 22% while cutting human expert time by 54% through strategic coordination of automated evaluators and human reviewers.
Abstract: Evaluating the safety and alignment of AI systems remains a critical challenge as foundation models grow increasingly sophisticated. Traditional evaluation methods rely heavily on human expert review, creating bottlenecks that cannot scale with the rapid pace of AI development. We introduce Jo.E (Joint Evaluation), a novel multi-agent collaborative framework that combines large language model evaluators, specialized AI agents, and strategic human expert involvement to conduct comprehensive safety assessments. Our framework employs a five-phase evaluation pipeline that systematically identifies vulnerabilities across multiple safety dimensions, including adversarial robustness, fairness, ethics, and accuracy. Through extensive experiments on state-of-the-art models, including GPT-4o, GPT-5, Llama 3.2, Phi 3, and Claude Sonnet 4, we demonstrate that Jo.E achieves an approximately 22% improvement in vulnerability detection while reducing human expert time requirements by 54% compared to traditional evaluation approaches. Our results show that automated collaborative evaluation can significantly enhance both the efficiency and effectiveness of AI safety assessment without sacrificing rigor or comprehensive coverage.
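
To illustrate the kind of orchestration the abstract describes, the sketch below shows a minimal multi-phase evaluation pipeline in which automated phases run first and only high-severity findings are escalated to human reviewers. This is an illustrative assumption, not the paper's implementation: the abstract does not enumerate the five phases, so the phase names, class names (`Finding`, `EvaluationReport`), and the `human_review_threshold` parameter are all hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical phase names: the paper specifies a five-phase pipeline,
# but the abstract does not name the phases, so these are placeholders.
PHASES = [
    "scenario_generation",
    "llm_evaluation",
    "agent_probing",
    "triage",
    "human_review",
]


@dataclass
class Finding:
    """One potential vulnerability surfaced during evaluation."""
    phase: str
    dimension: str          # e.g. robustness, fairness, ethics, accuracy
    description: str
    severity: float         # 0.0 (benign) .. 1.0 (critical)


@dataclass
class EvaluationReport:
    model_name: str
    findings: List[Finding] = field(default_factory=list)


def run_pipeline(
    model_name: str,
    phase_runners: Dict[str, Callable[[str], List[Finding]]],
    human_review_threshold: float = 0.7,
) -> EvaluationReport:
    """Run automated phases first; escalate only high-severity findings to humans."""
    report = EvaluationReport(model_name=model_name)

    # Automated phases: LLM evaluators and specialized agents.
    for phase in PHASES[:-1]:
        runner = phase_runners.get(phase)
        if runner is not None:
            report.findings.extend(runner(model_name))

    # Strategic human involvement: only findings above the threshold are escalated,
    # which is how a framework like this could cut human expert time.
    escalated = [f for f in report.findings if f.severity >= human_review_threshold]
    human_runner = phase_runners.get("human_review")
    if human_runner is not None and escalated:
        report.findings.extend(human_runner(model_name))

    return report


if __name__ == "__main__":
    # Stub runner standing in for an LLM-based evaluator.
    def llm_eval(model: str) -> List[Finding]:
        return [Finding("llm_evaluation", "robustness",
                        f"{model}: prompt-injection bypass candidate", 0.8)]

    report = run_pipeline("example-model", {"llm_evaluation": llm_eval})
    for f in report.findings:
        print(f.phase, f.dimension, f.severity, f.description)
```

The design point this sketch makes concrete is the division of labor: cheap automated evaluators generate and triage findings at scale, while human expertise is reserved for the subset that clears a severity threshold.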
Submission Number: 8