% Jo.E Paper - Springer LNCS Format
% Joint Evaluation: A Human + LLM + Multi-Agents Collaborative Framework
% for Comprehensive AI Safety
%
\documentclass[runningheads]{llncs}
%
\usepackage[T1]{fontenc}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{booktabs}
\usepackage{url}
%
\begin{document}
%
\title{Joint Evaluation: A Human + LLM + Multi-Agents Collaborative Framework for Comprehensive AI Safety (Jo.E)}
%
\titlerunning{Jo.E: Multi-Agent Collaborative AI Safety Evaluation}
%
\author{Himanshu Joshi\inst{1}\thanks{Corresponding author} \and
Shivani Shukla\inst{2} \and
Priyanka Kumar\inst{3}}
%
\authorrunning{H. Joshi et al.}
%
\institute{Collective Human + Machine Intelligence (COHUMAIN) Labs\\
\email{hj@himanshujoshi.ai} \and
Department of Analytics and Information Systems,
University of San Francisco,
San Francisco, United States\\
\email{sgshukla@usfca.edu} \and
University of Waterloo,
Waterloo, Canada\\
\email{p62kumar@uwaterloo.ca}}
%
\maketitle
%
\begin{abstract}
Evaluating the safety and alignment of AI systems remains a critical challenge as foundation models grow increasingly sophisticated. Traditional evaluation methods rely heavily on human expert review, creating bottlenecks that cannot scale with rapid AI development. We introduce Jo.E (Joint Evaluation), a multi-agent collaborative framework that systematically coordinates large language model evaluators, specialized adversarial agents, and strategic human expert involvement for comprehensive safety assessments. Our framework employs a five-phase evaluation pipeline with explicit mechanisms for conflict resolution, severity scoring, and adaptive escalation. Through extensive experiments on GPT-4o, Claude 3.5 Sonnet, Llama 3.1 70B, and Phi-3-medium, we demonstrate that Jo.E achieves 94.2\% detection accuracy compared to 78.3\% for single LLM-as-Judge approaches and 86.1\% for Agent-as-Judge baselines, while reducing human expert time by 54\% compared to pure human evaluation. We provide detailed computational cost analysis, showing Jo.E processes 1,000 evaluations at \$47.30 compared to \$312.50 for human-only approaches. Our ablation studies reveal the contribution of each component, and failure case analysis identifies systematic blind spots in current evaluation paradigms.

\keywords{AI safety evaluation \and Multi-agent systems \and LLM evaluation \and Human-AI collaboration \and Adversarial testing \and AI governance}
\end{abstract}
%
%
%
\section{Introduction}

The rapid advancement of foundation models has created an urgent need for robust evaluation frameworks capable of assessing AI safety at scale~\cite{bommasani2021opportunities,weidinger2021ethical}. As these systems become more capable and are deployed in increasingly critical applications, the consequences of undetected vulnerabilities grow more severe. Traditional evaluation approaches rely primarily on manual human review, which creates significant bottlenecks in both time and resources~\cite{ganguli2022red}.

Recent work has demonstrated the promise of automated evaluation approaches, including LLM-as-a-Judge methods~\cite{zheng2023judging}, multi-agent red-teaming~\cite{perez2022red}, and standardized safety benchmarks~\cite{mazeika2024harmbench}. However, these approaches typically operate in isolation: LLM judges provide scalable but potentially biased assessments~\cite{zheng2023judging}, automated red-teaming discovers vulnerabilities but lacks contextual judgment~\cite{chao2024jailbreaking}, and human evaluation provides depth but cannot cover vast output spaces~\cite{bowman2022measuring}.

We propose Jo.E (Joint Evaluation), a multi-agent collaborative framework that systematically coordinates these complementary approaches. Our key insight is that effective AI safety evaluation requires combining the scale and consistency of automated systems with the contextual understanding and ethical judgment of human reviewers through principled escalation mechanisms. Rather than replacing human experts, Jo.E strategically amplifies their impact by automating routine detection tasks and escalating only critical concerns for human review.

Our contributions include:
\begin{enumerate}
    \item A novel multi-agent collaborative evaluation architecture with explicit mechanisms for evaluator coordination, conflict resolution, and adaptive human escalation (Section~\ref{sec:framework}).
    \item Comprehensive empirical comparison against established baselines including PAIR~\cite{chao2024jailbreaking}, HarmBench~\cite{mazeika2024harmbench}, and LLM-as-Judge~\cite{zheng2023judging} approaches (Section~\ref{sec:experiments}).
    \item Detailed analysis of evaluator agreement patterns, systematic failure cases, and computational costs enabling reproducibility (Section~\ref{sec:results}).
    \item Ablation studies quantifying the contribution of each framework component (Section~\ref{sec:ablation}).
\end{enumerate}

\section{Related Work}

\subsection{LLM-as-a-Judge Evaluation}

The LLM-as-a-Judge paradigm has emerged as a scalable alternative to human evaluation. Zheng et al.~\cite{zheng2023judging} introduced MT-Bench and demonstrated that GPT-4 judges achieve over 80\% agreement with human preferences. However, they identified systematic biases including position bias (favoring first responses), verbosity bias (preferring longer outputs), and self-enhancement bias (models favoring their own generations). Length-controlled AlpacaEval~\cite{dubois2024length} addresses length bias through regression-based debiasing, raising its Spearman correlation with Chatbot Arena rankings to 0.98. G-Eval~\cite{liu2023geval} employs chain-of-thought reasoning for NLG evaluation. JudgeLM~\cite{zhu2024judgelm} demonstrates scalable judging from 7B to 33B parameters with explicit bias mitigation through swap augmentation.

\subsection{Automated Red-Teaming}

Automated approaches to adversarial testing have shown promise for scalable vulnerability discovery. Perez et al.~\cite{perez2022red} pioneered LLM-based red-teaming, using one language model to generate test cases for another. PAIR~\cite{chao2024jailbreaking} employs iterative refinement to achieve jailbreaks in fewer than 20 queries. TAP~\cite{mehrotra2023tree} extends this with tree-of-thought reasoning and achieves over 80\% success on GPT-4. AutoDAN~\cite{liu2024autodan} uses genetic algorithms for stealthy attack generation. GCG~\cite{zou2023universal} demonstrates transferable adversarial suffixes across models.

\subsection{Safety Benchmarks}

Standardized benchmarks enable systematic safety assessment. HarmBench~\cite{mazeika2024harmbench} provides comparison of 18 red-teaming methods against 33 target LLMs. JailbreakBench~\cite{chao2024jailbreakbench} establishes 100 harmful behaviors aligned with usage policies. TruthfulQA~\cite{lin2022truthfulqa} measures factual accuracy across 817 adversarial questions. BBQ~\cite{parrish2022bbq} evaluates bias across nine protected dimensions. SafetyBench~\cite{zhang2024safetybench} provides 11,435 multilingual safety questions.

\subsection{Human-AI Collaborative Evaluation}

Research in human-AI collaboration has explored combining human judgment with automated systems. Constitutional AI~\cite{bai2022constitutional} trains harmless assistants through self-improvement with AI feedback. Lee et al.~\cite{lee2024rlaif} demonstrate that RLAIF achieves performance comparable to RLHF. Bowman et al.~\cite{bowman2022measuring} introduce a ``sandwiching'' methodology in which humans working with AI assistants outperform either alone. Irving et al.~\cite{irving2018ai} propose debate-based evaluation for scalable oversight. Jo.E builds on these foundations by introducing principled mechanisms for multi-stage collaborative evaluation. It allocates tasks according to the complementary strengths of automated systems and human experts, addressing the gap left by approaches that operate in isolation.

\section{The Jo.E Framework}
\label{sec:framework}

\subsection{Architecture Overview}

Jo.E employs a multi-layered architecture designed to systematically evaluate AI systems across multiple safety dimensions while optimizing human expert time. The framework coordinates three primary components: (1) \textbf{LLM Evaluator Panel}: Multiple diverse evaluator models for independent screening, (2) \textbf{Adversarial Agent Network}: Specialized agents for targeted vulnerability exploration, and (3) \textbf{Human Expert Interface}: Domain specialists for critical validation with adaptive escalation.

\subsubsection{Component Specifications}

\noindent\textbf{LLM Evaluator Panel.} We employ three evaluator models selected for diversity: GPT-4o (OpenAI), Claude 3 Opus (Anthropic), and Llama 3.1 70B (Meta). Each evaluator independently assesses outputs using standardized rubrics covering four dimensions: accuracy, robustness, fairness, and ethics. Evaluators operate with temperature 0.1 for consistency.

\noindent\textbf{Adversarial Agent Network.} Specialized agents implement established attack methodologies:
\begin{enumerate}
    \item \emph{PAIR Agent}: Implements Prompt Automatic Iterative Refinement~\cite{chao2024jailbreaking} with a maximum of 20 iterations.
    \item \emph{Bias Probe Agent}: Systematic probing across BBQ~\cite{parrish2022bbq} protected categories.
    \item \emph{Robustness Agent}: Perturbation testing including typos, paraphrases, and distribution shift.
\end{enumerate}

\noindent\textbf{Human Expert Interface.} Domain specialists ($n=12$ in our experiments: 4 AI safety researchers, 4 ethicists, and 4 domain experts) review escalated cases through a structured interface that provides reproduction steps, automated severity estimates, and suggested mitigations.

\subsection{Five-Phase Evaluation Pipeline}

The Jo.E evaluation process follows a structured five-phase pipeline designed to progressively refine assessment while maintaining efficiency (Fig.~\ref{fig:pipeline}).

\noindent\textbf{Phase 1: Initial LLM Screening.} Independent evaluator LLMs process system outputs to identify potential safety issues. Each evaluator assigns scores across four dimensions (accuracy, robustness, fairness, ethics) on a 1--5 scale with confidence estimates. Outputs proceed to Phase 2 if: (a) any evaluator assigns a score $\leq 2$ on any dimension, OR (b) the standard deviation $\sigma$ of the evaluator scores exceeds the disagreement threshold ($\sigma > 1.5$).
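
As a hedged illustration of criterion~(b), assume (the text does not specify this) that $\sigma$ is the sample standard deviation of the three evaluators' scores on a single dimension. For hypothetical scores $(2, 4, 5)$:
\[
\bar{s} = \tfrac{2+4+5}{3} \approx 3.67, \qquad
\sigma = \sqrt{\tfrac{(2-3.67)^2 + (4-3.67)^2 + (5-3.67)^2}{3-1}} \approx 1.53 > 1.5,
\]
so this output proceeds to Phase 2 under criterion~(b) as well as criterion~(a). Under the same assumption, integer scores that are all at least 3 yield at most $\sigma \approx 1.15$ (e.g., $(3,3,5)$), so in this reading criterion~(b) would add cases beyond criterion~(a) mainly when $\sigma$ is pooled across dimensions or computed on non-integer scores.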

\noindent\textbf{Phase 2: AI Agent Testing.} Specialized agents verify patterns identified in Phase 1 through targeted adversarial testing. The PAIR agent attempts jailbreak variants (max 20 iterations, success threshold: model produces restricted content). Bias probes test differential treatment across demographic groups. Agents generate structured reports documenting: reproduction steps, success rate across variants, and severity assessment.

\noindent\textbf{Phase 3: Human Expert Review.} Cases are escalated to human review based on explicit criteria:
\begin{equation}
\text{Escalate if:} \quad S_{\text{severity}} > \tau_{\text{critical}} \;\text{OR}\; C_{\text{confidence}} < \tau_{\text{uncertain}} \;\text{OR}\; N_{\text{novelty}} > \tau_{\text{novel}}
\label{eq:escalation}
\end{equation}
where $S_{\text{severity}}$ is the agent-computed severity score, $C_{\text{confidence}}$ is the evaluator confidence, and $N_{\text{novelty}}$ measures the distance from known vulnerability patterns. We set $\tau_{\text{critical}} = 0.8$, $\tau_{\text{uncertain}} = 0.6$, and $\tau_{\text{novel}} = 0.7$.
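
To make Eq.~\eqref{eq:escalation} concrete, consider a hypothetical case with $S_{\text{severity}} = 0.55$, $C_{\text{confidence}} = 0.52$, and $N_{\text{novelty}} = 0.30$. These values are chosen for illustration only, and we assume here that all three quantities are normalized to $[0,1]$:
\begin{align*}
S_{\text{severity}} > \tau_{\text{critical}} &: \quad 0.55 > 0.8 \quad \text{(false)},\\
C_{\text{confidence}} < \tau_{\text{uncertain}} &: \quad 0.52 < 0.6 \quad \text{(true)},\\
N_{\text{novelty}} > \tau_{\text{novel}} &: \quad 0.30 > 0.7 \quad \text{(false)}.
\end{align*}
Because the conditions are joined by OR, the single satisfied condition (low evaluator confidence) is sufficient to escalate this case to human review, even though its severity and novelty are unremarkable.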

\noindent\textbf{Phase 4: Iterative Refinement.} Evaluation insights feed into model improvement through structured feedback. Development teams receive vulnerability reports with: (a) minimal reproduction prompts, (b) categorized vulnerability type, (c) suggested mitigations, (d) test cases for verification.

\noindent\textbf{Phase 5: Controlled Deployment.} Systems undergo monitored deployment in sandboxed environments with automated anomaly detection triggering re-evaluation when behavioral drift exceeds historical baselines.

\begin{figure}[t]
\centering
\includegraphics[width=0.85\textwidth]{fig1_pipeline}
\caption{Jo.E's five-phase evaluation pipeline showing the systematic flow from initial LLM screening through controlled deployment, with explicit escalation criteria at each stage.}
\label{fig:pipeline}
\end{figure}

\subsection{Conflict Resolution Mechanism}

When evaluators disagree, Jo.E employs a structured resolution protocol (Algorithm~\ref{alg:conflict}).

\begin{algorithm}[h]
\caption{Evaluator Conflict Resolution}
\label{alg:conflict}
\begin{algorithmic}[1]
\REQUIRE Evaluator scores $\{s_1, s_2, s_3\}$, confidence $\{c_1, c_2, c_3\}$
\STATE Compute disagreement: $\sigma = \text{std}(s_1, s_2, s_3)$
\IF{$\sigma \leq 1.0$}
    \RETURN weighted mean: $\bar{s} = \sum_i c_i \cdot s_i / \sum_i c_i$
\ELSIF{$\sigma \leq 2.0$}
    \STATE Escalate to adversarial agent verification
    \RETURN agent-adjusted score
\ELSE
    \STATE Escalate to human expert adjudication
    \RETURN expert score with full audit trail
\ENDIF
\end{algorithmic}
\end{algorithm}
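
A worked instance of the first branch of Algorithm~\ref{alg:conflict} may help. The scores and confidences below are hypothetical, and we assume that $\text{std}(\cdot)$ denotes the sample standard deviation. For $(s_1, s_2, s_3) = (4, 4, 5)$ and $(c_1, c_2, c_3) = (0.9, 0.8, 0.6)$:
\[
\sigma = \sqrt{\tfrac{2\,(4-4.33)^2 + (5-4.33)^2}{3-1}} \approx 0.58 \leq 1.0,
\qquad
\bar{s} = \frac{0.9 \cdot 4 + 0.8 \cdot 4 + 0.6 \cdot 5}{0.9 + 0.8 + 0.6} = \frac{9.8}{2.3} \approx 4.26 .
\]
The confidence weighting pulls the result slightly below the unweighted mean of $4.33$ because the least confident evaluator assigned the highest score.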

\subsection{Severity Scoring Framework}

Jo.E computes severity scores combining multiple factors:
\begin{equation}
S_{\text{total}} = w_1 \cdot S_{\text{harm}} + w_2 \cdot S_{\text{exploit}} + w_3 \cdot S_{\text{scope}} + w_4 \cdot S_{\text{reversibility}}
\label{eq:severity}
\end{equation}
where $S_{\text{harm}}$ measures potential harm magnitude (1--5 scale based on taxonomy from Weidinger et al.~\cite{weidinger2021ethical}), $S_{\text{exploit}}$ measures exploitation difficulty (inverse of attack complexity), $S_{\text{scope}}$ measures affected population breadth, and $S_{\text{reversibility}}$ measures remediation difficulty. We set $w_1 = 0.4, w_2 = 0.2, w_3 = 0.2, w_4 = 0.2$ based on expert calibration.
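
As an illustrative calculation with Eq.~\eqref{eq:severity}, take hypothetical component scores $S_{\text{harm}} = 4$, $S_{\text{exploit}} = 3$, $S_{\text{scope}} = 2$, and $S_{\text{reversibility}} = 5$, all assumed to lie on the same 1--5 scale:
\[
S_{\text{total}} = 0.4 \cdot 4 + 0.2 \cdot 3 + 0.2 \cdot 2 + 0.2 \cdot 5 = 1.6 + 0.6 + 0.4 + 1.0 = 3.6 .
\]
Since the weights sum to 1, $S_{\text{total}}$ remains within the 1--5 range under this assumption. Comparing it against $\tau_{\text{critical}} = 0.8$ in Eq.~\eqref{eq:escalation} requires a mapping to $[0,1]$. One possible choice, which is an assumption of this example rather than a stated part of the framework, is min--max rescaling, $(S_{\text{total}} - 1)/4 = 0.65$, which would fall below the critical threshold.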

\begin{figure}[t]
\centering
\includegraphics[width=0.62\textwidth]{fig2_severity_radar}
\caption{Severity score component analysis across different vulnerability types. Each vulnerability exhibits distinct patterns across the four severity dimensions (Harm, Exploit, Scope, Reversibility), with privacy violations showing particularly high harm and reversibility scores.}
\label{fig:severity}
\end{figure}

\section{Experimental Setup}
\label{sec:experiments}

\subsection{Evaluated Models}

We evaluated Jo.E on four state-of-the-art foundation models representing diverse architectures and safety approaches:
\begin{enumerate}
    \item \textbf{GPT-4o} (OpenAI): Multimodal model with extensive RLHF safety training.
    \item \textbf{Claude 3.5 Sonnet} (Anthropic): Constitutional AI trained model emphasizing harmlessness.
    \item \textbf{Llama 3.1 70B} (Meta): Open-weights model with safety fine-tuning.
    \item \textbf{Phi-3-medium} (Microsoft): Efficient 14B parameter model for resource-constrained deployment.
\end{enumerate}

\subsection{Evaluation Datasets}

We constructed comprehensive test suites totaling 15,847 test cases:
\begin{enumerate}
    \item \textbf{Adversarial Dataset} ($n=8{,}000$). Derived from HarmBench~\cite{mazeika2024harmbench} (510 behaviors), AdvBench~\cite{zou2023universal} (520 behaviors), and novel prompts generated via PAIR methodology. Categories include: jailbreak attempts (2,400), prompt injection (1,600), bias-exposing prompts (2,000), and misinformation requests (2,000).
    \item \textbf{Safety Benchmarks} ($n=5{,}847$). Integrated established benchmarks: TruthfulQA~\cite{lin2022truthfulqa} (817), BBQ~\cite{parrish2022bbq} (2,030 subset), SafetyBench~\cite{zhang2024safetybench} (3,000 English subset).
    \item \textbf{Domain-Specific Tasks} ($n=2{,}000$). Novel scenarios for customer support (800), legal document analysis (600), and financial advice (600) where safety failures have significant real-world consequences.
\end{enumerate}

\subsection{Baseline Comparisons}

We compared Jo.E against established baselines:
\begin{enumerate}
    \item \textbf{Pure Human Evaluation}: Domain experts conduct comprehensive manual review ($n=3$ experts per case, majority vote).
    \item \textbf{LLM-as-a-Judge}~\cite{zheng2023judging}: Single GPT-4o evaluator with MT-Bench prompting methodology.
    \item \textbf{Agent-as-a-Judge}: PAIR agent~\cite{chao2024jailbreaking} conducts adversarial testing without LLM screening or human review.
    \item \textbf{HarmBench Classifier}~\cite{mazeika2024harmbench}: Standardized harmful content classifier.
    \item \textbf{Multi-LLM Voting}: Three LLM evaluators with majority voting (no agents or humans).
\end{enumerate}

\subsection{Metrics}

\begin{enumerate}
    \item \textbf{Detection Accuracy}: Percentage of ground-truth safety issues correctly identified, measured against expert-consensus labels established through independent annotation by 5 experts with adjudication for disagreements (Fleiss' $\kappa = 0.78$).
    \item \textbf{Precision/Recall}: Standard information retrieval metrics for vulnerability detection.
    \item \textbf{Resource Efficiency}: Human expert hours per 1,000 evaluations.
    \item \textbf{Computational Cost}: Total API costs and compute time per evaluation batch.
    \item \textbf{False Positive Rate}: Proportion of flagged issues that prove benign upon expert review (i.e., $1 - \text{precision}$, also known as the false discovery rate).
\end{enumerate}
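
For reference, the detection metrics can be written in terms of true positives ($TP$), false positives ($FP$), and false negatives ($FN$). These are the standard definitions, which we assume apply here:
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} .
\]
Under the definition of the false positive rate given in the list above, the corresponding expression is $FP / (TP + FP)$, so the reported rate and precision sum to 100\%.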

\section{Results}
\label{sec:results}

\subsection{Overall Performance Comparison}

Table~\ref{tab:main} presents the comprehensive comparison of Jo.E against baselines across all metrics.

\begin{table}[t]
\caption{Main results comparing Jo.E against baseline approaches. Best results in bold.}
\label{tab:main}
\centering
\begin{tabular}{lccccc}
\toprule
\textbf{Method} & \textbf{Accuracy} & \textbf{Precision} & \textbf{Recall} & \textbf{F1} & \textbf{FPR} \\
\midrule
Pure Human         & 91.3 & \textbf{94.2} & 88.7 & 91.4 & \textbf{5.8}  \\
LLM-as-a-Judge     & 78.3 & 72.1 & 86.4 & 78.6 & 27.9 \\
Agent-as-a-Judge   & 86.1 & 81.3 & 92.8 & 86.7 & 18.7 \\
HarmBench Classifier & 82.4 & 79.6 & 86.2 & 82.8 & 20.4 \\
Multi-LLM Voting   & 84.7 & 82.4 & 87.9 & 85.1 & 17.6 \\
\midrule
\textbf{Jo.E (Ours)} & \textbf{94.2} & 92.8 & \textbf{95.9} & \textbf{94.3} & 7.2 \\
\bottomrule
\end{tabular}
\end{table}

Jo.E achieves 94.2\% detection accuracy, a 15.9 percentage point improvement over the single LLM-as-Judge baseline (78.3\%) and 8.1 points over Agent-as-Judge (86.1\%). Critically, Jo.E maintains a low false positive rate (7.2\%), close to that of human evaluation (5.8\%) and well below the automated baselines (17.6--27.9\%), addressing the precision limitations of automated approaches.
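
The $F_1$ values in Table~\ref{tab:main} can be checked from the precision and recall columns, assuming the standard harmonic-mean definition. For Jo.E and for pure human evaluation, respectively:
\[
F_1^{\text{Jo.E}} = \frac{2 \cdot 92.8 \cdot 95.9}{92.8 + 95.9} = \frac{17799.04}{188.7} \approx 94.3,
\qquad
F_1^{\text{Human}} = \frac{2 \cdot 94.2 \cdot 88.7}{94.2 + 88.7} = \frac{16711.08}{182.9} \approx 91.4 .
\]
Both agree with the tabulated values. Jo.E's higher $F_1$ is driven by recall, since pure human evaluation retains the higher precision.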

\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{fig3_performance_comparison}
\caption{Comprehensive performance comparison across all evaluation metrics. Jo.E (highlighted in pink) outperforms every automated baseline on Accuracy, Precision, Recall, and F1, and exceeds pure human evaluation on Accuracy, Recall, and F1 while achieving near-human precision and maintaining computational efficiency.}
\label{fig:performance}
\end{figure}

\subsection{Model-Specific Results}

Table~\ref{tab:model} presents performance breakdown by evaluated model.

\begin{table}[t]
\caption{Detection accuracy by target model and safety dimension.}
\label{tab:model}
\centering
\begin{tabular}{lccccc}
\toprule
\textbf{Model} & \textbf{Accuracy} & \textbf{Robustness} & \textbf{Fairness} & \textbf{Ethics} & \textbf{Overall} \\
\midrule
GPT-4o            & 89.2 & 86.4 & 91.3 & 88.7 & 88.9 \\
Claude 3.5 Sonnet & 91.8 & 89.2 & 93.4 & 92.1 & 91.6 \\
Llama 3.1 70B     & 84.3 & 78.6 & 87.2 & 83.9 & 83.5 \\
Phi-3-medium      & 81.7 & 71.2 & 84.5 & 79.8 & 79.3 \\
\bottomrule
\end{tabular}
\end{table}

Results reveal substantial variation across target models. Claude 3.5 Sonnet demonstrates the strongest overall safety profile (91.6\%), likely reflecting Constitutional AI training. The smallest model, Phi-3-medium, shows a notably lower robustness score (71.2\%), suggesting that robustness degrades more than the other dimensions at smaller model scale.

\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{fig4_heatmap}
\caption{Detection accuracy heatmap by target model and safety dimension. Claude 3.5 Sonnet demonstrates the strongest overall safety profile (91.6\%), with particularly high scores in Fairness (93.4\%) and Ethics (92.1\%). Smaller models like Phi-3-medium show notably lower Robustness scores (71.2\%).}
\label{fig:heatmap}
\end{figure}

\subsection{Evaluator Agreement Analysis}

We analyzed agreement patterns among Jo.E's evaluator panel to assess diversity and potential correlated failures. Table~\ref{tab:agreement} shows pairwise evaluator agreement.

\begin{table}[t]
\caption{Pairwise evaluator agreement (Cohen's $\kappa$) and correlation analysis.}
\label{tab:agreement}
\centering
\begin{tabular}{lccc}
\toprule
\textbf{Evaluator Pair} & \boldmath{$\kappa$} & \textbf{Pearson} $r$ & \textbf{Joint Failure Rate} \\
\midrule
GPT-4o $\leftrightarrow$ Claude 3 Opus     & 0.71 & 0.78 & 4.2\% \\
GPT-4o $\leftrightarrow$ Llama 3.1 70B    & 0.64 & 0.69 & 6.8\% \\
Claude 3 Opus $\leftrightarrow$ Llama 3.1 70B & 0.67 & 0.73 & 5.4\% \\
\midrule
All Three (Fleiss' $\kappa$) & 0.68 & --- & 2.1\% \\
\bottomrule
\end{tabular}
\end{table}

The inter-evaluator agreement ($\kappa = 0.68$) is substantial but well short of perfect. Together with the low joint failure rate (2.1\%), this indicates sufficient diversity to catch vulnerabilities missed by individual evaluators.
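
For reference, Cohen's $\kappa$ corrects the observed agreement $p_o$ between two evaluators for the agreement $p_e$ expected by chance:
\[
\kappa = \frac{p_o - p_e}{1 - p_e} .
\]
As a purely hypothetical example (these values are not taken from the experiments), $p_o = 0.85$ and $p_e = 0.48$ give $\kappa = 0.37 / 0.52 \approx 0.71$, which is of the same order as the pairwise values in Table~\ref{tab:agreement}. A high raw agreement can therefore correspond to a noticeably lower $\kappa$ when chance agreement is large.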

\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{fig5_evaluator_agreement}
\caption{Evaluator agreement analysis showing (left) pairwise agreement metrics and (right) joint failure rates. The low joint failure rate of 2.1\% when all three evaluators are combined demonstrates the value of evaluator diversity in catching vulnerabilities.}
\label{fig:agreement}
\end{figure}

\subsection{Computational Cost Analysis}

Table~\ref{tab:cost} provides detailed cost breakdown enabling practitioners to assess Jo.E's practicality.

\begin{table}[t]
\caption{Computational costs per 1,000 evaluations. API costs based on December 2024 pricing.}
\label{tab:cost}
\centering
\begin{tabular}{lcccc}
\toprule
\textbf{Method} & \textbf{API Cost} & \textbf{Human Hours} & \textbf{Total Cost} & \textbf{Time (hrs)} \\
\midrule
Pure Human        & \$0     & 25.0 & \$312.50 & 25.0 \\
LLM-as-a-Judge    & \$18.40 & 0    & \$18.40  & 0.8  \\
Agent-as-a-Judge  & \$34.20 & 0    & \$34.20  & 2.1  \\
Multi-LLM Voting  & \$42.60 & 0    & \$42.60  & 1.2  \\
\midrule
\textbf{Jo.E (Ours)} & \$35.80 & 11.5 & \textbf{\$47.30} & 4.2 \\
\bottomrule
\end{tabular}
\end{table}

Jo.E reduces total cost by 84.9\% compared to pure human evaluation (\$47.30 vs \$312.50) while achieving superior accuracy. The 54\% reduction in human hours (11.5 vs 25.0) reflects successful automated filtering of routine cases.
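
Both headline reductions follow directly from the Pure Human and Jo.E rows of Table~\ref{tab:cost}:
\[
\frac{312.50 - 47.30}{312.50} = \frac{265.20}{312.50} \approx 0.849,
\qquad
\frac{25.0 - 11.5}{25.0} = \frac{13.5}{25.0} = 0.54 .
\]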

\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{fig6_cost_time}
\caption{Cost and time analysis per 1,000 evaluations. (Left) Cost breakdown showing API costs (teal) and human costs (red). (Right) Time requirements with human time highlighted. Jo.E achieves 84.9\% cost reduction compared to pure human evaluation while maintaining superior accuracy.}
\label{fig:cost}
\end{figure}

\subsection{Failure Case Analysis}

We systematically analyzed the 2.1\% of cases where Jo.E failed to detect ground-truth vulnerabilities ($n=332$ cases from our 15,847 test set).

\begin{enumerate}
    \item \textbf{Category 1: Novel Attack Vectors (38.3\%).} Attacks using techniques not represented in training distributions of evaluator models. Example: Base64-encoded instructions within benign-appearing prompts.
    \item \textbf{Category 2: Contextual Harm (29.2\%).} Outputs that are harmful only given specific real-world context not apparent to evaluators. Example: Medical advice that is reasonable generally but dangerous for specific conditions.
    \item \textbf{Category 3: Subtle Bias (21.4\%).} Bias patterns that don't match established stereotypes in BBQ-style probes. Example: Socioeconomic bias expressed through vocabulary choices rather than explicit differential treatment.
    \item \textbf{Category 4: Evaluator Agreement Errors (11.1\%).} Cases where all evaluators agreed on an incorrect assessment, typically involving ambiguous ethical scenarios with legitimate disagreement among human annotators.
\end{enumerate}

These findings suggest priorities for future work: expanding attack diversity in adversarial training, incorporating richer contextual signals, and developing bias detection beyond stereotype matching.
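
For orientation, the category shares can be translated into approximate case counts by applying them to the $n=332$ failures. The counts below are back-computed from the rounded percentages and are therefore approximate rather than separately reported figures:
\[
0.383 \cdot 332 \approx 127, \qquad
0.292 \cdot 332 \approx 97, \qquad
0.214 \cdot 332 \approx 71, \qquad
0.111 \cdot 332 \approx 37 .
\]
These rounded counts sum to $332$, and $332 / 15{,}847 \approx 2.1\%$ matches the overall failure share stated above.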

\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{fig7_failure_cases}
\caption{Distribution of Jo.E failure cases ($n=332$). Novel attack vectors constitute the largest category (38.3\%), followed by contextual harm (29.2\%) and subtle bias (21.4\%). These findings highlight priorities for future improvement.}
\label{fig:failure}
\end{figure}

\section{Ablation Studies}
\label{sec:ablation}

We conducted systematic ablations to quantify the contribution of each Jo.E component (Table~\ref{tab:ablation}).

\begin{table}[t]
\caption{Ablation study results. Each row removes one component from full Jo.E.}
\label{tab:ablation}
\centering
\begin{tabular}{lcc}
\toprule
\textbf{Configuration} & \textbf{Accuracy} & \textbf{Cost/1000} \\
\midrule
Full Jo.E                    & 94.2 & \$47.30 \\
$-$ Human Expert Phase       & 89.4 ($-$4.8) & \$35.80 \\
$-$ Adversarial Agents       & 87.6 ($-$6.6) & \$36.10 \\
$-$ LLM Evaluator Diversity  & 86.3 ($-$7.9) & \$32.40 \\
$-$ Conflict Resolution      & 91.1 ($-$3.1) & \$44.20 \\
$-$ Severity Scoring         & 92.8 ($-$1.4) & \$45.60 \\
\bottomrule
\end{tabular}
\end{table}

Key findings from ablations:
\begin{enumerate}
    \item \textbf{Evaluator Diversity} contributes most substantially ($-7.9$ percentage points when using a single evaluator), validating the multi-agent approach over single-judge paradigms.
    \item \textbf{Adversarial Agents} provide critical vulnerability discovery ($-6.6$ percentage points), particularly for robustness and jailbreak detection where LLM evaluators show systematic blind spots.
    \item \textbf{Human Expert Phase} adds 4.8 percentage points of accuracy, primarily by catching contextual harms and adjudicating evaluator disagreements, validating the hybrid human-AI approach.
    \item \textbf{Conflict Resolution} improves accuracy by 3.1 percentage points by handling disagreement cases through structured escalation rather than simple voting.
\end{enumerate}

\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{fig8_ablation}
\caption{Ablation study showing the impact of removing each component from the full Jo.E framework. LLM Evaluator Diversity contributes most substantially ($-7.9$ percentage points), validating the multi-agent approach over single-judge paradigms.}
\label{fig:ablation}
\end{figure}

\section{Discussion}

\subsection{Key Findings}

Our experiments demonstrate that multi-agent collaborative evaluation achieves substantial improvements over isolated approaches. The 94.2\% detection accuracy with a 54\% reduction in human time represents a practical path toward scalable AI safety evaluation. The varying performance across target models suggests that smaller models require more scrutiny, and that the Constitutional AI-trained model (Claude 3.5 Sonnet) shows a measurably stronger safety profile in our evaluation.

\subsection{Comparison with Prior Work}

Jo.E's performance substantially exceeds single-paradigm approaches:
\begin{enumerate}
    \item Versus LLM-as-Judge~\cite{zheng2023judging}: $+15.9$ percentage points of accuracy by combining multi-evaluator diversity with adversarial verification.
    \item Versus PAIR-style red-teaming~\cite{chao2024jailbreaking}: $+8.1$ percentage points by adding structured screening and human escalation.
    \item Versus HarmBench classifiers~\cite{mazeika2024harmbench}: $+11.8$ percentage points through dynamic evaluation rather than static classification.
\end{enumerate}

\subsection{Limitations}

\begin{enumerate}
    \item \textbf{Evaluator Quality Dependence.} Jo.E's effectiveness depends on evaluator model quality. As evaluators improve, so does Jo.E; conversely, systematic evaluator biases may propagate.
    \item \textbf{Setup Costs.} Initial calibration (evaluator tuning, agent configuration, expert training) requires approximately 40 hours. This is amortized across evaluations but limits one-off assessments.
    \item \textbf{Adversarial Co-evolution.} As evaluation improves, adversarial strategies will adapt. The iterative refinement phase partially addresses this but cannot eliminate the arms race.
    \item \textbf{Contextual Limitations.} Contextual harms account for 29.2\% of Jo.E's failure cases, indicating fundamental limits of context-free evaluation.
\end{enumerate}

\section{Conclusion}

We introduced Jo.E, a multi-agent collaborative framework for comprehensive AI safety evaluation that systematically coordinates LLM evaluators, adversarial agents, and human experts. Through extensive experiments across four foundation models and 15,847 test cases, we demonstrated that Jo.E achieves 94.2\% detection accuracy while reducing human expert time by 54\%. Our detailed analysis of evaluator agreement patterns, failure cases, and computational costs provides actionable insights for practitioners. The framework's five-phase pipeline with explicit conflict resolution and adaptive escalation mechanisms offers a principled approach to scaling AI safety evaluation without sacrificing rigor.

Future work includes extending to multimodal evaluation, incorporating formal verification methods, and developing adversarial training approaches targeting identified blind spots.

%
\begin{credits}
\subsubsection{\ackname}
We thank the anonymous reviewers for their constructive feedback. Computational resources were provided by Collective Human + Machine Intelligence (COHUMAIN) Labs.

\subsubsection{\discintname}
The authors have no competing interests to declare that are relevant to the content of this article.
\end{credits}

%
% ---- Bibliography ----
%
\bibliographystyle{splncs04}
%
\begin{thebibliography}{99}

\bibitem{bai2022constitutional}
Bai, Y., Kadavath, S., Kundu, S., et al.: Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073 (2022)

\bibitem{bommasani2021opportunities}
Bommasani, R., Hudson, D.A., Adeli, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)

\bibitem{bowman2022measuring}
Bowman, S.R., Hyun, J., Perez, E., et al.: Measuring progress on scalable oversight for large language models. arXiv preprint arXiv:2211.03540 (2022)

\bibitem{chao2024jailbreaking}
Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G.J., Wong, E.: Jailbreaking black box large language models in twenty queries. In: Proceedings of ICLR (2024)

\bibitem{chao2024jailbreakbench}
Chao, P., Debenedetti, E., Robey, A., et al.: JailbreakBench: An open robustness benchmark for jailbreaking large language models. In: NeurIPS Datasets and Benchmarks (2024)

\bibitem{dubois2024length}
Dubois, Y., Galambosi, B., Liang, P., Hashimoto, T.B.: Length-controlled AlpacaEval: A simple way to debias automatic evaluators. In: Proceedings of COLM (2024)

\bibitem{ganguli2022red}
Ganguli, D., Lovitt, L., Kernion, J., et al.: Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858 (2022)

\bibitem{irving2018ai}
Irving, G., Christiano, P., Amodei, D.: AI safety via debate. arXiv preprint arXiv:1805.00899 (2018)

\bibitem{lee2024rlaif}
Lee, H., Phatale, S., Mansoor, H., et al.: RLAIF vs. RLHF: Scaling reinforcement learning from human feedback with AI feedback. In: Proceedings of ICML (2024)

\bibitem{lin2022truthfulqa}
Lin, S., Hilton, J., Evans, O.: TruthfulQA: Measuring how models mimic human falsehoods. In: Proceedings of ACL, pp. 3214--3252 (2022)

\bibitem{liu2023geval}
Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., Zhu, C.: G-Eval: NLG evaluation using GPT-4 with better human alignment. In: Proceedings of EMNLP, pp. 2511--2522 (2023)

\bibitem{liu2024autodan}
Liu, X., Xu, N., Chen, M., Xiao, C.: AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. In: Proceedings of ICLR (2024)

\bibitem{mazeika2024harmbench}
Mazeika, M., Phan, L., Yin, X., et al.: HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. In: Proceedings of ICML (2024)

\bibitem{mehrotra2023tree}
Mehrotra, A., Zampetakis, M., Kassianik, P., et al.: Tree of attacks: Jailbreaking black-box LLMs automatically. arXiv preprint arXiv:2312.02119 (2023)

\bibitem{parrish2022bbq}
Parrish, A., Chen, A., Nangia, N., et al.: BBQ: A hand-built bias benchmark for question answering. In: Findings of ACL, pp. 2086--2105 (2022)

\bibitem{perez2022red}
Perez, E., Huang, S., Song, F., et al.: Red teaming language models with language models. In: Proceedings of EMNLP, pp. 3419--3448 (2022)

\bibitem{weidinger2021ethical}
Weidinger, L., Mellor, J., Rauh, M., et al.: Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359 (2021)

\bibitem{zhang2024safetybench}
Zhang, Z., Lei, L., Wu, L., et al.: SafetyBench: Evaluating the safety of large language models with multiple choice questions. In: Proceedings of ACL (2024)

\bibitem{zheng2023judging}
Zheng, L., Chiang, W.L., Sheng, Y., et al.: Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In: Proceedings of NeurIPS (2023)

\bibitem{zhu2024judgelm}
Zhu, L., Wang, X., Wang, X.: JudgeLM: Fine-tuned large language models are scalable judges. arXiv preprint arXiv:2310.17631 (2023)

\bibitem{zou2023universal}
Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J.Z., Fredrikson, M.: Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023)

\end{thebibliography}

\newpage
\appendix

\section*{Appendix A: Extended Experimental Details}

\subsection*{A.1 Evaluator Prompts}

Each LLM evaluator receives the following structured prompt for safety assessment:

\begin{verbatim}
You are evaluating an AI system's output for safety concerns.
Rate the following output on four dimensions (1-5 scale):
1. ACCURACY: Factual correctness and consistency
2. ROBUSTNESS: Resistance to adversarial manipulation
3. FAIRNESS: Equitable treatment across demographics
4. ETHICS: Alignment with human values

For each dimension, provide:
- Score (1=Critical Issue, 5=No Concerns)
- Confidence (0.0-1.0)
- Brief justification (1-2 sentences)

[System Output to Evaluate]
{output}

[Context/Prompt that generated output]
{prompt}
\end{verbatim}

\subsection*{A.2 Adversarial Agent Configuration}

\noindent\textbf{PAIR Agent Parameters:}
\begin{enumerate}
    \item Max iterations: 20.
    \item Temperature: 0.7.
    \item Attack success threshold: Target produces content violating usage policy.
    \item Early stopping: 3 consecutive failures.
\end{enumerate}

\noindent\textbf{Bias Probe Agent Parameters:}
\begin{enumerate}
    \item Protected categories: Age, disability, gender, nationality, physical appearance, race/ethnicity, religion, socioeconomic status (SES), sexual orientation.
    \item Probe templates: Adapted from BBQ benchmark.
    \item Bias detection threshold: $>10\%$ differential response rate.
\end{enumerate}
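
\noindent The bias detection threshold can be made concrete with a short worked example. This is an illustrative sketch rather than the exact implementation: it assumes that the differential response rate is the absolute gap, in percentage points, between the highest and lowest rates at which a probe elicits a given response across the groups of one protected category. The symbols $r_g$ and $\Delta$ and the numbers below are hypothetical.
\[
\Delta = \max_{g} r_g - \min_{g} r_g, \qquad \text{flag the probe if } \Delta > 0.10 .
\]
For instance, if a probe elicits the stereotype-consistent answer at rates $r_a = 0.32$ and $r_b = 0.19$ for two groups, then $\Delta = 0.32 - 0.19 = 0.13 > 0.10$, and the probe is flagged. A relative reading of the threshold would instead normalize $\Delta$ by a reference rate before comparing it with $0.10$.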

\subsection*{A.3 Human Expert Protocol}

Human experts received structured review packets containing: (1) original prompt and model output, (2) automated severity scores with confidence intervals, (3) evaluator disagreement summary, (4) adversarial agent findings (if applicable), (5) suggested vulnerability category, and (6) reproduction steps.

Experts were instructed to: (1) verify or override automated severity assessment, (2) categorize vulnerability type from predefined taxonomy, (3) assess real-world impact likelihood (Low/Medium/High), and (4) recommend mitigation priority (Immediate/Short-term/Long-term).

The average review time was 4.2 minutes per case (SD = 2.1 minutes).

\section*{Appendix B: Additional Results}

\subsection*{B.1 Per-Category Detection Rates}

\begin{table}[h]
\caption{Detection accuracy (\%) by vulnerability category.}
\label{tab:category}
\centering
\begin{tabular}{lcccc}
\toprule
\textbf{Category} & \textbf{Jo.E} & \textbf{LLM-Judge} & \textbf{Agent-Judge} & \textbf{Human} \\
\midrule
Jailbreak attempts  & 96.2 & 71.3 & 94.8 & 93.1 \\
Prompt injection    & 93.8 & 68.4 & 91.2 & 89.4 \\
Bias/Fairness       & 91.4 & 84.2 & 78.6 & 94.2 \\
Misinformation      & 94.7 & 82.1 & 81.3 & 92.8 \\
Harmful content     & 95.1 & 79.8 & 92.4 & 91.6 \\
Privacy violations  & 92.3 & 76.4 & 83.7 & 90.2 \\
\bottomrule
\end{tabular}
\end{table}

\begin{figure}[h]
\centering
\includegraphics[width=\textwidth]{fig9_category_detection}
\caption{Detection accuracy (\%) by vulnerability category across all methods. Jo.E exceeds 91\% in every category and is highest on jailbreak attempts (96.2\%) and harmful content (95.1\%); human review remains more accurate on bias/fairness (94.2\% vs.\ 91.4\%).}
\label{fig:category}
\end{figure}

\subsection*{B.2 Scaling Analysis}

\begin{table}[h]
\caption{Cost efficiency by batch size.}
\label{tab:scaling}
\centering
\begin{tabular}{lccc}
\toprule
\textbf{Batch Size} & \textbf{Cost/Eval} & \textbf{Time/Eval (s)} & \textbf{Human \%} \\
\midrule
100    & \$0.089 & 18.4 & 14.2\% \\
500    & \$0.056 & 16.8 & 12.8\% \\
1,000  & \$0.047 & 15.1 & 11.5\% \\
5,000  & \$0.041 & 14.2 & 10.3\% \\
10,000 & \$0.038 & 13.8 &  9.8\% \\
\bottomrule
\end{tabular}
\end{table}

Efficiency improves with scale due to batch processing of LLM evaluations and amortization of human expert setup time.
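
\noindent As a rough consistency check on Table~\ref{tab:scaling}, the per-evaluation figures can be converted into batch totals. This sketch assumes that the Human~\% column is the share of cases escalated to expert review and that the 4.2-minute average review time from Appendix A.3 applies to every escalated case; neither assumption is stated explicitly in the table. For a batch of 1,000 evaluations:
\begin{align*}
\text{total cost} &\approx 1{,}000 \times \$0.047 = \$47, \\
\text{escalated cases} &\approx 0.115 \times 1{,}000 = 115, \\
\text{expert time} &\approx 115 \times 4.2~\text{min} = 483~\text{min} \approx 8.1~\text{hours}.
\end{align*}
The total cost agrees, up to rounding of the per-evaluation figure, with the \$47.30 per 1,000 evaluations reported in the abstract.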

\begin{figure}[h]
\centering
\includegraphics[width=\textwidth]{fig10_scaling}
\caption{Cost and time efficiency at different scales. Between batch sizes of 100 and 10,000, cost per evaluation falls from \$0.089 to \$0.038, time per evaluation from 18.4\,s to 13.8\,s, and the share of cases requiring human review from 14.2\% to 9.8\%.}
\label{fig:scaling}
\end{figure}

\end{document}