\section{Experiments and Results}

\subsection{Experimental Setup \& Metrics}
All experiments use \texttt{gpt-4.1-mini-2025-04-14} (temperature 0.4) as the backbone agent. To reflect the most common social units in travel, ranging from couples to nuclear families, we use the augmented persona data to vary the group size from 2 to 4 agents per scenario. We define five metrics to evaluate group decision-making across 201 negotiation scenarios. Let $C$ denote the set of negotiation cases ($|C| = 201$), with $c \in C$ indexing a specific case, and let $A$ denote the agent set. For agent $i \in A$ in case $c$, $v_{i,c}$ is the initial preference and $w_{i,c}$ is the willingness score; $V_c$ is the final agreement for case $c$.

\begin{itemize}
    \item \textbf{Total Fidelity ($F$):} The average proportion of individual preferences preserved in the final agreement across all participants.
    \begin{equation}
        F = \frac{1}{|A| \cdot |C|} \sum_{c \in C} \sum_{i \in A} \mathbf{1}(v_{i,c} = V_c)
    \end{equation}

    \item \textbf{Debate Hit-Rate (DHR)}: Measures whether the High-$w$ agent's opinion prevailed within voluntary debates ($C_{debate}$), indicating strategic efficiency. Here, $Top(c)$ denotes the High-$w$ agent(s) in case $c$.
    \begin{equation}
    DHR = \frac{1}{|C_{debate}|} \sum_{c \in C_{debate}} \mathbf{1}(\exists i \in Top(c) : v_{i,c} = V_c)
    \end{equation}
    

    \item \textbf{Debate Ratio (DR)}: The ratio of total negotiation items where a voluntary agreement was reached through agent deliberation without resorting to forced fallback mechanisms.
    \begin{equation}
    DR = \frac{|C_{debate}|}{|C|}
    \end{equation}

    \item \textbf{Total Satisfaction ($S_{total}$)}: The sum of weighted satisfaction scores of all agents in the group, representing the overall social welfare.
    \begin{equation}
    S_{total} = \sum_{i \in A} \sum_{c \in C} (w_{i,c} \cdot \mathbf{1}(v_{i,c} = V_c))
    \end{equation}
    
    \item \textbf{Fairness ($\mathcal{J}$)}: We use Jain's Fairness Index \citep{jain1984quantitative} to measure the distributional equity of the weighted satisfaction sum $S_i$ per agent, defined as $S_i = \sum_{c \in C} (w_{i,c} \cdot \mathbf{1}(v_{i,c} = V_c))$. A value closer to 1 indicates that satisfaction is distributed more evenly across the group.
    \begin{equation}
    \mathcal{J} = \frac{(\sum_{i \in A} S_i)^2}{|A| \cdot \sum_{i \in A} S_i^2}
    \end{equation}
\end{itemize}
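\paragraph{Worked example (illustrative).} To make the definitions above concrete, consider a hypothetical group of three agents and two cases; the numbers are invented for illustration and are not drawn from the experiments. Suppose that in case $c_1$ the willingness scores are $(9, 3, 2)$ and only agent 1's preference is adopted, while in case $c_2$ they are $(2, 6, 6)$ and the preference shared by agents 2 and 3 is adopted. Then $S_1 = 9$, $S_2 = 6$, $S_3 = 6$, and
\[
F = \frac{1 + 2}{3 \cdot 2} = 0.5, \qquad
S_{total} = 9 + 6 + 6 = 21, \qquad
\mathcal{J} = \frac{21^2}{3 \cdot (9^2 + 6^2 + 6^2)} = \frac{441}{459} \approx 0.961 .
\]
For reference, $\mathcal{J} = 1$ when all $S_i$ are equal, and $\mathcal{J} = 1/|A|$ when a single agent holds all of the weighted satisfaction.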

\paragraph{ToM Inference Accuracy.} To evaluate the cognitive foundation of the Strategic Appraisal phase, we measure the error between the inferred willingness ($w_{pred}$) and the ground truth ($w_{true}$).
    \begin{itemize}
        \item \textbf{Mean Absolute Error (MAE)}: $\frac{1}{N} \sum |w_{true} - w_{pred}|$, measuring the average magnitude of estimation errors.
        \item \textbf{Accuracy within $\pm\delta$}: The proportion of inferences where $|w_{true} - w_{pred}| \le \delta$. We report for $\delta=1$ and $\delta=2$ to assess the model's proximity to actual intent.
        \item \textbf{Pearson Correlation ($r$)}: Measures the linear relationship between true and predicted $w$ to evaluate the model's ability to capture willingness trends.
    \end{itemize}
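\paragraph{Worked example (illustrative).} As a sketch of how the three ToM metrics behave, take four hypothetical inference instances (values invented for illustration) with $w_{true} = (8, 3, 5, 9)$ and $w_{pred} = (7, 3, 7, 6)$. The absolute errors are $(1, 0, 2, 3)$, so
\[
\mathrm{MAE} = \frac{1 + 0 + 2 + 3}{4} = 1.5, \qquad
\mathrm{Acc}(\pm 1) = \frac{2}{4} = 50\%, \qquad
\mathrm{Acc}(\pm 2) = \frac{3}{4} = 75\% .
\]
Writing $\bar{w}_{true} = 6.25$ and $\bar{w}_{pred} = 5.75$ for the sample means, the Pearson correlation is
\[
r = \frac{\sum (w_{true} - \bar{w}_{true})(w_{pred} - \bar{w}_{pred})}{\sqrt{\sum (w_{true} - \bar{w}_{true})^2 \, \sum (w_{pred} - \bar{w}_{pred})^2}}
  = \frac{10.25}{\sqrt{22.75 \cdot 10.75}} \approx 0.66 .
\]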


\paragraph{Qualitative Evaluation (LLM-as-a-Judge).} We use \texttt{gpt-4.1-2025-04-14} to evaluate the linguistic and strategic quality of dialogs in terms of Rationality (logical consistency), Fluency (naturalness), and Overall preference.

\subsection{Results and Analysis}

\paragraph{Quantitative Performance \& Strategic Trade-off.}
As shown in Table~\ref{tab:quantitative}, MIND shows a clear strategic advantage, recording 35.08\% in High-$w$ Hit (+20.5\%) and 34.65\% in Debate Hit-Rate (+30.7\% relative to Base's 26.51\%). Notably, the Debate Ratio reaches 93.18\% (vs. 82.71\% for Base), indicating that most agreements were reached through deliberation rather than forced fallback. The trade-off is a slightly lower Total Fidelity (23.87\% vs. 25.80\%) with essentially unchanged Fairness (0.6838 vs. 0.6849), while $S_{total}$ rises from 18.03 to 19.96.
The High-$w$ Hit increase supports our Willingness-Weighted Efficiency. Unlike mechanical averaging, MIND agents yield low-priority items ($w \le 4$) to secure high-priority constraints ($w \ge 8$), avoiding the ``tyranny of the average'' by prioritizing essential needs through strategic deliberation.
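\paragraph{Reading the gains.} As a quick check on the arithmetic, and assuming the parenthetical gain for Debate Hit-Rate is measured relative to the Base value in Table~\ref{tab:quantitative}, the reported figures give
\[
\frac{34.65 - 26.51}{26.51} = \frac{8.14}{26.51} \approx 30.7\%, \qquad
\frac{19.96 - 18.03}{18.03} = \frac{1.93}{18.03} \approx 10.7\%
\]
for Debate Hit-Rate and $S_{total}$, respectively. In absolute terms, Debate Hit-Rate rises by $8.14$ percentage points and the Debate Ratio by $93.18 - 82.71 = 10.47$ percentage points.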

\begin{table}[h]
    \centering
    \caption{Performance Comparison ($N=201$). MIND shows superior strategic efficiency.}
    \label{tab:quantitative}
    \resizebox{\textwidth}{!}{
    \begin{tabular}{lccccc}
        \toprule
        \textbf{Method} & \textbf{Debate Hit-Rate} & \textbf{Debate Ratio} & \textbf{Fairness} & \textbf{Total Fidelity} & \textbf{Total Sat. ($S_{total}$)} \\
        \midrule
        Base & 26.51\% & 82.71\% & 0.6849 & 25.80\% & 18.03 \\
        \textbf{MIND} & \textbf{34.65\%} & \textbf{93.18\%} & 0.6838 & 23.87\% & \textbf{19.96} \\
        \bottomrule
    \end{tabular}
    }
\end{table}

\paragraph{Scalability Analysis.}
Table~\ref{tab:scalability} illustrates the robustness of MIND across varying group sizes (2, 3, and 4 agents). As the number of participants increases, the space of conflicting interests grows rapidly, which typically leads to more deadlocks.
While the Base model's Debate Ratio drops sharply from 89.2\% (2 agents) to 64.5\% (4 agents), MIND maintains a high resolution rate of 88.4\% even with 4 agents. This suggests that the \textit{Strategic Appraisal} mechanism mitigates the cognitive load of multi-party coordination, preventing the negotiation breakdown often seen in standard debate models.

\begin{table}[h]
    \centering
    \caption{Scalability Check: Debate Ratio (\%) by Group Size.}
    \label{tab:scalability}
    \small
    \begin{tabular}{lccc}
        \toprule
        \textbf{Method} & \textbf{2 Agents} & \textbf{3 Agents} & \textbf{4 Agents} \\
        \midrule
        Base & 89.2\% & 82.7\% & 64.5\% \\
        \textbf{MIND (Ours)} & \textbf{96.1\%} & \textbf{93.2\%} & \textbf{88.4\%} \\
        \bottomrule
    \end{tabular}
\end{table}
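\paragraph{Degradation with group size.} Using only the values in Table~\ref{tab:scalability}, the drop in Debate Ratio from 2 to 4 agents can be written out as
\[
\Delta_{\mathrm{Base}} = 89.2 - 64.5 = 24.7 \ \text{points}, \qquad
\Delta_{\mathrm{MIND}} = 96.1 - 88.4 = 7.7 \ \text{points},
\]
so, on these figures, MIND loses roughly a third as many percentage points as Base when the group grows from 2 to 4 agents.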

% --- [Mechanism validation: ToM accuracy] ---
\paragraph{Accuracy of ToM Inference.}
To validate the reliability of our appraisal module, we analyzed 359 individual inference instances collected across the 201 negotiation scenarios. As shown in Table~\ref{tab:tom_stats}, the model achieves an MAE of 1.27, an accuracy of 90.2\% within a margin of $\pm 2$, and a moderately strong correlation ($r=0.69$). This indicates that MIND agents do not guess randomly but decode linguistic willingness signals to inform their strategies.

\begin{table}[h]
    \centering
    \caption{\textbf{ToM Inference Accuracy.} Evaluation of 359 inference instances collected from 201 sessions.}
    \label{tab:tom_stats}
    \small
    \begin{tabular}{lcccc}
    \toprule
    \textbf{Metric} & \textbf{MAE} & \textbf{Pearson ($r$)} & \textbf{Acc ($\pm 1$)} & \textbf{Acc ($\pm 2$)} \\
    \midrule
    \textbf{Value} & 1.27 & 0.69 & 67.7\% & 90.2\% \\
    \bottomrule
    \end{tabular}
\end{table}

% --- [Qualitative evaluation and win-rate analysis] ---
\paragraph{Qualitative \& $w$ Sensitivity Analysis.}
LLM-as-a-Judge evaluation (Table~\ref{tab:qualitative}) shows that the judge prefers MIND over Base in Fluency (72.4\% win rate) and Rationality (68.8\% win rate), with an overall win rate of 68.3\%, suggesting a more constructive negotiation process. Additionally, a human evaluation performed on a sampled subset was consistent with these findings.
Further analysis of win rates by $w$ level illustrates the effect of the Willingness mechanism. In MIND, proposers with Low $w$ (1--3) show a markedly lower win rate (20.8\%) than in Base (43.9\%), indicating a strategy of concession. Conversely, High $w$ (9--10) proposers record a higher win rate of 76.1\% (vs. 66.2\% for Base).


\begin{table}[h]
    \centering
    \caption{Qualitative Win Rate (MIND vs Base). Judges prefer the strategic style.}
    \label{tab:qualitative}
    \resizebox{0.7\columnwidth}{!}{
    \begin{tabular}{lcl}
        \toprule
        \textbf{Metric} & \textbf{Win (MIND)} & \textbf{Key Observation} \\
        \midrule
        Rationality & 68.8\% & Logical arguments via strategic reasoning. \\
        Fluency & 72.4\% & Natural tone adjustment (Tough/Warm). \\
        \textbf{Overall} & \textbf{68.3\%} & \textbf{MIND is preferred for negotiation quality.} \\
        \bottomrule
    \end{tabular}
    }
\end{table}

\subsection{Ablation Analysis: Tone vs. Cognition}
To disentangle the contributions of \textit{Tone Injection} and \textit{Cognitive Appraisal}, we conceptualize two ablation baselines and contrast them with the full framework:

\begin{itemize}
    \item \textbf{Base + Tone Only:} Agents use expressive language (e.g., ``I really want this!'') but lack the appraisal module to read others' priorities. This leads to \textit{Stubborn Deadlocks}, as agents amplify their own demands without recognizing when to yield.
    \item \textbf{Base + Appraisal Only:} Agents infer opponent willingness but lack the linguistic range to signal their own. This leads to \textit{Silent Submission}, where agents yield efficiently but fail to defend their own high-priority items.
    \item \textbf{Base + Tone + Appraisal (MIND):} Our full framework integrates both components, achieving a synergy where \textit{Tone} serves as the signal and \textit{Appraisal} acts as the decoding mechanism. This enables \textit{Strategic Negotiation}, allowing agents to effectively defend high-priority constraints while yielding on minor items, thereby maximizing both individual satisfaction and collective efficiency.
\end{itemize}