\section{Evaluation and Results}\label{sec:evaluation}

\input{tables/4-dialects_accuracy}

\subsection{Evaluation Setup}

We conducted a controlled user study with \textbf{25 postpartum women} (mean age = 29.4 $\pm$ 3.1 years; 40\% exhibiting mild PPD symptoms) to evaluate \textbf{empathy, trust, cultural fit, safety, and overall user experience}.
Participants originated from five dialectal regions (\textbf{\NEMcolour{Northeastern Mandarin}, \CANcolour{Cantonese}, \MINcolour{Southern Min}, \CENcolour{Central Plains Mandarin}, \SWMcolour{Southwestern Mandarin}}) and engaged with two systems: (i) \textbf{\AbbrName} (our multi-agent framework), and (ii) a \textbf{Mandarin-only baseline}.
Each participant completed three scenarios (\textit{emotional support}, \textit{informational guidance}, \textit{family communication}).

\AbbrName was implemented as a \textbf{multi-agent coordination framework} operating on the \textbf{DeepSeek-V3.1 API}~\cite{deepseekai2024deepseekv3technicalreport}.
Five role-specialized agents (\textit{Psychologist, Linguist, Teacher, Mother,} and \textit{AI Researcher}) each produced a role-conditioned response, and these responses were combined via adaptive aggregation.
The baseline used the same API in a single-agent configuration.
Decoding parameters were fixed across systems (temperature = 0.7, max\_tokens = 512).
A supplementary validation with \textbf{GPT-4o}~\cite{openai2024gpt4ocard} yielded comparable gains in empathy (+18\%) and cultural fit (+31\%), suggesting that the benefits are \textit{model-agnostic} rather than specific to one backbone model.

All procedures followed low-risk mental-health research protocols.
Participants provided written consent and were reminded that the system is \textit{non-diagnostic}.
Sessions containing self-harm cues were terminated automatically, and participants were shown hotline information.

\subsection{Metrics}

\textbf{Human-rated metrics} spanned seven 5-point Likert dimensions: \textit{Empathy, Comfort, Clarity, Cultural Fit, Trust, Safety,} and \textit{Helpfulness}.

\textbf{Automatic metrics} included: Dialectal Authenticity (BERTScore against a regional corpus using Chinese RoBERTa-large~\cite{zhang2020bertscore}), Idiomatic Accuracy (frequency-weighted idiom matching with expert validation), Toxicity Probability (Detoxify-multilingual~\cite{hanu2020detoxify}), NLI Consistency (entailment ratio via a RoBERTa NLI classifier), and Response Latency (mean generation time).
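
For concreteness, the entailment ratio underlying NLI Consistency can be written as follows. This is a minimal formalisation under assumed notation: the set $\mathcal{P}$ of premise--hypothesis pairs and the classifier output $\hat{y}(p,h)$ are illustrative symbols, and the exact pairing scheme (e.g., dialogue context versus generated response) is an assumption rather than a specification.
\[
\mathrm{NLI}_{\text{cons}} = \frac{1}{|\mathcal{P}|} \sum_{(p,h)\in\mathcal{P}} \mathbf{1}\bigl[\hat{y}(p,h) = \text{entailment}\bigr]
\]
The score lies in $[0,1]$, and higher values mean that a larger share of the evaluated pairs are labelled as entailed.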

We further report a composite quality score that aggregates five subjective dimensions of response quality.
Let $f_1, \dots, f_5 \in [0,1]$ denote the normalised scores for \emph{Empathy}, \emph{Cultural Fit}, \emph{Trust}, \emph{Comfort}, and \emph{Helpfulness}, respectively.\footnote{All 5-point Likert ratings are linearly rescaled to $[0,1]$.}
We define
\[C_{\text{score}} = \sum_{i=1}^{5} w_i f_i\]
where the weights $\mathbf{w} = [0.25,\, 0.20,\, 0.15,\, 0.25,\, 0.15]$ assign higher importance to perceived empathy and comfort, reflecting clinical guidance on therapeutic alliance in postpartum mental health support.
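
As a worked illustration, consider a hypothetical participant (the ratings below are invented for the example and are not study data) who gives Likert ratings $r = (4, 5, 3, 4, 4)$ for Empathy, Cultural Fit, Trust, Comfort, and Helpfulness. Assuming the linear rescaling takes the form $f_i = (r_i - 1)/4$, which maps the scale endpoints $1$ and $5$ to $0$ and $1$, the normalised scores are $(0.75,\, 1.00,\, 0.50,\, 0.75,\, 0.75)$ and
\[
\begin{aligned}
C_{\text{score}} &= 0.25(0.75) + 0.20(1.00) + 0.15(0.50) + 0.25(0.75) + 0.15(0.75) \\
&= 0.1875 + 0.2000 + 0.0750 + 0.1875 + 0.1125 = 0.7625 .
\end{aligned}
\]
Because the weights sum to one and each $f_i \in [0,1]$, $C_{\text{score}}$ is itself bounded in $[0,1]$.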

\subsection{Experimental Design}

We adopted a \textbf{within-subject} design with counterbalanced system order.
Each 10-minute dialogue consisted of 8--10 turns.
After each scenario, participants completed ratings; system logs recorded latency, toxicity, and dialectal features.
All data were anonymized for quantitative and qualitative analysis.

\input{tables/4-results}

\subsection{Dialect Identification Accuracy}
Before the user study, we evaluated the intrinsic performance of the Linguistic Grounding module on a curated set of 413 dialectal utterances covering five regional varieties.
As shown in Table~\ref{tab:dialect-accuracy}, the module achieved an overall accuracy of \textbf{88.5\%}, with the strongest performance on \CANcolour{Cantonese} (98.9\%) and \NEMcolour{Northeastern Mandarin} (94.7\%).
Accuracy was intermediate for \CENcolour{Central Plains Mandarin} (90.0\%) and \SWMcolour{Southwestern Mandarin} (87.0\%) and lowest for \MINcolour{Southern Min} (72.1\%), reflecting differences in orthographic variation and in the degree of overlap with written Mandarin.
The main limitation concerns varieties with weak written conventions, for which additional multimodal or phonological cues may improve robustness.
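
As a consistency check on the reported figures, the unweighted mean of the five per-variety accuracies is
\[
\frac{98.9 + 94.7 + 72.1 + 90.0 + 87.0}{5} = \frac{442.7}{5} = 88.54\%,
\]
which agrees with the overall accuracy of 88.5\% to one decimal place. A count-weighted average would coincide with this value only to the extent that the five varieties are evenly represented among the 413 utterances; the per-variety counts are not restated here, so this calculation should be read as a sanity check rather than as the definition of the overall figure.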

\subsection{Results}


Normality of paired differences was verified with the Shapiro--Wilk test prior to applying paired $t$-tests.
As shown in Table~\ref{tab:results}, \AbbrName consistently outperformed the Mandarin-only baseline across both subjective and automatic metrics ($p < 0.05$). Improvements were most pronounced in \textit{Empathy} (+21\%) and \textit{Cultural Fit} (+39\%), both with large effect sizes ($d = 1.38$ and $d = 1.42$, respectively). Trust, Comfort, and Helpfulness also improved by 15--16\% ($p < 0.05$), consistent with stronger emotional resonance and user engagement.
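
To relate the reported effect sizes to the paired tests, the following sketch assumes that $d$ is the standardised mean of the paired differences and that ratings are aggregated to one value per participant and system. Both are assumptions made for illustration; other definitions of $d$ yield different values. With $\bar{D}$ and $s_D$ denoting the mean and standard deviation of the per-participant differences and $n = 25$,
\[
d = \frac{\bar{D}}{s_D}, \qquad t = \frac{\bar{D}}{s_D/\sqrt{n}} = d\sqrt{n}.
\]
Under these assumptions, $d = 1.38$ would correspond to $t = 1.38 \times 5 = 6.90$ on $n - 1 = 24$ degrees of freedom.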

On automatic metrics, \AbbrName achieved higher BERTScore (+11\%) and better ethical compliance (+7\%), while reducing toxicity probability by more than half (6.3\%~$\rightarrow$~2.8\%; $p < 0.05$). Latency differences were not statistically significant ($p > 0.05$), so the quality gains did not come with a detectable increase in response time.
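
The toxicity figures can be read in both absolute and relative terms. Using the two reported probabilities,
\[
\Delta_{\text{abs}} = 6.3 - 2.8 = 3.5 \text{ percentage points}, \qquad \Delta_{\text{rel}} = \frac{6.3 - 2.8}{6.3} \approx 0.556,
\]
that is, a relative reduction of about 55.6\%, consistent with the description of toxicity being reduced by more than half. The symbols $\Delta_{\text{abs}}$ and $\Delta_{\text{rel}}$ are introduced here only for this illustration.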

Overall, these findings indicate that \AbbrName delivers \textbf{emotionally aligned, culturally inclusive, and ethically reliable} dialogue generation for postpartum support.
