Beyond Fluent Replies: A Corrected 500-Scenario Evaluation of Prelude as a Conversation Decision System for Emotionally Loaded Romantic Conversation

Mariia Yakovleva

Published: 08 May 2026, Last Modified: 08 May 2026 | OpenReview Archive Direct Upload | Everyone | CC BY-NC-ND 4.0

Abstract: Emotionally loaded romantic conversations are not simple text-generation tasks. They require strategic communication decisions under emotional pressure. This paper reports a corrected 500-scenario evaluation of Prelude, a domain-specific conversation decision system, against ChatGPT, Claude, and Gemini. A post-publication audit of the initial evaluation identified a response-generation flaw: the Claude baseline contained only 25 unique outputs repeated across 500 scenarios. We therefore regenerated Claude outputs so that each scenario received a unique response and reran the full blinded evaluation pipeline. For each scenario, the four model responses were anonymized as A/B/C/D, evaluated by Gemini 2.5 Flash using a seven-dimension weighted rubric, and later mapped back to real model identities using a private randomization key. In the corrected evaluation, Prelude achieved the highest average weighted score, 7.4171, followed by Gemini at 7.0576, ChatGPT at 6.8648, and Claude at 5.4911. Prelude was selected as the winning response in 236 of 500 scenarios. Sensitivity analysis showed that Prelude remained ranked first under all five structured weighting schemes and in 100% of 10,000 simulations that randomly changed the rubric weights. A paired comparison against Gemini showed a mean advantage of 0.3595 points, 95% CI [0.2330, 0.4861], with both paired t-test and Wilcoxon signed-rank tests significant at p < 0.000001. These results support the potential usefulness of domain-specific conversation decision systems under a structured rubric-based evaluation, while remaining limited by automated judging, simulated scenarios, and the absence of real-world relationship outcome measures.
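The blinding, weighted-rubric scoring, and random-weight sensitivity steps described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's pipeline. All function names are hypothetical, the rubric dimensions are unnamed placeholders, the per-dimension scores are synthetic toy values rather than the reported data, and drawing weights uniformly from the simplex is an assumption, since the abstract does not say how the random weights were generated.

```python
import random

MODELS = ["Prelude", "ChatGPT", "Claude", "Gemini"]
LABELS = ["A", "B", "C", "D"]
N_DIMS = 7  # seven rubric dimensions, per the abstract; names are not given


def blind(responses, rng):
    """Assign responses to labels A-D in random order.

    Returns (blinded, key): `blinded` is what a judge would see,
    `key` is the private label -> model mapping used to de-anonymize later.
    """
    order = list(responses)
    rng.shuffle(order)
    key = dict(zip(LABELS, order))
    blinded = {label: responses[model] for label, model in key.items()}
    return blinded, key


def weighted_score(dim_scores, weights):
    """Weighted average of per-dimension scores (weights normalized to sum to 1)."""
    total = sum(weights)
    return sum(s * w for s, w in zip(dim_scores, weights)) / total


def random_weights(rng, n=N_DIMS):
    """Assumed scheme: sample weights uniformly from the simplex
    by normalizing independent exponential draws."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def first_place_share(mean_dim_scores, target, n_sims, rng):
    """Fraction of random weightings under which `target` has the top score."""
    wins = 0
    for _ in range(n_sims):
        w = random_weights(rng)
        totals = {m: weighted_score(s, w) for m, s in mean_dim_scores.items()}
        if max(totals, key=totals.get) == target:
            wins += 1
    return wins / n_sims


rng = random.Random(0)

# Blinding: the judge would receive only `blinded`; `key` stays private.
blinded, key = blind({m: f"<{m} reply>" for m in MODELS}, rng)

# Synthetic per-dimension mean scores (toy values, NOT the paper's results).
toy_scores = {
    "Prelude": [8, 8, 7, 8, 7, 8, 7],
    "Gemini": [7, 7, 7, 7, 7, 7, 7],
    "ChatGPT": [7, 6, 7, 7, 6, 7, 7],
    "Claude": [5, 6, 5, 6, 5, 6, 5],
}
share = first_place_share(toy_scores, "Prelude", 1000, rng)
print(f"Prelude ranked first in {share:.0%} of random weightings")
```

In this toy setup one system scores at least as high as every other on each dimension, so its rank cannot change under any positive weighting. A sensitivity check of this kind is informative mainly when no system dominates on every dimension, which is why the per-dimension scores matter for interpreting a 100% result.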