{
  "title": "Paper Summary and Key Results",
  "turns": [
    {
      "turn": 1,
      "user": "> 2507.10918v1 - PDF\n\nPlease read this paper.",
      "assistant": "A new paper titled \"How Stylistic Similarity Shapes Preferences in Dialogue Dataset with User and Third Party Evaluations\" was published on July 15, 2025, by researchers from Tohoku University, CyberAgent, NINJAL, and RIKEN. The research investigates how the similarity in style between a user and a dialogue system affects the user's preferences.\n\nThe study's main contribution is the creation and analysis of a new dataset called DUO (Dialogue dataset with User subjective and Objective evaluations). This dataset is designed to untangle the difference between a user's own perception of stylistic similarity and an objective assessment made by outside observers.\n\n### Methodology\n\n**Dataset Construction:** The researchers constructed the DUO dataset by collecting human-bot dialogues in two distinct open-domain settings:\n\n*   **Empathetic Dialogues (ED):** In this setting, the user shares a personal emotional experience, and the system's role is to be an empathetic listener.\n*   **Wizard of Wikipedia (WoW):** Here, the system acts as a knowledgeable \"wizard\" on a specific topic, while the user is an \"apprentice\" asking questions and exploring the topic.\n\nDialogues were collected using two large language models, GPT-4o and Llama-3.1-70B-Instruct, with 39 crowd workers from the US, Canada, and the UK participating. In total, 314 dialogues were gathered, split evenly between the ED and WoW settings.\n\n**Evaluation Process:** A key feature of this study is its dual evaluation approach:\n\n*   **Subjective Evaluation:** After each dialogue, the participating user rated the conversation on several metrics, including their preference and their perception of stylistic similarity. This is termed \"subjective stylistic similarity\".\n*   **Objective Evaluation:** The dialogues were also evaluated by independent third-party annotators who did not participate in the conversations. 
Their assessment of stylistic similarity is referred to as \"objective stylistic similarity\".\n\nBoth users and third-party annotators rated dialogues on a five-point Likert scale across several criteria:\n\n*   **Preference (PREF)**\n*   **Stylistic Similarity (STYL)**\n*   **Consistency (CONS)**\n*   **Empathy (EMP)** for the ED setting\n*   **Engagingness (ENGAG)** for the WoW setting\n\n### Key Findings\n\nThe analysis of the DUO dataset revealed several important insights:\n\n1.  **Subjective Similarity Drives Preference:** There is a strong positive correlation between a user's own perception of stylistic similarity ( $STYL_{sb}$ ) and their overall preference for the dialogue ( $PREF_{sb}$ ). The Spearman correlation coefficient was 0.75 for the ED setting and 0.67 for the WoW setting. This suggests that when users feel the system's style matches their own, they have a more favorable impression.\n2.  **Gap Between Subjective and Objective Views:** The study found a significant discrepancy between how users perceive stylistic similarity and how third-party annotators assess it. The correlation between subjective stylistic similarity ( $STYL_{sb}$ ) and objective stylistic similarity ( $STYL_{ob}$ ) was low and not statistically significant. This indicates that users and outside observers focus on different cues when judging style. For instance, a user might perceive high similarity based on emotional alignment, while an annotator might focus on subtle differences in formality.\n3.  **Objective Similarity and Preference are Weakly Linked:** In contrast to the subjective findings, the correlation between objective stylistic similarity ( $STYL_{ob}$ ) and user preference ( $PREF_{sb}$ ) was weak. This challenges the assumption that third-party evaluations of style can accurately predict a user's actual preference.\n4.  
**Consistency as a Prerequisite:** The analysis showed that dialogue consistency is a foundational requirement for a positive user experience. Dialogues rated low on consistency also received low scores for preference and stylistic similarity. However, high consistency did not guarantee high preference, suggesting it's a necessary but not sufficient condition.\n5.  **Automatic Evaluation Mirrors Objective Ratings:** An automatic evaluation using an LLM-based method (LLM-Eval) found that its scores for stylistic similarity correlated more strongly with the objective, third-party ratings than with the users' subjective ratings.\n\n### Conclusion and Contributions\n\nThe authors conclude that distinguishing between subjective and objective evaluations is crucial for understanding user preferences in dialogue systems. The research underscores that a user's personal feeling of stylistic alignment is a much stronger driver of their satisfaction than an objective, external measure of that same alignment.\n\nThe key contributions of this paper are:\n\n*   The development of the DUO dataset, which includes both subjective user evaluations and objective third-party evaluations of dialogue preference and stylistic similarity.\n*   An empirical demonstration of the strong link between user-perceived stylistic similarity and their own preferences.\n*   The identification of a clear discrepancy between subjective and objective evaluations of stylistic similarity, highlighting the need to consider these perspectives separately.\n\nThe dataset is available for non-commercial research purposes to encourage further study into these nuanced aspects of human-computer interaction."
    },
    {
      "turn": 2,
      "user": "In Korean",
      "assistant": "A new paper titled \"How Stylistic Similarity Shapes Preferences in Dialogue Dataset with User and Third Party Evaluations\" was published on July 15, 2025, by researchers from Tohoku University, CyberAgent, NINJAL, and RIKEN. This study investigates how stylistic similarity between a user and a dialogue system affects user preferences.\n\nThe main contribution of the study is the construction and analysis of a new dataset called DUO (Dialogue dataset with User subjective and Objective evaluations). This dataset is designed to analyze the difference between a user's own perception of stylistic similarity and the objective evaluation made by outside observers.\n\n### Methodology\n\n**Dataset construction:** The researchers built the DUO dataset by collecting human-bot dialogues in two distinct open-domain settings:\n\n*   **Empathetic Dialogues (ED):** In this setting, the user shares a personal emotional experience, and the system plays the role of an empathetic listener.\n*   **Wizard of Wikipedia (WoW):** Here, the system acts as a knowledgeable \"wizard\" on a specific topic, while the user is an \"apprentice\" who asks questions and explores the topic.\n\nDialogues were collected using two large language models, GPT-4o and Llama-3.1-70B-Instruct, with 39 crowd workers from the US, Canada, and the UK participating. A total of 314 dialogues were gathered, evenly split between the ED and WoW settings.\n\n**Evaluation process:** A key feature of this study is its dual evaluation approach:\n\n*   **Subjective evaluation:** After each dialogue, the participating user rated the conversation on several metrics, including their preference and their perception of stylistic similarity. This is referred to as \"subjective stylistic similarity.\"\n*   **Objective evaluation:** The dialogues were also evaluated by independent third-party annotators who did not participate in the conversations. 
Their assessment of stylistic similarity is referred to as \"objective stylistic similarity.\"\n\nBoth users and third-party annotators rated the dialogues on a five-point Likert scale across several criteria:\n\n*   **Preference (PREF)**\n*   **Stylistic Similarity (STYL)**\n*   **Consistency (CONS)**\n*   **Empathy (EMP)** for the ED setting\n*   **Engagingness (ENGAG)** for the WoW setting\n\n### Key Findings\n\nAnalysis of the DUO dataset revealed several important insights:\n\n1.  **Subjective similarity drives preference:** There is a strong positive correlation between a user's own perception of stylistic similarity ( $STYL_{sb}$ ) and their overall preference for the dialogue ( $PREF_{sb}$ ). The Spearman correlation coefficient was 0.75 in the ED setting and 0.67 in the WoW setting. This suggests that when users feel the system's style matches their own, they form a more favorable impression.\n2.  **A gap between subjective and objective views:** The study found a significant discrepancy between how users perceive stylistic similarity and how third-party annotators assess it. The correlation between subjective stylistic similarity ( $STYL_{sb}$ ) and objective stylistic similarity ( $STYL_{ob}$ ) was low and not statistically significant. This indicates that users and outside observers focus on different cues when judging style. For example, a user might perceive high similarity based on emotional alignment, while an annotator might focus on subtle differences in formality.\n3.  **Objective similarity is weakly linked to preference:** In contrast to the subjective findings, the correlation between objective stylistic similarity ( $STYL_{ob}$ ) and user preference ( $PREF_{sb}$ ) was weak. This challenges the assumption that third-party style evaluations can accurately predict a user's actual preference.\n4.  
**Consistency as a prerequisite:** The analysis showed that dialogue consistency is a foundational requirement for a positive user experience. Dialogues with low consistency scores also received low scores for preference and stylistic similarity. However, high consistency did not guarantee high preference, suggesting it is necessary but not sufficient.\n5.  **Automatic evaluation mirrors objective ratings:** An automatic evaluation using an LLM-based method (LLM-Eval) found that its scores for stylistic similarity correlated more strongly with the objective third-party ratings than with users' subjective ratings.\n\n### Conclusion and Contributions\n\nThe authors conclude that distinguishing between subjective and objective evaluations is crucial for understanding user preferences in dialogue systems. The study highlights that a user's personal sense of stylistic alignment is a much stronger driver of satisfaction than an objective, external measure of the same alignment.\n\nThe main contributions of this paper are:\n\n*   The development of the DUO dataset, which includes both subjective user evaluations and objective third-party evaluations of dialogue preference and stylistic similarity.\n*   An empirical demonstration of the strong association between user-perceived stylistic similarity and their own preferences.\n*   The identification of a clear discrepancy between subjective and objective evaluations of stylistic similarity, underscoring the need to consider these perspectives separately.\n\nThis dataset is available for non-commercial research purposes to encourage further study of these nuanced aspects of human-computer interaction."
    },
    {
      "turn": 3,
      "user": "Please outline the introduction flow of this paper",
      "assistant": "Sure, the paper’s Introduction is structured as follows.\n\n1.  **Background: Advances in dialogue generation technology**\n    *   Recent advances in dialogue generation have greatly improved systems’ ability to produce context-appropriate, natural responses.\n    *   Building on this, the research focus has shifted from merely producing appropriate replies to “personalization,” i.e., adapting system outputs to each user’s preferences.\n2.  **Problem statement: Ambiguity in evaluating stylistic similarity**\n    *   Prior studies have suggested that the more similar the styles of the system and the user (stylistic similarity), the more positive an impression users may form of the system.\n    *   However, the crucial question of \"who\" is recognizing stylistic similarity is often overlooked.\n    *   Many studies rely on third-party raters to judge stylistic similarity, but it has been unclear whether those third-party judgments—or users’ own perceptions—actually reflect users’ preferences.\n3.  **Approach: Building a new dataset**\n    *   To explore this issue, the present study constructs a new dialogue dataset.\n    *   The dataset includes (1) user preference, (2) the user’s own perceived ‘subjective stylistic similarity,’ and (3) ‘objective stylistic similarity’ rated by third-party annotators.\n    *   It is built under two open-domain settings: ‘Empathetic Dialogues,’ which focuses on empathetic communication, and ‘Wizard of Wikipedia,’ a knowledge-grounded dialogue setting.\n4.  
**Key claim and preview of findings**\n    *   Analysis of the constructed dataset shows a strong positive correlation between **‘subjective’ stylistic similarity** and user preference.\n    *   Furthermore, it highlights the important finding that **‘subjective’ similarity perceived by users differs from third-party objective similarity**.\n    *   In sum, when analyzing the relationship between stylistic similarity and user preference, it is crucial to clearly distinguish subjective and objective evaluations and to understand the unique facets each captures.\n\nThrough this flow, the paper underscores the importance of personalization in dialogue-system research and proposes a new direction: clarifying whose perspective should be used to evaluate “stylistic similarity.”"
    },
    {
      "turn": 4,
      "user": "> 2410.02503v1 - PDF\n\nPlease read this paper",
      "assistant": "Sure, I'll summarize the paper for you.\n\n### Paper Summary: Mixed-Session Conversation with Egocentric Memory\n\nThis paper, released on October 3, 2024 by a research team at the UNIST Graduate School of Artificial Intelligence, proposes a new paradigm to overcome the limitations of existing conversational systems.\n\n#### Limitations of prior work and problem statement\n\nExisting dialogue systems have struggled to engage in dynamic, ongoing interactions over long periods while meeting multiple conversation partners, as humans do. Prior research has mainly progressed along two directions:\n\n*   **Depth:** 'Multi-Session Conversation' that continues dialogue with one partner across multiple sessions\n*   **Width:** 'Multi-Party Conversation' where multiple partners participate in a single session\n\nHowever, real conversations combine both aspects, so focusing on only one cannot fully capture how conversations work in the real world.\n\n#### Proposal: 'Mixed-Session Conversation' and 'Egocentric Memory'\n\nTo overcome these limitations, the team introduces a new conversational paradigm called **'Mixed-Session Conversation'**. In this paradigm, a single main speaker converses **with different partners** across multiple sessions.\n\nTo realize this, they propose two core components:\n\n1.  **MISC dataset:** a newly constructed dialogue dataset to implement 'Mixed-Session Conversation'.\n2.  **EMMA model:** a dialogue model that uses a new memory management mechanism called **Egocentric Memory**. 
'Egocentric Memory' collects and maintains the content of conversations with each partner from the main speaker's point of view, allowing them to keep context and maintain a natural flow even when later conversing with a different partner.\n\n#### How the MISC dataset was built\n\nThe MISC dataset was built using large language models (LLMs) and consists of 8,556 episodes (a total of 51,336 sessions). Each episode comprises six consecutive sessions involving one main speaker and three partners.\n\nThe construction process is as follows:\n\n1.  **Topic collection:** Generate conversation topics from everyday-life keywords (health, travel, etc.).\n2.  **Scenario collection:** Based on the generated topics, specify details for four speakers (name, occupation, relationships, etc.) and the event content of each of the six sessions.\n3.  **Dialogue generation:** Generate each session's conversation sequentially according to the scenario. Use session summaries and 'Egocentric Memory' to ensure continuity across sessions.\n4.  **Memory generation and connection:**\n    *   **Memory Generation:** At the end of each session, summarize important events, emotions, etc., from the main speaker's perspective and store them as memory.\n    *   **Memory Connection:** Link the generated memories to manage them in an integrated, contextual way rather than merely listing them. This is a key feature of the MISC dataset that enables effective tracking and updating of related memories accumulated across sessions.\n    *   **Memory Tagging:** Tag which memories each generated utterance refers to, reinforcing a natural conversational flow.\n\n#### Architecture of the EMMA model\n\nEMMA consists of two parts: a **Dialogue Module** and a **Retrieval Module**.\n\n*   **Dialogue Module:** Based on the FLAN-T5-Large model, it handles all tasks except retrieval, including dialogue generation, memory summarization, and memory connection. 
In particular, instead of deleting or directly editing memories, it 'connects' related memories so the model can reference the most up-to-date state without information loss.\n*   **Retrieval Module:** Based on the BERT-base model, it retrieves the memories most relevant to the current conversation from those built in previous sessions. It brings not only the most relevant memory but also the other memories 'connected' to it, providing richer, expanded context.\n\n#### Experiments and results\n\nThe team conducted human evaluations via a professional assessment agency to verify the quality of the MISC dataset and the EMMA model.\n\n*   **MISC dataset quality:** Received very high scores on 'Consistency' and 'Coherence' (around 4.8 and 4.9, respectively), demonstrating that it maintains a natural flow even when the conversation partner changes from session to session.\n*   **EMMA model performance:** Scored highly (above 4.5) on 'Humanness', 'Engagingness', and 'Memorability', showing that EMMA effectively leverages egocentric memory to produce human-like conversations.\n*   **Comparison with existing models:** Existing multi-session models (MSC 2.7B, ReBot) failed to notice when the partner changed from, say, a patient to a spouse and kept responding as if to the patient, whereas EMMA correctly detected the partner change and generated context-appropriate dialogue.\n*   **Ablation Study:** While simple summary-based models compress only factual content, EMMA's egocentric memory preserves specific figures such as score changes and even the speaker's thoughts and feelings, enabling deeper conversations.\n\n#### Conclusion and limitations\n\nThis work proposes the concept of **'Mixed-Session Conversation'**, along with the **MISC dataset** and **EMMA model** that enable it. 
The egocentric memory mechanism proves highly effective at maintaining continuity and coherence even in complex situations where the conversation partner changes.\n\nHowever, the current system is limited to one partner per session; the authors note that future work will explore more dynamic settings in which multiple partners participate simultaneously within a single session."
    }
  ]
}