Analysis of Mock Conversations Across Large Language Models

11 Sept 2025 (modified: 08 Oct 2025) | Submitted to Agents4Science | CC BY 4.0
Keywords: large language model, conversation, natural language processing
TL;DR: A systematic comparison of four LLMs through mock multi-turn conversations analyzed with NLP tools. Our framework enables reproducible, quantitative profiling of LLM conversational behavior beyond single turns.
Abstract: The rapid advancement of large language models (LLMs) has enabled increasingly sophisticated conversational agents, making systematic comparison of their conversational behaviors correspondingly important. In this study, we generated mock conversations between two people using four LLMs (the free ChatGPT web version without an account, Gemini-2.0-flash, GPT-5 in thinking mode, and Claude Opus 4.1), each prompted to produce a 30-turn interaction. We quantitatively analyzed multiple conversation-level features, including structural metrics (e.g., number of turns, utterance length), lexical and linguistic properties (e.g., type-token ratio, noun/verb ratios, lexical alignment), sentiment and emotion, repetition and novelty, question-response patterns, speaker balance, and linguistic complexity measured via perplexity. Statistical tests (Kruskal-Wallis), feature-importance analyses using random forests, and dimensionality reduction (PCA) were employed to identify discriminative features and uncover patterns across models. Results revealed that GPT-5 exhibited high novelty, lexical diversity, and complexity but shorter utterances, whereas ChatGPT Free produced longer, more positive utterances with higher question rates. Claude Opus 4.1 generated the longest conversations with balanced linguistic profiles, and Gemini-2.0-flash was generally intermediate. Our work provides a multi-dimensional understanding of AI conversational behavior within single-agent interactions, offering insights into model selection, fine-tuning, and the design of future human-AI dialogue systems.
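To make the analysis pipeline concrete, the following is a minimal sketch of the kind of workflow the abstract describes, assuming Python with pandas, SciPy, and scikit-learn. The toy transcripts, column names, and feature set here are illustrative assumptions, not the authors' actual code or data; the real study uses many more features (sentiment, lexical alignment, perplexity, etc.) and full 30-turn conversations.

```python
# Sketch: per-conversation feature extraction, Kruskal-Wallis test,
# random-forest feature importance, and PCA projection.
# Assumes pandas, scipy, and scikit-learn are installed; data is hypothetical.
import pandas as pd
from scipy.stats import kruskal
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

def type_token_ratio(text: str) -> float:
    """Lexical diversity: unique tokens / total tokens (whitespace tokenization)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Hypothetical input: one row per conversation, with its transcript and source model.
conversations = pd.DataFrame({
    "model": ["gpt5", "gpt5", "claude", "claude", "gemini", "gemini"],
    "text": [
        "Hi there, how are you today?",
        "I was wondering about the weather forecast.",
        "Hello! It is a pleasure to chat with you again today.",
        "Certainly, let us discuss the plan for the weekend in detail.",
        "Hey, quick question about lunch plans.",
        "Sure, what time works best for you?",
    ],
})

# Conversation-level features (here only two, for brevity).
conversations["ttr"] = conversations["text"].map(type_token_ratio)
conversations["n_tokens"] = conversations["text"].map(lambda t: len(t.split()))

# Kruskal-Wallis: does a given feature differ significantly across models?
groups = [g["ttr"].values for _, g in conversations.groupby("model")]
h_stat, p_value = kruskal(*groups)

# Random-forest feature importance: which features best discriminate the models?
feature_cols = ["ttr", "n_tokens"]
X, y = conversations[feature_cols], conversations["model"]
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = dict(zip(feature_cols, rf.feature_importances_))

# PCA: project conversations into 2D to inspect model-level clustering.
coords = PCA(n_components=2).fit_transform(X)

print(f"Kruskal-Wallis on TTR: H={h_stat:.3f}, p={p_value:.3f}")
print("Feature importances:", importances)
```

In the full analysis, each feature would be tested separately with Kruskal-Wallis, and the random forest and PCA would operate on the complete feature matrix rather than the two toy columns shown here.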
Submission Number: 103