Abstract: How do you compare two language models that score identically on benchmarks but behave very differently in conversation? One model may explore diverse topics fluidly, while another repeats familiar patterns.
We propose modeling multi-turn conversations as Markov chains over semantic states. By embedding conversation turns and clustering them into discrete states, we construct transition graphs that capture conversational dynamics beyond static performance metrics. From this structure, we derive three interpretable observables: entropy rate, spectral gap, and stationary distribution, corresponding respectively to behavioral diversity, responsiveness, and long-term conversational patterns.
We apply the framework to over 300,000 turns of teacher–student dialogue, comparing Llama 3.1 8B and Mistral 7B. Despite similar benchmark performance, the models exhibit distinct behavioral signatures: Llama produces more diverse responses and transitions more fluidly between semantic states, while Mistral concentrates probability mass on a narrower set of conversational behaviors.
Conversational Markov analysis provides a principled, model-agnostic tool for analyzing how language models behave over time, complementing existing evaluation methods and enabling deeper insight into conversational dynamics.
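The three observables named in the abstract can be illustrated with a minimal sketch. It assumes each conversation turn has already been embedded and clustered into an integer state label; that step is not shown. The function names, the additive smoothing constant, and the choice of bits as the entropy unit are illustrative assumptions, not the authors' implementation. The spectral gap is taken here as one minus the second-largest eigenvalue modulus, which is one common definition.

```python
import numpy as np

def transition_matrix(states, n_states, smoothing=1e-6):
    """Estimate a row-stochastic transition matrix from a sequence of state labels.

    `smoothing` is a hypothetical additive constant that keeps every row
    well defined even when a state is never left in the observed data.
    """
    counts = np.full((n_states, n_states), float(smoothing))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

def stationary_distribution(P):
    """Left eigenvector of P for the eigenvalue closest to 1, normalized to sum to 1."""
    vals, vecs = np.linalg.eig(P.T)
    idx = np.argmin(np.abs(vals - 1.0))
    pi = np.real(vecs[:, idx])
    return pi / pi.sum()

def entropy_rate(P, pi):
    """Entropy rate H = -sum_i pi_i sum_j P_ij log2 P_ij, in bits per transition."""
    logP = np.zeros_like(P)
    mask = P > 0
    logP[mask] = np.log2(P[mask])  # zero-probability entries contribute 0
    return float(-np.sum(pi[:, None] * P * logP))

def spectral_gap(P):
    """One minus the second-largest eigenvalue modulus of P."""
    mods = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    return float(1.0 - mods[1])

if __name__ == "__main__":
    # Toy label sequence standing in for clustered conversation turns.
    toy_states = [0, 0, 1, 2, 0, 1, 1, 2, 0, 0, 2, 1, 0]
    P = transition_matrix(toy_states, n_states=3)
    pi = stationary_distribution(P)
    print("stationary distribution:", np.round(pi, 3))
    print("entropy rate (bits):", round(entropy_rate(P, pi), 3))
    print("spectral gap:", round(spectral_gap(P), 3))
```

Under this reading, a higher entropy rate corresponds to less predictable transitions between semantic states, a larger spectral gap to faster convergence toward the long-run distribution, and a stationary distribution peaked on one state to concentrated conversational behavior.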
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission: Revised Figure 2 (Behavioral observables comparison) to address reviewer feedback:
1. **Annotation boxes repositioned** — delta-annotation boxes moved over the shorter bar in each subplot to eliminate overlap with bar tops.
2. **Error bars added to Peak State Concentration subplot** — bootstrap 95% confidence intervals (Llama [9.8%, 11.2%], Mistral [16.5%, 18.1%]) were inadvertently omitted from the original figure; now displayed consistently with the other three subplots and matching the intervals reported in Section 6.
No other changes to the manuscript.
Assigned Action Editor: ~Branislav_Kveton1
Submission Number: 6826