Towards a Framework for Studying Alignment Drift in Multi-Turn Human–Robot Interaction

Published: 26 Feb 2026 · Last Modified: 12 Mar 2026 · D-TUR 2026 · CC BY 4.0
Keywords: Human-Robot Interaction, Alignment Drift, Multi-Turn Dialogue, Large Language Models, Representation Engineering, AI Safety
TL;DR: We propose a framework for studying how alignment drifts over extended human–robot interaction by generating synthetic multi-turn conversations and analyzing temporal shifts in internal model representations.
Abstract: Foundation models are increasingly integrated into autonomous and social robots, where behavior often emerges over extended voice-based human–robot interaction rather than from isolated, single-prompt instructions through text-based interfaces. Large Language Model (LLM) alignment refers to ensuring that models adhere to human preferences and restrictions, producing outputs that are helpful, honest, and harmless. Most existing alignment techniques, however, are evaluated in single-turn settings, in multi-turn adversarial contexts, or in text-based chatbot scenarios. These settings provide limited insight into how alignment is sustained under realistic long-term human–robot interaction, where conversational history and social dynamics shape future model behavior. This paper examines current methodological challenges in studying alignment drift in long-term human–robot interaction. We argue that existing multi-turn evaluation techniques focus primarily on adversarial jailbreak scenarios and thus offer little insight into how alignment may shift gradually through naturalistic, feedback-driven interaction. To address this gap, we propose a multi-agent framework for generating synthetic, HRI-focused multi-turn conversations. Using the resulting data, we aim to study alignment drift at the representation level by modeling each interaction as a trajectory in activation space and analyzing its turn-by-turn dynamics. By linking observable conversational behavior to internal representational changes, the approach aims to provide methods for improving transparency and interpretability in long-term human–robot interaction with generative models.
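The trajectory-in-activation-space idea from the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it assumes each conversational turn has already been reduced to a single activation vector (e.g., a mean-pooled hidden state; how that vector is obtained is left open here) and quantifies drift as the cosine distance of each turn's activation from the first turn.

```python
import numpy as np

def drift_trajectory(activations: np.ndarray) -> list[float]:
    """Cosine distance of each turn's activation from the first turn.

    activations: (n_turns, d) array, one per-turn activation vector
    (hypothetical setup: e.g., mean-pooled hidden states per turn).
    Returns a list of drift values; index 0 is 0.0 by construction.
    """
    base = activations[0]
    base_norm = np.linalg.norm(base)
    drifts = []
    for v in activations:
        cos = float(v @ base) / (np.linalg.norm(v) * base_norm)
        drifts.append(1.0 - cos)  # 0 = identical direction, larger = more drift
    return drifts
```

A monotonically increasing trajectory under this metric would be one operationalization of gradual alignment drift; other choices (per-layer distances, drift between consecutive turns, or projection onto a learned "alignment" direction) fit the same framing.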
Submission Number: 3