Abstract: Effective turn-taking is fundamental to conversational interaction, shaping the fluidity of communication both in human dialogue and in interactions with spoken dialogue systems (SDS). Despite its apparent simplicity, conversational turn-taking involves complex timing mechanisms influenced by linguistic, prosodic, and multimodal cues. This review synthesises recent theoretical insights and practical advances in understanding and modelling conversational timing dynamics, emphasising voice activity (VA), turn floor offsets (TFOs), and predictive turn-taking. We first discuss foundational concepts, such as voice activity detection (VAD) and inter-pausal units (IPUs), and highlight their significance for systematically representing dialogue states. A central challenge for interactive systems is distinguishing moments when the floor passes to another speaker from moments when it remains with the current speaker, captured by the concepts of “shift” and “hold” respectively. The timing of these transitions, measured through TFOs, aligns closely with minimal human reaction times, suggesting biological underpinnings while exhibiting cross-linguistic variability. The review further explores computational turn-taking heuristics and models, noting that simplistic strategies may reduce interruptions yet risk introducing unnatural delays. Integrating multimodal signals (prosodic, verbal, and visual) with predictive mechanisms is emphasised as essential for future progress towards human-like conversational responsiveness.
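To make the terms in the abstract concrete, the following minimal sketch shows one way IPUs, hold/shift labels, and TFOs could be derived from frame-level voice activity for two speakers. It is an illustration, not the review's own method: the 10 ms frame hop, the 200 ms pause threshold, and the function names `ipus` and `transitions` are all assumptions, and backchannels or fully overlapped speech are not treated specially.

```python
from typing import List, Tuple

FRAME_S = 0.01  # assumed frame hop: 10 ms per voice-activity frame


def ipus(va: List[int], min_pause_frames: int = 20) -> List[Tuple[int, int]]:
    """Group active frames into inter-pausal units as [start, end) frame spans.

    Silences shorter than min_pause_frames (assumed 200 ms) are bridged,
    so they do not split a unit.
    """
    runs = []
    start = None
    for i, active in enumerate(va):
        if active and start is None:
            start = i
        elif not active and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(va)))

    merged: List[Tuple[int, int]] = []
    for s, e in runs:
        if merged and s - merged[-1][1] < min_pause_frames:
            merged[-1] = (merged[-1][0], e)  # bridge the short pause
        else:
            merged.append((s, e))
    return merged


def transitions(
    va_a: List[int], va_b: List[int], min_pause_frames: int = 20
) -> List[Tuple[str, float]]:
    """Label each gap between consecutive IPUs as 'hold' or 'shift'.

    The offset is next IPU onset minus previous IPU end, in seconds.
    For a shift this is the TFO; a negative value means overlap.
    """
    tagged = sorted(
        [(s, e, "A") for s, e in ipus(va_a, min_pause_frames)]
        + [(s, e, "B") for s, e in ipus(va_b, min_pause_frames)]
    )
    out = []
    for (s0, e0, spk0), (s1, e1, spk1) in zip(tagged, tagged[1:]):
        kind = "hold" if spk0 == spk1 else "shift"
        out.append((kind, round((s1 - e0) * FRAME_S, 3)))
    return out


# Speaker A talks, pauses 300 ms, talks again; B starts 200 ms after A stops.
va_a = [1] * 50 + [0] * 30 + [1] * 20 + [0] * 100
va_b = [0] * 120 + [1] * 40 + [0] * 40
print(transitions(va_a, va_b))
```

In this toy input the first gap is a hold (the same speaker resumes after a pause) and the second is a shift whose offset is the TFO. Raising the pause threshold above 300 ms would merge speaker A's two units into one IPU, which shows how sensitive these labels are to the chosen threshold.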
External IDs: doi:10.3390/technologies13120591