When humans speak, gestures help convey communicative intentions, such as adding emphasis or describing concepts. However, current co-speech gesture generation methods rely solely on superficial linguistic cues (e.g. speech audio or text transcripts), neglecting to understand and leverage the communicative intention that underpins human gestures. This results in outputs that are rhythmically synchronized with speech but are semantically shallow. To address this gap, we introduce Intentional-Gesture, a novel framework that casts gesture generation as an intention-reasoning task grounded in high-level communicative functions. First, we curate the InG dataset by augmenting BEAT-2 with gesture-intention annotations (i.e., text sentences summarizing intentions), which are automatically annotated using large vision-language models. Next, we introduce the Intentional Gesture Motion Tokenizer to leverage these intention annotations. It injects high-level communicative functions (e.g., intentions) into tokenized motion representations to enable intention-aware gesture synthesis. Our framework offers a modular foundation for expressive gesture generation in digital humans and embodied AI.
Intentional Gesture generates diverse gestures conditioned on speech audio and intention, showcasing its potential for applications in digital humans and embodied agents.
Our results are shown on the left, and the results of compared methods are shown on the right.
Photorealistic video generation based on the audio2photoreal rendering pipeline.
Ablation Study Comparison: Full version results are shown on the left, and the ablated results are shown on the right.
Replace Intentions with Motion Description: When we replace the intention annotations with motion descriptions, the model loses the high-level communicative context, producing less semantically meaningful gestures that focus on physical movement rather than communicative intent.
Semantic Supervision for the tokenizer helps capture the emotional context of the speech and represent the corresponding larger motion patterns that highlight strong emotions.
Without Intention as Input, the model relies only on audio beats: although most gestures follow the rhythm, the motion is not semantically meaningful and often looks redundant and unnatural.
Information about the dataset used in this research.
We segment videos into utterances and extract SMPL-based 3D motion. Within each utterance, motion trajectories are smoothed and segmented by direction/amplitude to form rule-based movement descriptors, anchored by keyframes. These motion cues, together with transcripts, feed a VLM prompting pipeline that produces communicative function labels and intention summaries. A human-in-the-loop stage filters candidates and finalizes annotations.
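The trajectory smoothing and direction/amplitude segmentation step can be sketched roughly as follows; the moving-average smoother, the threshold value, and the function names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def smooth(traj, window=5):
    # Moving-average smoothing of a (T, 3) joint trajectory (assumed filter).
    kernel = np.ones(window) / window
    return np.stack(
        [np.convolve(traj[:, d], kernel, mode="same") for d in range(traj.shape[1])],
        axis=1,
    )

def segment_by_direction(traj, min_amplitude=0.02):
    # Split a trajectory into segments wherever the movement direction
    # flips along the dominant axis; drop low-amplitude segments.
    vel = np.diff(traj, axis=0)                          # frame-to-frame displacement
    dominant = int(np.argmax(np.abs(vel).mean(axis=0)))  # axis with the most motion
    signs = np.sign(vel[:, dominant])
    boundaries = [0]
    for i in range(1, len(signs)):
        if signs[i] != 0 and signs[i] != signs[i - 1]:
            boundaries.append(i + 1)                     # direction flip -> new segment
    boundaries.append(len(traj))
    segments = []
    for s, e in zip(boundaries[:-1], boundaries[1:]):
        amp = float(np.ptp(traj[s:e, dominant]))         # peak-to-peak amplitude
        if amp >= min_amplitude:
            segments.append({"start": s, "end": e,
                             "axis": "xyz"[dominant], "amplitude": amp})
    return segments
```

Each surviving segment is a candidate rule-based movement descriptor, which can then be anchored to a keyframe near its midpoint.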
Speech is transcribed and aligned to the audio timeline so that transcript tokens provide temporally grounded queries for gesture understanding. Each clip bundles time-aligned transcripts, audio features, and 3D motion with finalized intention & function labels, enabling models to condition on rhythmic audio cues and explicit communicative semantics.
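Feeding the time-aligned transcript and movement descriptors to the VLM might look like the following prompt-assembly sketch; the prompt wording and the descriptor field names are hypothetical, not the actual pipeline:

```python
def build_intention_prompt(transcript, descriptors):
    # Assemble a text prompt from a transcript and rule-based movement
    # descriptors; the wording and descriptor fields are illustrative.
    lines = [
        "You are analyzing co-speech gestures.",
        f'Transcript: "{transcript}"',
        "Movement descriptors:",
    ]
    for d in descriptors:
        lines.append(f"- {d['part']}: {d['pattern']} (amplitude {d['amplitude']:.2f})")
    lines.append(
        "Return JSON with keys: motion_analysis, function_derivation, "
        "gesture_behavior_mapping, inferred_intention."
    )
    return "\n".join(lines)
```

The requested output keys mirror the annotation schema shown in the examples below, so the VLM response can be parsed directly into candidate annotations for human-in-the-loop filtering.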
Two example clips demonstrating annotated motion analysis using rule-based movement descriptors and VLM prompting for intention inference.
{
  "motion_analysis": {
    "head": "Neutral input; no observed head shake or nod reported.",
    "hands_fingers": "No finger articulation available. Two hands are moving inward indicating a closed posture.",
    "arms_shoulders": "Both arms are moving inward indicating a closed posture.",
    "legs_feet": "Stable stance assumed; no stepping or weight shift described.",
    "torso_whole_body": "Upright/neutral posture; emphasis is carried by phrasing rather than observed body movement."
  },
  "function_derivation": [
    "Evaluation (negative): \"not very good\" expresses a negative assessment of community services.",
    "Emphasis: The phrase \"not very\" intensifies the negative judgment.",
    "Topic Framing: \"the community services\" establishes the evaluated entity."
  ],
  "gesture_behavior_mapping": [
    "Evaluation (negative) → A closed posture or reduced amplitude beats would align with negative appraisal.",
    "Emphasis → Minimal beat accents could coincide with the stressed phrase \"not very good\".",
    "Topic Framing → No deictic mapping asserted without visual evidence (no pointing/reference gesture claimed)."
  ],
  "inferred_intention": {
    "motion_based": "Insufficient motion evidence provided; the person is in a closed posture or reduced amplitude beats for their hand movements with negative appraisal.",
    "summary": "Convey dissatisfaction with community services; the linguistic construction signals a clear negative evaluation without asserting unobserved gestures."
  }
}
{
  "motion_analysis": {
    "head": "Brief forward nod on \"dangerous\"; otherwise steady orientation.",
    "hands_fingers": "Right hand open-palm with relaxed fingers; small outward sweep during \"more often than not\"; left hand neutral.",
    "arms_shoulders": "Right arm performs a short lateral sweep at mid-torso height; shoulders relaxed.",
    "legs_feet": "Stable stance; no visible stepping or weight shift beyond slight forward bias.",
    "torso_whole_body": "Upright posture with a subtle forward lean on the stressed word \"dangerous\"."
  },
  "function_derivation": [
    "Emphasis: Stress on \"dangerous\" coincides with nod/lean.",
    "Generalization/Frequency: \"more often than not\" paired with a broadening hand sweep."
  ],
  "gesture_behavior_mapping": [
    "Emphasis → Head nod and slight forward lean timed with \"dangerous\".",
    "Generalization/Frequency → Open-palm outward sweep aligning with \"more often than not\"."
  ],
  "inferred_intention": {
    "motion_based": "Gestures highlight severity (nod/lean) and breadth/frequency (open-palm sweep) in lockstep with the spoken phrasing.",
    "summary": "Underline that such situations occur frequently and carry risk; gestures emphasize seriousness and scope."
  }
}
The format of a metadata JSON file is shown below (example from 2_scott_0_24_24.json):
{
  "test_case": "2_scott_0_24_24.TextGrid",
  "sequences": [
    {
      "sequence_timing": {
        "start_time": 1.7,
        "end_time": 4.01,
        "duration": 2.31
      },
      "sentence": "there is one food that is quite tasty",
      "word_timings": [
        {
          "word": "there",
          "start_time": 1.7,
          "end_time": 1.9,
          "frame_index": 27,
          "image_path": "BEAT_V2/beat_v2.0.0/smplx_render/english/2_scott_0_24_24/frame_27.png"
        },
        {
          "word": "is",
          "start_time": 1.9,
          "end_time": 2.01,
          "frame_index": 29,
          "image_path": "BEAT_V2/beat_v2.0.0/smplx_render/english/2_scott_0_24_24/frame_29.png"
        }
      ],
      "motion_analysis": {
        "head": "Slight forward tilt as the speaker emphasizes \"tasty,\" indicating engagement and interest.",
        "hands_fingers": "Right hand held in front, fingers extended and slightly bent, conveying indication/description.",
        "arms_shoulders": "Right arm slightly raised at shoulder height; left arm relaxed.",
        "legs_feet": "Weight evenly distributed with a slight forward lean.",
        "torso_whole_body": "Upper body leans slightly forward; posture is open and engaged."
      },
      "function_derivation": [
        "Deixis: \"There\" indicates a reference to something specific.",
        "Quantification: \"One\" suggests a singular item within a larger set."
      ],
      "gesture_behavior_mapping": [
        "Deixis → Pointing gesture: Extended hand aligns with indicating a specific item.",
        "Quantification → Numerical gesture: Hand positioning denotes singularity (one)."
      ],
      "inferred_intention": {
        "motion_based": "Gestures draw attention to a specific item; forward lean and arm positioning show engagement.",
        "summary": "Highlight a particular food item and its appeal, inviting anticipatory engagement."
      }
    }
  ]
}
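A minimal loader for this metadata format might look like the following; the function name and the flattened output fields are assumptions, and only the JSON keys come from the example above:

```python
import json

def load_clip_metadata(path):
    # Read one metadata file in the format above and flatten each
    # sequence into the fields a downstream model might condition on.
    with open(path) as f:
        meta = json.load(f)
    clips = []
    for seq in meta["sequences"]:
        clips.append({
            "sentence": seq["sentence"],
            "duration": seq["sequence_timing"]["duration"],
            "intention": seq["inferred_intention"]["summary"],
            "words": [(w["word"], w["start_time"], w["end_time"])
                      for w in seq["word_timings"]],
        })
    return clips
```

Each returned entry pairs the rhythmic timing information (word-level alignment, duration) with the high-level intention summary, matching the conditioning signals described above.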
Intentional Gesture, 2025