Intentional Gesture: Deliver Your Intentions with Gestures for Speech

We present Intentional Gesture, a novel framework for intention-controllable gesture generation. Our method models latent communicative functions from speech and grounds motion generation in these inferred intentions.

Abstract

When humans speak, gestures help convey communicative intentions, such as adding emphasis or describing concepts. However, current co-speech gesture generation methods rely solely on superficial linguistic cues (e.g., speech audio or text transcripts), neglecting to understand and leverage the communicative intention that underpins human gestures. This results in outputs that are rhythmically synchronized with speech but semantically shallow. To address this gap, we introduce Intentional-Gesture, a novel framework that casts gesture generation as an intention-reasoning task grounded in high-level communicative functions. First, we curate the InG dataset by augmenting BEAT-2 with gesture-intention annotations (i.e., text sentences summarizing intentions), which are automatically produced by large vision-language models. Next, we introduce the Intentional Gesture Motion Tokenizer to leverage these intention annotations: it injects high-level communicative functions (i.e., intentions) into tokenized motion representations, enabling intention-aware gesture synthesis. Our framework offers a modular foundation for expressive gesture generation in digital humans and embodied AI.

Method

Left: AuMoCLIP learns a hierarchical joint embedding of motion, audio, and intention. Transcript embeddings (BERT) aligned via CTC serve as queries in a cross-attention module with intention embeddings as keys/values. The resulting semantic features are concatenated with wav2vec2 audio features for contrastive learning. Right: Motion is quantized via a multi-codebook VQ module and supervised by semantic features from AuMoCLIP, enabling expressive and controllable gesture generation.
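
To make the fusion path concrete, below is a minimal PyTorch sketch of the cross-attention and concatenation described above. Module names, dimensions, and the single attention layer are illustrative assumptions for exposition, not the exact AuMoCLIP implementation.

import torch
import torch.nn as nn

class SemanticFusion(nn.Module):
    """Cross-attend CTC-aligned transcript queries over intention embeddings,
    then concatenate with wav2vec2 audio features for contrastive learning."""
    def __init__(self, d_text=768, d_audio=768, d_model=512, n_heads=8):
        super().__init__()
        self.q_proj = nn.Linear(d_text, d_model)   # BERT transcript tokens (queries)
        self.kv_proj = nn.Linear(d_text, d_model)  # intention embeddings (keys/values)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.a_proj = nn.Linear(d_audio, d_model)  # wav2vec2 frame features

    def forward(self, transcript_emb, intention_emb, audio_emb):
        # transcript_emb: (B, T, d_text), time-aligned to the audio via CTC
        # intention_emb:  (B, S, d_text), encoded intention text
        # audio_emb:      (B, T, d_audio)
        q = self.q_proj(transcript_emb)
        k = v = self.kv_proj(intention_emb)
        semantic, _ = self.attn(q, k, v)                                # (B, T, d_model)
        fused = torch.cat([semantic, self.a_proj(audio_emb)], dim=-1)   # (B, T, 2*d_model)
        return fused  # contrasted against quantized motion features downstream

The fused sequence would then serve as the semantic supervision signal for the multi-codebook motion tokenizer, mirroring the hierarchical joint embedding described above.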

Video Results

Intentional Gesture generates diverse gestures conditioned on speech audio and intention, showcasing its potential for applications in digital humans and embodied agents.

Example 1
Example 2
Example 3
Example 4

Comparison with SOTA Methods

Our results are shown on the left, and the results of compared methods are shown on the right.

Realistic Video Rendering

Photorealistic video generation based on the Audio2Photoreal rendering pipeline.

Photorealistic Rendering 1
Photorealistic Rendering 2
Photorealistic Rendering 3

Ablation Study

Ablation Study Comparison: results of the full model are shown on the left, and the ablated variants on the right.

Dataset Information

Information about the dataset used in this research.

Basic Information

  • Built on top of BEAT-2 and Audio2Photoreal, both high-quality co-speech gesture corpora, and augmented with Intention-Grounded (InG) annotations (communicative functions + intention summaries).
  • Each utterance is paired with motion-grounded descriptors (keyframes + rule-based movement summaries) and intention text derived via a structured VLM prompting protocol with human filtering.
  • Modalities include audio, time-aligned transcripts, 3D body motion (SMPL joints & hands), and intention/function labels.
  • Annotations target pragmatic functions (e.g., Emphasis, Deixis, Negation, Mental State, Process) to enable intention-controllable gesture generation.

Basic Statistics

  • 34,641 / 3,598 / 9,674 annotated utterances for train / val / test (InG).
  • 16 communicative function types (top: Emphasis ≈21.7%, Deixis ≈20.1%).
  • Source BEAT-2: ~60 hours, 25 speakers, 1,762 sequences (avg ≈65.7 s).
  • Human preference/validation study: inter-rater agreement κ ≈ 0.76 on a balanced subset.

Data Processing Pipeline

We segment videos into utterances and extract SMPL-based 3D motion. Within each utterance, motion trajectories are smoothed and segmented by direction/amplitude to form rule-based movement descriptors, anchored by keyframes. These motion cues, together with transcripts, feed a VLM prompting pipeline that produces communicative function labels and intention summaries. A human-in-the-loop stage filters candidates and finalizes annotations.
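
As an illustration of the descriptor step, the sketch below segments a smoothed 1-D joint trajectory by direction changes and amplitude. The moving-average smoothing, thresholds, and output fields are assumptions for exposition, not the exact pipeline.

import numpy as np

def movement_descriptors(traj, fps=30, win=5, min_amp=0.02):
    """traj: (T,) joint coordinate over time (e.g., wrist height in meters)."""
    kernel = np.ones(win) / win
    smooth = np.convolve(traj, kernel, mode="same")  # simple moving average
    vel = np.diff(smooth)
    sign = np.sign(vel)
    segments, start = [], 0
    for t in range(1, len(sign)):
        if sign[t] != sign[start]:                   # direction change ends a segment
            amp = abs(smooth[t] - smooth[start])
            if amp >= min_amp:                       # ignore sub-threshold jitter
                segments.append({
                    "start_s": round(start / fps, 2),
                    "end_s": round(t / fps, 2),
                    "direction": "up" if sign[start] > 0 else "down",
                    "amplitude_m": round(float(amp), 3),
                })
            start = t
    return segments

Keyframes can then be placed at segment boundaries, giving the VLM compact, temporally anchored motion evidence.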

Audio Separation and Alignment

Speech is transcribed and aligned to the audio timeline so that transcript tokens provide temporally grounded queries for gesture understanding. Each clip bundles time-aligned transcripts, audio features, and 3D motion with finalized intention & function labels, enabling models to condition on rhythmic audio cues and explicit communicative semantics.
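
For illustration, aligned word timings can be mapped to representative motion keyframes. The 15 fps motion rate and word-midpoint heuristic below are assumptions inferred from the example metadata later on this page, not a documented rule.

def word_to_keyframe(start_time, end_time, fps=15):
    # Pick the frame at the temporal midpoint of the word (assumed heuristic).
    mid = 0.5 * (start_time + end_time)
    return round(mid * fps)

# Consistent with the example metadata below:
assert word_to_keyframe(1.7, 1.9) == 27    # "there"
assert word_to_keyframe(1.9, 2.01) == 29   # "is"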

VLM Prompting Steps

  1. Input Assembly
    • Inputs: transcript snippet, utterance timestamps, motion keyframes, and rule-based movement descriptors.
    • Goal: provide the VLM with synchronized text + motion evidence for the current utterance window (a minimal end-to-end sketch appears after this list).
  2. Step 1 — Motion Analysis
    • The VLM describes salient body/hand movements (direction, extent, rhythm) grounded to provided keyframes/descriptors.
    • Output: structured motion summary (e.g., “right hand lifts outward; periodic wrist oscillation synced to stressed words”).
  3. Step 2 — Communicative Function Derivation
    • From the motion summary + transcript context, the VLM selects one or more communicative functions (e.g., Emphasis, Deixis, Contrast, Negation) with brief rationales.
    • Output: function labels with confidence and justification.
  4. Step 3 — Gesture Behavior Mapping
    • The VLM maps functions to prototypical gesture behaviors (e.g., “pointing toward referent” for Deixis), aligned to timestamps.
    • Output: behavior slots (phase onsets/offsets) and coarse spatial descriptors linked to the utterance timeline.
  5. Step 4 — Intention Inference
    • The VLM produces a concise intention summary that explains what the speaker aims to convey nonverbally.
    • Output: 1–2 sentence intention text designed to condition downstream encoders/tokenizers.
  6. Candidate Generation & Human Filtering
    • For each utterance, the VLM generates up to 5 candidates (diverse sampling); annotators review and select the best.
    • Quality checks: label consistency, timestamp alignment, and motion–text agreement; disagreements are resolved via majority vote.
  7. Packaging
    • Final artifacts per utterance: transcript, audio timestamps, SMPL motion, function labels, and intention summary with provenance (prompt version, model ID, and human reviewer ID).
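
The sketch below illustrates the input-assembly and candidate-generation steps end to end. The prompt wording, field names, and the call_vlm function are hypothetical; only the four-step structure and the output schema mirror the protocol above.

import json

def build_prompt(transcript, start_s, end_s, descriptors, keyframe_paths):
    # Assemble synchronized text + motion evidence into one structured prompt.
    evidence = "\n".join(f"- {d}" for d in descriptors)
    return (
        f"Utterance ({start_s:.2f}-{end_s:.2f}s): \"{transcript}\"\n"
        f"Movement descriptors:\n{evidence}\n"
        f"Keyframes: {', '.join(keyframe_paths)}\n"
        "Step 1: summarize salient body/hand motion.\n"
        "Step 2: derive communicative functions with rationales.\n"
        "Step 3: map functions to gesture behaviors with timestamps.\n"
        "Step 4: state a 1-2 sentence intention summary.\n"
        "Answer as JSON with keys: motion_analysis, function_derivation, "
        "gesture_behavior_mapping, inferred_intention."
    )

def annotate(utterance, call_vlm, n_candidates=5):
    """Generate up to 5 diverse candidates for human filtering (call_vlm is
    a hypothetical wrapper around a vision-language model API)."""
    prompt = build_prompt(**utterance)
    return [json.loads(call_vlm(prompt, temperature=0.8))
            for _ in range(n_candidates)]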

Annotation Visualization

Two example clips demonstrate the annotated motion analysis produced from rule-based movement descriptors and VLM prompting for intention inference.

Annotation Example 1
Annotation (JSON)
{
  "motion_analysis": {
    "head": "Neutral input; no observed head shake or nod reported.",
    "hands_fingers": "No finger articulation available. Two hands are moving inward, indicating a closed posture.",
    "arms_shoulders": "Both arms are moving inward, indicating a closed posture.",
    "legs_feet": "Stable stance assumed; no stepping or weight shift described.",
    "torso_whole_body": "Upright/neutral posture; emphasis is carried by phrasing rather than observed body movement."
  },
  "function_derivation": [
    "Evaluation (negative): \"not very good\" expresses a negative assessment of community services.",
    "Emphasis: The phrase \"not very\" intensifies the negative judgment.",
    "Topic Framing: \"the community services\" establishes the evaluated entity."
  ],
  "gesture_behavior_mapping": [
    "Evaluation (negative) → A closed posture or reduced-amplitude beats would align with negative appraisal.",
    "Emphasis → Minimal beat accents could coincide with the stressed phrase \"not very good\".",
    "Topic Framing → No deictic mapping asserted without visual evidence (no pointing/reference gesture claimed)."
  ],
  "inferred_intention": {
    "motion_based": "Limited motion evidence; the person adopts a closed posture with reduced-amplitude hand beats, consistent with a negative appraisal.",
    "summary": "Convey dissatisfaction with community services; the linguistic construction signals a clear negative evaluation without asserting unobserved gestures."
  }
}
Annotation Example 2
Annotation (JSON)
{
  "motion_analysis": {
    "head": "Brief forward nod on \"dangerous\"; otherwise steady orientation.",
    "hands_fingers": "Right hand open-palm with relaxed fingers; small outward sweep during \"more often than not\"; left hand neutral.",
    "arms_shoulders": "Right arm performs a short lateral sweep at mid-torso height; shoulders relaxed.",
    "legs_feet": "Stable stance; no visible stepping or weight shift beyond slight forward bias.",
    "torso_whole_body": "Upright posture with a subtle forward lean on the stressed word \"dangerous\"."
  },
  "function_derivation": [
    "Emphasis: Stress on \"dangerous\" coincides with nod/lean.",
    "Generalization/Frequency: \"more often than not\" paired with a broadening hand sweep."
  ],
  "gesture_behavior_mapping": [
    "Emphasis → Head nod and slight forward lean timed with \"dangerous\".",
    "Generalization/Frequency → Open-palm outward sweep aligning with \"more often than not\"."
  ],
  "inferred_intention": {
    "motion_based": "Gestures highlight severity (nod/lean) and breadth/frequency (open-palm sweep) in lockstep with the spoken phrasing.",
    "summary": "Underline that such situations occur frequently and carry risk; gestures emphasize seriousness and scope."
  }
}

Example Annotation Format

The format of a metadata JSON file is shown below (example from 2_scott_0_24_24.json):

{
  "test_case": "2_scott_0_24_24.TextGrid",
  "sequences": [
    {
      "sequence_timing": {
        "start_time": 1.7,
        "end_time": 4.01,
        "duration": 2.31
      },
      "sentence": "there is one food that is quite tasty",
      "word_timings": [
        {
          "word": "there",
          "start_time": 1.7,
          "end_time": 1.9,
          "frame_index": 27,
          "image_path": "BEAT_V2/beat_v2.0.0/smplx_render/english/2_scott_0_24_24/frame_27.png"
        },
        {
          "word": "is",
          "start_time": 1.9,
          "end_time": 2.01,
          "frame_index": 29,
          "image_path": "BEAT_V2/beat_v2.0.0/smplx_render/english/2_scott_0_24_24/frame_29.png"
        }
      ],
      "motion_analysis": {
        "head": "Slight forward tilt as the speaker emphasizes \"tasty,\" indicating engagement and interest.",
        "hands_fingers": "Right hand held in front, fingers extended and slightly bent, conveying indication/description.",
        "arms_shoulders": "Right arm slightly raised at shoulder height; left arm relaxed.",
        "legs_feet": "Weight evenly distributed with a slight forward lean.",
        "torso_whole_body": "Upper body leans slightly forward; posture is open and engaged."
      },
      "function_derivation": [
        "Deixis: \"There\" indicates a reference to something specific.",
        "Quantification: \"One\" suggests a singular item within a larger set."
      ],
      "gesture_behavior_mapping": [
        "Deixis → Pointing gesture: Extended hand aligns with indicating a specific item.",
        "Quantification → Numerical gesture: Hand positioning denotes singularity (one)."
      ],
      "inferred_intention": {
        "motion_based": "Gestures draw attention to a specific item; forward lean and arm positioning show engagement.",
        "summary": "Highlight a particular food item and its appeal, inviting anticipatory engagement."
      }
    }
  ]
}
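
A minimal Python sketch for reading this format is shown below; the file name is taken from the example above, and the printed fields are illustrative.

import json

# Load one metadata file and walk its annotated sequences.
with open("2_scott_0_24_24.json") as f:
    meta = json.load(f)

for seq in meta["sequences"]:
    timing = seq["sequence_timing"]
    print(f'{timing["start_time"]:.2f}-{timing["end_time"]:.2f}s: {seq["sentence"]}')
    for w in seq["word_timings"]:
        print(f'  {w["word"]:>10s} -> frame {w["frame_index"]}')
    print("intention:", seq["inferred_intention"]["summary"])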

BibTeX


@misc{intentionalgesture2025,
  title = {Intentional Gesture: Deliver Your Intentions with Gestures for Speech},
  year  = {2025}
}