[
    {
        "id": "0",
        "video": "MER2023/videos/sample_00001998.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Mom, it's really uncomfortable. Hurry up, get on with it. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "1",
        "video": "MER2023/videos/sample_00003564.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: What's the trouble with this? You're helping the little dog! I am also glad that you can do so. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "2",
        "video": "MER2023/videos/sample_00003845.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Okay, anyway, you can just bring it back. As a little undercover agent, I can be considered as helping you climb the Blade Mountain. Next. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "3",
        "video": "MER2023/videos/sample_00000733.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: They all lack the ability to analyze independently, always feel the need to add some external, additional elements. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "4",
        "video": "MER2023/videos/sample_00000282.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: You know, how many years have I waited for you to appear. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "5",
        "video": "MER2023/videos/sample_00005891.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I will work together in the future,, I can't talk about anything. Please take care of me. If you don't know anything, feel free to come, ask me at any time. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "6",
        "video": "MER2023/videos/sample_00001444.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: You come down, you see, I'm not reliable! You contacted Gu Qing, what? Breaking the show. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "7",
        "video": "MER2023/videos/sample_00000820.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Can he tell me? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "8",
        "video": "MER2023/videos/sample_00000905.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Can you ignore it? He came to you to make trouble? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "9",
        "video": "MER2023/videos/sample_00000017.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: It's nothing, so polite. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "10",
        "video": "MER2023/videos/sample_00007005.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: The answer is: Why? When I'm in a good mood, discuss it. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "11",
        "video": "MER2023/videos/sample_00005103.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: It is also certain that your taste problem is actually a psychological issue. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "12",
        "video": "MER2023/videos/sample_00005186.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I think the act of marrying him is a bit excessive,, you should take it seriously. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "13",
        "video": "MER2023/videos/sample_00005947.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: What does it have to do with abandoning me for three whole years? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "14",
        "video": "MER2023/videos/sample_00003194.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: It's not convenient to personally negotiate with them. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "15",
        "video": "MER2023/videos/sample_00006841.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Because it's my buddy, I can't take this money. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "16",
        "video": "MER2023/videos/sample_00006169.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I am left with Fang Da, but as soon as he leaves, the family is gone. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "17",
        "video": "MER2023/videos/sample_00000750.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I don't know You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "18",
        "video": "MER2023/videos/sample_00004817.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Fuck you! No one is of use. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "19",
        "video": "MER2023/videos/sample_00004702.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: The problem is, on Chen Jian's side, it's already so obvious,, there's more. No. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "20",
        "video": "MER2023/videos/sample_00006690.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: He wants to leave us both behind, he wants to fly away on his own. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "21",
        "video": "MER2023/videos/sample_00003039.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: He was not stimulated before. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "22",
        "video": "MER2023/videos/sample_00003364.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Think about it, they are all very happy. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "23",
        "video": "MER2023/videos/sample_00005754.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Why don't you bother me? Can you help me explain to Brother Wei. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "24",
        "video": "MER2023/videos/sample_00006658.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Pinching your nose, I can't tell it's you, you upscale, highly skilled photographer. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "25",
        "video": "MER2023/videos/sample_00004889.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: So whose daughter is paralyzed? Is that certain? That would definitely cost me eight lifetimes of blood mold. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "26",
        "video": "MER2023/videos/sample_00006444.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Do you mean, I killed Shen Wei? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "27",
        "video": "MER2023/videos/sample_00000360.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Ah. Ah. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "surprise"
            }
        ]
    },
    {
        "id": "28",
        "video": "MER2023/videos/sample_00002135.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Being able to handle weddings diligently, frugally,, knowing how to live a life. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "29",
        "video": "MER2023/videos/sample_00005633.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Classmate, do you have the wrong person. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "30",
        "video": "MER2023/videos/sample_00004917.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Interview, interview you? Hey, if it weren't for choosing a job, it would be better to hit the sun today. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "31",
        "video": "MER2023/videos/sample_00003319.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: You see, I finally ran over, you took it. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "32",
        "video": "MER2023/videos/sample_00006248.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Okay! Absolutely not! I am your father. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "33",
        "video": "MER2023/videos/sample_00005400.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Don't bring such a dirty wife with you. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "34",
        "video": "MER2023/videos/sample_00005065.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Mom, are you going to take care of me for this small matter? You: Isn't it? Dear? Mom? Isn't it just one piece of clothing? You. Thought. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "35",
        "video": "MER2023/videos/sample_00001519.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Me: Did I tell you earlier? Next time, definitely. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "36",
        "video": "MER2023/videos/sample_00003805.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: You go, go, go, go. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "37",
        "video": "MER2023/videos/sample_00004886.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: My brother's house was bought by my parents with money,, they were worried that my sister-in-law would get divorced, divide the house. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "38",
        "video": "MER2023/videos/sample_00004305.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Me: I heard that last night, you lived with Chen Jun. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "39",
        "video": "MER2023/videos/sample_00002400.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I really can't let you go in my heart. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "40",
        "video": "MER2023/videos/sample_00003764.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: No, Dad, don't worry. When I listen to you, your words, tone have changed. Don't be fooled by yourself, there won't be anything wrong. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "41",
        "video": "MER2023/videos/sample_00006456.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: What time should I give it? Give him a call, ask him. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "42",
        "video": "MER2023/videos/sample_00001259.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: It's all people like us who are forced to donate to you, to put it mildly, people like you. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "43",
        "video": "MER2023/videos/sample_00003204.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: But you can't just find a man casually. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "44",
        "video": "MER2023/videos/sample_00006667.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Okay. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "45",
        "video": "MER2023/videos/sample_00003174.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Also, what do you think of Zhao Chong in our family? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "46",
        "video": "MER2023/videos/sample_00004469.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: No more than this hand. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "47",
        "video": "MER2023/videos/sample_00001637.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Get it! The trains traveling from south to north transported all the stones of my comrades to every corner of the country. I don't know where they went, but I suddenly remembered the stones that were not piled up in major factories across the province in 1999. Do you remember? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "48",
        "video": "MER2023/videos/sample_00006932.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Thank you, thank you for not seeing me outside. Actually, before that, I thought I had only Lao Tan as my friend in this lifetime, but I didn't expect to be in the Ersha Tower. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "49",
        "video": "MER2023/videos/sample_00001493.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I knew you would like it. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "50",
        "video": "MER2023/videos/sample_00004210.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I know where you dialed it. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "51",
        "video": "MER2023/videos/sample_00000714.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Also because last time we asked me to write a written inspection, this inspection may be. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "52",
        "video": "MER2023/videos/sample_00000848.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Who else can there be? You are the only one for me. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "53",
        "video": "MER2023/videos/sample_00004845.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Okay, okay, I'm not getting married. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "54",
        "video": "MER2023/videos/sample_00002044.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Where did you buy this for you? This is a gift for you. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "55",
        "video": "MER2023/videos/sample_00002047.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Sure enough, your approach to this matter today is particularly inappropriate. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "56",
        "video": "MER2023/videos/sample_00006215.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: The receptionist at the front desk was sent by my supervisor, Teacher Xiao Pang. Since I am standing there, I will. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "57",
        "video": "MER2023/videos/sample_00000436.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: He said, 'When I'm bored, I feel like crying.'. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "58",
        "video": "MER2023/videos/sample_00003168.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Sending books is not enough. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "59",
        "video": "MER2023/videos/sample_00000963.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: You. really, that man? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "60",
        "video": "MER2023/videos/sample_00000696.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: You ask him, has he ever recognized his father? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "61",
        "video": "MER2023/videos/sample_00003126.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I know who he is. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "62",
        "video": "MER2023/videos/sample_00002738.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: How can Shen Qingchuan play cat tricks? Why can't I just? Why am I so obedient. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "63",
        "video": "MER2023/videos/sample_00003738.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Okay? If there are any, you need to quickly entrust, see who can help with the transfer in your hometown. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "64",
        "video": "MER2023/videos/sample_00005791.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: What do you mean, you want to teach us? Fruit. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "65",
        "video": "MER2023/videos/sample_00005235.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: nan You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "66",
        "video": "MER2023/videos/sample_00000691.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Don't be foolish, you. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "67",
        "video": "MER2023/videos/sample_00001501.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: If you love me, just hit the probe! Ah! Me: Is this still home? Me: This is a prison. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "68",
        "video": "MER2023/videos/sample_00001613.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Do you believe you want to talk to me like this again? I will die, show you now. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "69",
        "video": "MER2023/videos/sample_00000489.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Yes. Yes. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "70",
        "video": "MER2023/videos/sample_00005886.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: nan You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "71",
        "video": "MER2023/videos/sample_00004988.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Recently, hospitals are filled with flu patients everywhere. It's the flu season now,, if my mother catches a cold, I will feel heartbroken. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "72",
        "video": "MER2023/videos/sample_00001044.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Me: Bao: Bao: Ah, you forgot. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "73",
        "video": "MER2023/videos/sample_00002440.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: However, after all, Andy, you are not allowed to let Dr. Zhao go. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "74",
        "video": "MER2023/videos/sample_00000463.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: You are like a stone in a cesspit. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "75",
        "video": "MER2023/videos/sample_00004705.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I'm at work,, something big is happening again. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "76",
        "video": "MER2023/videos/sample_00006600.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: San Ge, where is there a partner? Who did you listen to? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "surprise"
            }
        ]
    },
    {
        "id": "77",
        "video": "MER2023/videos/sample_00000193.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Tomorrow, we will return to Shanghai. You must not have any psychological pressure, especially. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "78",
        "video": "MER2023/videos/sample_00001764.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: You taught it all, me: What if it's powerful? That. You. Just. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "79",
        "video": "MER2023/videos/sample_00001910.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: That day, day, special. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "80",
        "video": "MER2023/videos/sample_00004866.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: But. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "81",
        "video": "MER2023/videos/sample_00006996.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: It's really a big two, not even listening to me. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "82",
        "video": "MER2023/videos/sample_00001726.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: nan You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "83",
        "video": "MER2023/videos/sample_00000372.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Blow it down! Sister Zhao, did you treat me like a fart? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "84",
        "video": "MER2023/videos/sample_00005047.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I rejected him, but Xiaoxiong actually went behind my back to sign a contract with him. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "85",
        "video": "MER2023/videos/sample_00002068.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Mom said, 'You have the ability to write a new song. I made a song for you to write.'. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "86",
        "video": "MER2023/videos/sample_00005556.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Neighbor polite: What is it? Isn't it just two more pairs of chopsticks? Looking back: Ah, let me tell your aunt, add two more dishes. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "87",
        "video": "MER2023/videos/sample_00003875.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: To be honest, in this lifetime, my sister-in-law is the most grateful person. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "88",
        "video": "MER2023/videos/sample_00001758.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Household care, you, say: you see, you, can't even keep an eye on you. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "89",
        "video": "MER2023/videos/sample_00005522.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: You've been picking on me more than once. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "90",
        "video": "MER2023/videos/sample_00004234.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: It's about living a particularly exciting life. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "91",
        "video": "MER2023/videos/sample_00003720.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Okay, do you sell coffee beans there? Just right, I want to buy some coffee. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "92",
        "video": "MER2023/videos/sample_00002646.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Because, I am willing to sacrifice my life, become a living, dead person. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "93",
        "video": "MER2023/videos/sample_00006057.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: We: Wei Wei, that bad man. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "94",
        "video": "MER2023/videos/sample_00006763.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Me: As a substitute for Zhou Bo? Me: No. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "95",
        "video": "MER2023/videos/sample_00005903.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: How do you know? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "96",
        "video": "MER2023/videos/sample_00003275.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Han Shi: A woman, if she really can't hold back in the car, it's not suitable. You can let her in. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "97",
        "video": "MER2023/videos/sample_00001787.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: No, Chen Jun, why are you so stingy? This. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "98",
        "video": "MER2023/videos/sample_00003488.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Don't you just take action, do you have that stinky money? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "99",
        "video": "MER2023/videos/sample_00005895.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: On June 11th, it's Shen Qiang, my boyfriend. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "100",
        "video": "MER2023/videos/sample_00000151.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Uncle, let's consider this matter carefully, even if you restart. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "101",
        "video": "MER2023/videos/sample_00000014.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Because you have already left, you should be very clear about it. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "102",
        "video": "MER2023/videos/sample_00005273.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Don't worry, I will never make any further demands on site in the future. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "103",
        "video": "MER2023/videos/sample_00002034.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Me: Didn't you say everything? Coffee, I'll deliver it to you then. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "104",
        "video": "MER2023/videos/sample_00003390.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: That's fine. I'll give that one now, give them a call,, ask. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "105",
        "video": "MER2023/videos/sample_00005260.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Okay, if there's anything else. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "106",
        "video": "MER2023/videos/sample_00005262.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Can I go to that occasion? I will be nervous. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "107",
        "video": "MER2023/videos/sample_00001002.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: A blind battle. What? Ah? Over. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "108",
        "video": "MER2023/videos/sample_00002516.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: But a painting that I dare not look at. Always. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "109",
        "video": "MER2023/videos/sample_00004285.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Sister Han, I really put the white scumbag man in his den. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "surprise"
            }
        ]
    },
    {
        "id": "110",
        "video": "MER2023/videos/sample_00001534.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: No, who, I don't know, you, you, are we a force? Bitter. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "111",
        "video": "MER2023/videos/sample_00001673.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Shame! I want your friends, colleagues to know about you. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "112",
        "video": "MER2023/videos/sample_00004807.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: That's a big Korean brand. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "113",
        "video": "MER2023/videos/sample_00005086.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: What happened? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "114",
        "video": "MER2023/videos/sample_00004280.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Is it interesting? Zhang Mengqing. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "115",
        "video": "MER2023/videos/sample_00001221.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I finally understand why my father always wears sunglasses. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "116",
        "video": "MER2023/videos/sample_00000288.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: That. You. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "117",
        "video": "MER2023/videos/sample_00000998.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: You say, how did I mess up so badly? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "118",
        "video": "MER2023/videos/sample_00004764.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Me. Not. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "119",
        "video": "MER2023/videos/sample_00003088.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Your audio company should. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "120",
        "video": "MER2023/videos/sample_00002061.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: 4、 Brother Jin. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "121",
        "video": "MER2023/videos/sample_00001131.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Come on. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "122",
        "video": "MER2023/videos/sample_00001584.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: No, parents. Tell me, when I was a Mazu, I didn't leave anything behind. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "123",
        "video": "MER2023/videos/sample_00006219.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: This is the legendary journey of saying 'go, go'. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "124",
        "video": "MER2023/videos/sample_00002037.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Feier. Feier. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "125",
        "video": "MER2023/videos/sample_00001778.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: It's none of your business! You first. First. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "126",
        "video": "MER2023/videos/sample_00007015.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Today, are you hungry. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "127",
        "video": "MER2023/videos/sample_00004560.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: History has always been a drama, with ten years of youth, ten years of old, weak. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "128",
        "video": "MER2023/videos/sample_00006167.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: It's not what you think, I'm not proud at all, you're not compatible, unlucky old man, I won't let you. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "129",
        "video": "MER2023/videos/sample_00002373.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: No, it's your father, two. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "130",
        "video": "MER2023/videos/sample_00001649.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Mmm. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "131",
        "video": "MER2023/videos/sample_00003737.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Everything you wear looks good. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "132",
        "video": "MER2023/videos/sample_00003483.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Definitely! Definitely! You also need to take care of yourself! Okay! Okay! Okay. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "133",
        "video": "MER2023/videos/sample_00005784.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Please, don't use my fingers, will you? Me. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "134",
        "video": "MER2023/videos/sample_00005460.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Just rest carefully. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "135",
        "video": "MER2023/videos/sample_00003956.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I hate this kind of scam the most. If there's anything, you know it. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "136",
        "video": "MER2023/videos/sample_00001136.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Instead of thinking about everything there, it's better to step forward, gently hug, so that you can know the most authentic feelings in your heart. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "137",
        "video": "MER2023/videos/sample_00006730.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Boss, don't brag. I'll come, pick up this car tomorrow. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "138",
        "video": "MER2023/videos/sample_00000708.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: But I'm not very good at it. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "139",
        "video": "MER2023/videos/sample_00006805.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Yuan Hao: Ah, with his girlfriend. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "140",
        "video": "MER2023/videos/sample_00003574.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Knowing one's intentions is difficult, but I am indeed afraid that you will be bullied. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "141",
        "video": "MER2023/videos/sample_00000780.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Transfer all this information to our sales manager. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "142",
        "video": "MER2023/videos/sample_00003453.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I thought you would be moved by him, so I agreed. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "143",
        "video": "MER2023/videos/sample_00005475.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: He wants me to attend their concert before I get married. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "144",
        "video": "MER2023/videos/sample_00001902.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Thank you! Can I ask where you found my resume? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "145",
        "video": "MER2023/videos/sample_00003944.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Where the hell is he? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "146",
        "video": "MER2023/videos/sample_00002226.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: They even invited Zhou Bo before, but he couldn't reschedule. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "147",
        "video": "MER2023/videos/sample_00004851.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I can understand the feeling in your heart. In fact, if it were for me, I might have felt it for a while. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "148",
        "video": "MER2023/videos/sample_00005210.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Then you need a phone number from Mr. Bao. I: There is a project in Nantong City. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "149",
        "video": "MER2023/videos/sample_00006685.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Are you calling this bribery? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "150",
        "video": "MER2023/videos/sample_00003450.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: If I'm wrong about what I've done, it's also my fault. The reason he didn't tell you was because he didn't want to hurt you. Go back. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "151",
        "video": "MER2023/videos/sample_00006369.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: What kind of clothes do I make to look good? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "152",
        "video": "MER2023/videos/sample_00000866.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Me: This car accident will cost a lot of money? Me: I don't have anything valuable anymore. Me: Just that car. Could you please, Dr. Cheng, help me see if you can sell that car. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "153",
        "video": "MER2023/videos/sample_00001895.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: There is no Black Mountain Alliance. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "154",
        "video": "MER2023/videos/sample_00001916.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: If I could have obtained such a job before I met you, I wouldn't hesitate. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "155",
        "video": "MER2023/videos/sample_00002203.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Ah. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "156",
        "video": "MER2023/videos/sample_00005461.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Not moving week, this time. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "157",
        "video": "MER2023/videos/sample_00004083.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Me. Don't go back. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "158",
        "video": "MER2023/videos/sample_00000734.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I am looking for humanities. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "159",
        "video": "MER2023/videos/sample_00006090.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I never told you before. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "160",
        "video": "MER2023/videos/sample_00004338.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: What? People? Where? This is menopause, with illness. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "161",
        "video": "MER2023/videos/sample_00005763.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Me: These people are already in their thirties, I think, hurry up. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "162",
        "video": "MER2023/videos/sample_00004392.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Not happy, then the two of them started fighting. That's all it takes. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "163",
        "video": "MER2023/videos/sample_00003929.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Saying: After eleven years, I still don't understand why my son was lost? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "164",
        "video": "MER2023/videos/sample_00004909.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: If you're all dead, I'll hate you for the rest of my life! I'm not worth it for him! It's really not worth it. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "165",
        "video": "MER2023/videos/sample_00001907.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: What about the words? He said it. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "166",
        "video": "MER2023/videos/sample_00002746.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Don't be pitiful, I'm recovering well. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "167",
        "video": "MER2023/videos/sample_00002472.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Mom, I: Both rooms have just been cleaned up. One room, half an hour, two rooms, one hour. Jin: Jump, you can play with the electricity for an extra hour. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "168",
        "video": "MER2023/videos/sample_00001937.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: When you arrive, big brother, at my age, you can do whatever kind of woman or thing you want. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "169",
        "video": "MER2023/videos/sample_00004789.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Can you stop thinking about others based on your own low-level outlook on life, values? I think that's your idea. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "170",
        "video": "MER2023/videos/sample_00001399.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: If you don't make up a perfect lie, I won't help you. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "171",
        "video": "MER2023/videos/sample_00004978.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: After this matter. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "172",
        "video": "MER2023/videos/sample_00002193.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I have decided. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "173",
        "video": "MER2023/videos/sample_00000748.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: It's rare to eat meat a few times. My dad always says he doesn't like meat, but now he realizes that he wants to leave meat for me, Xiu. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "174",
        "video": "MER2023/videos/sample_00004897.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: It is precisely because of his reverse nature, independent choice that Xiao Hong is destined to be immortal, famous, lonely,, deeply involved in future affairs. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "175",
        "video": "MER2023/videos/sample_00000277.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: This is my home. You. You. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "176",
        "video": "MER2023/videos/sample_00002897.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I tell you, Hu Qixing. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "177",
        "video": "MER2023/videos/sample_00004046.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Oh, then I won't leave, you also need to let Peng Jiahe know. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "178",
        "video": "MER2023/videos/sample_00001019.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: That. Not as good as. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "179",
        "video": "MER2023/videos/sample_00003089.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Sure enough, didn't you sleep? Peach: What about it? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "180",
        "video": "MER2023/videos/sample_00005276.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I tell you, the old man has left,, I am also sad, but what makes me sad for her is that she has such an unfilial person as you. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "181",
        "video": "MER2023/videos/sample_00006632.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: You know, when we listen to someone, don't listen to what they say, but rather listen to what they haven't said. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "182",
        "video": "MER2023/videos/sample_00002898.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Getting fired? So I fired him, he hates me, comes to retaliate against me, is it me? Just go kill him? How many people do you have to kill, Luo Ji? I won't be a serial killer anymore. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "183",
        "video": "MER2023/videos/sample_00002959.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: It was a divorce agreement with my dad, just my dad's silly hat. He still had a deep affection for him,, as a result, he didn't die until he returned home. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "184",
        "video": "MER2023/videos/sample_00000491.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: He is a good person. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "185",
        "video": "MER2023/videos/sample_00001071.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Well, I'll also cause you some trouble. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "186",
        "video": "MER2023/videos/sample_00002531.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I have the courage to come over, dance with you directly. As a result, you see, I have formed a combination, danced for four years at a time. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "187",
        "video": "MER2023/videos/sample_00001459.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: nan You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "188",
        "video": "MER2023/videos/sample_00005822.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Ahan. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "surprise"
            }
        ]
    },
    {
        "id": "189",
        "video": "MER2023/videos/sample_00002090.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Vivi, can we stop this child for now? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "190",
        "video": "MER2023/videos/sample_00005343.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: We may not be able to get up today. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "191",
        "video": "MER2023/videos/sample_00002305.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: In your heart, you are really a particularly kind person. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "192",
        "video": "MER2023/videos/sample_00003467.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Me. Me. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "193",
        "video": "MER2023/videos/sample_00000941.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: If we don't go back, how can we report to the company? How to talk to you. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "194",
        "video": "MER2023/videos/sample_00004761.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I admit, the reason I hope everyone comes is to let everyone know that you are with me now. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "195",
        "video": "MER2023/videos/sample_00000583.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Let's go, so fast. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "surprise"
            }
        ]
    },
    {
        "id": "196",
        "video": "MER2023/videos/sample_00003781.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Me. Me. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "197",
        "video": "MER2023/videos/sample_00006252.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Don't be polite to me, you just treat this place as your own home. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "198",
        "video": "MER2023/videos/sample_00001179.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: What's going on? You used to be nothing. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "199",
        "video": "MER2023/videos/sample_00005686.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Alas, your business is called Tang Qianyan. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "200",
        "video": "MER2023/videos/sample_00003613.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Auntie: Here. However. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "201",
        "video": "MER2023/videos/sample_00000449.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: It seems like she's really with a man. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "202",
        "video": "MER2023/videos/sample_00004059.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Okay. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "203",
        "video": "MER2023/videos/sample_00005124.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Please help, it's just right! You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "204",
        "video": "MER2023/videos/sample_00004237.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: But if we just forgive him like this, he doesn't think it's a big deal,, if he encounters the same situation again, he will make mistakes again. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "205",
        "video": "MER2023/videos/sample_00005141.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Let me not take it to heart, it's okay. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "206",
        "video": "MER2023/videos/sample_00006466.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Could you please leave me? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "207",
        "video": "MER2023/videos/sample_00004340.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Humph. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "208",
        "video": "MER2023/videos/sample_00001721.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Zhao: You. You. You. You're too brave! You. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "209",
        "video": "MER2023/videos/sample_00000060.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: My hallucinations. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "210",
        "video": "MER2023/videos/sample_00005399.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Ah. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "surprise"
            }
        ]
    },
    {
        "id": "211",
        "video": "MER2023/videos/sample_00004188.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I don't know, because it may involve human life. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "212",
        "video": "MER2023/videos/sample_00002512.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Is everything in this house running out of oil. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "213",
        "video": "MER2023/videos/sample_00005304.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Li Jun: Committee. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "214",
        "video": "MER2023/videos/sample_00006338.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Do you think Cai Weilan is deep? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "215",
        "video": "MER2023/videos/sample_00004202.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I just landed, pushed someone else's French feast, so I'll go with you to have a small hot pot for 28 yuan. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "216",
        "video": "MER2023/videos/sample_00006324.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Apple: Knight: Suspected of excluding, insulting China when dealing with the Chinese market. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "217",
        "video": "MER2023/videos/sample_00006643.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Friends, friends, help friends. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "218",
        "video": "MER2023/videos/sample_00004928.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Guangyao: Ah, were you on the phone just now? That attitude is quite good. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "219",
        "video": "MER2023/videos/sample_00005961.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: He is dead. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "surprise"
            }
        ]
    },
    {
        "id": "220",
        "video": "MER2023/videos/sample_00004758.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Marry anyone, you can't marry this grandson. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "221",
        "video": "MER2023/videos/sample_00004905.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: So, what about the Dean's wife? No, yes, you, tell me again, I. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "surprise"
            }
        ]
    },
    {
        "id": "222",
        "video": "MER2023/videos/sample_00002981.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: You don't have money,, your phone is not here. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "223",
        "video": "MER2023/videos/sample_00000636.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Person: I don't care about you anymore. Think about it yourself, express yourself. The best thing is to be more romantic. Xiao Meng: If you're not satisfied, don't say it. I haven't reminded you before. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "224",
        "video": "MER2023/videos/sample_00000994.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Go to the office, go, let's handle this matter. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "225",
        "video": "MER2023/videos/sample_00003425.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: nan You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "226",
        "video": "MER2023/videos/sample_00003443.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: After seeing that Yang Dan had already fallen to the ground. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "227",
        "video": "MER2023/videos/sample_00002650.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: What are you doing at this moment? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "228",
        "video": "MER2023/videos/sample_00005158.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: But what I don't understand is why Shen Wei committed suicide. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "229",
        "video": "MER2023/videos/sample_00000257.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: So, he had no intention of coming to pick me up with your father. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "230",
        "video": "MER2023/videos/sample_00005982.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Open your little eyes, take a look. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "231",
        "video": "MER2023/videos/sample_00006886.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I don't believe it anymore, for over a decade. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "232",
        "video": "MER2023/videos/sample_00005104.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Spending money to invite people, having a meal,, having to scratch the surface of one's mouth. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "233",
        "video": "MER2023/videos/sample_00000516.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Hello, I would like to order a meal. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "234",
        "video": "MER2023/videos/sample_00006873.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Well, there's a hot pot restaurant that's quite suitable. I think the environment is also quite good,, it's okay. It comes with its own drinks. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "235",
        "video": "MER2023/videos/sample_00001857.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: What I am doing now, what I will do in the future are all voluntary. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "236",
        "video": "MER2023/videos/sample_00003387.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I tell you, he is clearly blackmailing us now. You should dare to go to the hospital with him. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "237",
        "video": "MER2023/videos/sample_00002883.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I warn you, saying: I can. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "238",
        "video": "MER2023/videos/sample_00004363.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Cao Dapeng, you should go to work immediately, otherwise I will dismiss you immediately. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "239",
        "video": "MER2023/videos/sample_00002745.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Under his name, one cannot kneel under your name, in order to avoid getting married to that house. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "240",
        "video": "MER2023/videos/sample_00000317.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Mom, you were so beautiful when you were young. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "241",
        "video": "MER2023/videos/sample_00002884.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: You are quite chaotic, you must not mess with people like that. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "242",
        "video": "MER2023/videos/sample_00002289.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Don't you know what kind of person I am? Am I someone who values money highly? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "243",
        "video": "MER2023/videos/sample_00005757.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I'm not saying it, this time it came very suddenly. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "244",
        "video": "MER2023/videos/sample_00002744.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: The family village, of course, was an important growth environment for me when I was young. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "245",
        "video": "MER2023/videos/sample_00000929.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: He, Cao Xiaoqiang, is wishful thinking. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "246",
        "video": "MER2023/videos/sample_00003176.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: The power supply has a single style,, once connected, it cannot be moved. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "247",
        "video": "MER2023/videos/sample_00003093.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: The best must be in the back. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "248",
        "video": "MER2023/videos/sample_00006116.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Me: It's delicious, delicious. Me: Serve you, I'll offer you up. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "249",
        "video": "MER2023/videos/sample_00006716.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: This is simply a person with character issues, you cannot follow the girl. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "250",
        "video": "MER2023/videos/sample_00006878.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: So, I came specifically, thank you. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "251",
        "video": "MER2023/videos/sample_00002580.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Ye. Ye. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "252",
        "video": "MER2023/videos/sample_00004752.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Do you like him? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "253",
        "video": "MER2023/videos/sample_00006581.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Wine talk, can you talk nonsense? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "254",
        "video": "MER2023/videos/sample_00002790.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: You need to go find Cao Xiaoqiang again. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "255",
        "video": "MER2023/videos/sample_00000637.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I'm not available. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "256",
        "video": "MER2023/videos/sample_00004653.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: He called me that day, but I didn't receive it. It was in a car accident. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "257",
        "video": "MER2023/videos/sample_00002609.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: That means there's no problem, you. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "258",
        "video": "MER2023/videos/sample_00005762.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Well, I won't wait for him anymore. It's like this. A month ago, my former boss of the company proposed merger, acquisition discipline to me. It's okay. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "259",
        "video": "MER2023/videos/sample_00006875.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: That. Marriage law, for those who have children, it is decided. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "260",
        "video": "MER2023/videos/sample_00000310.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Oh, don't ask him to come. You should make an appointment to go outside, in case that Zhu Jie comes, makes trouble again. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "261",
        "video": "MER2023/videos/sample_00006663.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: How do I feel so complicated? But, I just look at it this way,, I can't understand it either. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "262",
        "video": "MER2023/videos/sample_00006516.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: These people are very dangerous, should I go, give them to you? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "263",
        "video": "MER2023/videos/sample_00000724.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Come on, call me now, don't waste your time, don't waste my time, okay? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "264",
        "video": "MER2023/videos/sample_00000311.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I have to treat you to a meal. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "265",
        "video": "MER2023/videos/sample_00005847.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: At the press conference, yes, yes, I'm not good, I know. I've been apologizing, such as counting five steps, so I decided to resign to show responsibility. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "266",
        "video": "MER2023/videos/sample_00002762.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Do you treat this place like a court of law? Is that right? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "267",
        "video": "MER2023/videos/sample_00003570.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I understand. I can handle this matter. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "268",
        "video": "MER2023/videos/sample_00000532.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: No, you still keep it now. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "surprise"
            }
        ]
    },
    {
        "id": "269",
        "video": "MER2023/videos/sample_00003833.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Can I offset Shanshan's ten years of struggle, waiting? Do you think I can accept being a mother? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "270",
        "video": "MER2023/videos/sample_00003992.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: More importantly, no company will. Xin. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "271",
        "video": "MER2023/videos/sample_00003221.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: The patient's family requested an apology, but you were stunned again. What do you really want to do? Cannot go back, forth. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "272",
        "video": "MER2023/videos/sample_00006531.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: If it's cancelled, then cancel it. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "273",
        "video": "MER2023/videos/sample_00002834.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I'm just curious, so why are you there? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "274",
        "video": "MER2023/videos/sample_00002188.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Discuss, discuss, the marriage of these two children. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "275",
        "video": "MER2023/videos/sample_00002619.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: You two. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "276",
        "video": "MER2023/videos/sample_00003314.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Zhang Hai: Or. Grievance. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "277",
        "video": "MER2023/videos/sample_00004856.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: 3、 Fifty thousand? Is it too low? Me: This car, you can buy it anyway,, it will sell for 80000 yuan. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "surprise"
            }
        ]
    },
    {
        "id": "278",
        "video": "MER2023/videos/sample_00004153.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Do you also want me to disappear? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "279",
        "video": "MER2023/videos/sample_00001815.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: He, a fortune magazine all day long, big. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "280",
        "video": "MER2023/videos/sample_00003755.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I know, you two can be very good,, I am particularly happy. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "281",
        "video": "MER2023/videos/sample_00006971.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I will also do my best to take care of, protect his daughter. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "282",
        "video": "MER2023/videos/sample_00003856.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I didn't expect this to happen. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "283",
        "video": "MER2023/videos/sample_00003020.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Why does he blame us? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "284",
        "video": "MER2023/videos/sample_00001898.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Your father went to the store early in the morning. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "285",
        "video": "MER2023/videos/sample_00002831.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Team leader: Just come out this time, definitely. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "286",
        "video": "MER2023/videos/sample_00005666.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: But what about others? They have told you all their shortcomings. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "287",
        "video": "MER2023/videos/sample_00002066.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: This wine, wine, not to drink, watermelon juice, watermelon juice not to drink, what are you doing? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "288",
        "video": "MER2023/videos/sample_00006142.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Thirty three. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "surprise"
            }
        ]
    },
    {
        "id": "289",
        "video": "MER2023/videos/sample_00005998.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Hey, last night, I had a dream about a little girl. Male. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "290",
        "video": "MER2023/videos/sample_00006049.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Brother Siqing, are you saying thank you to me for the first time? Me: It's all a bit. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "291",
        "video": "MER2023/videos/sample_00002900.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: You understand, what do you think you know. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "292",
        "video": "MER2023/videos/sample_00003175.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: How could this happen. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "293",
        "video": "MER2023/videos/sample_00005306.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I use our hometown's agricultural, sideline products,, I have already decided on the name of the store, which is called \"Xiao\". You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "294",
        "video": "MER2023/videos/sample_00006567.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I would like to take a moment back to the station, say, 'I've been very busy lately, I haven't.'. Time. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "295",
        "video": "MER2023/videos/sample_00003535.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Okay, first treat us all to a meal. It's necessary,, I'll take my time with the rest. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "296",
        "video": "MER2023/videos/sample_00006138.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Okay, then. Me. Me. Loading is so expensive. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "297",
        "video": "MER2023/videos/sample_00003455.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: There won't be such a meaningful Spring Festival again, so please remember this. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "298",
        "video": "MER2023/videos/sample_00006316.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Rent a house? Me: When did you say you wanted to rent a house? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "299",
        "video": "MER2023/videos/sample_00001047.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: You see, here, on Saturday. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "300",
        "video": "MER2023/videos/sample_00005147.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Wolves with this temperament are all the same. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "301",
        "video": "MER2023/videos/sample_00005405.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Are you particularly afraid of me being here? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "302",
        "video": "MER2023/videos/sample_00003094.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: That's too much! Call the Consumer Association. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "303",
        "video": "MER2023/videos/sample_00004032.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Thank you! You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "304",
        "video": "MER2023/videos/sample_00000766.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Don't think about it, it depends on fate. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "305",
        "video": "MER2023/videos/sample_00003532.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: nan You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "surprise"
            }
        ]
    },
    {
        "id": "306",
        "video": "MER2023/videos/sample_00002543.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Can you sit down, listen to me for a few words. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "307",
        "video": "MER2023/videos/sample_00004548.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Not good, it's just Mom. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "308",
        "video": "MER2023/videos/sample_00006419.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Planning clearly belongs to the difficulty of execution. 5、 Pieces. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "309",
        "video": "MER2023/videos/sample_00003279.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: A despicable way of inciting students to do such things. You. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "310",
        "video": "MER2023/videos/sample_00006757.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Whoever plays this pole far wins. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "311",
        "video": "MER2023/videos/sample_00004937.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Ah. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "312",
        "video": "MER2023/videos/sample_00005421.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Zhou Chuang: I'm in Director Li's office. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "surprise"
            }
        ]
    },
    {
        "id": "313",
        "video": "MER2023/videos/sample_00001772.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Those channel merchants have been cooperating with us for decades, leading to lawsuits. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "314",
        "video": "MER2023/videos/sample_00003369.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: He needs to taste this taste. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "315",
        "video": "MER2023/videos/sample_00000849.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: The head office can arrange opportunities for employees to travel, study abroad. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "316",
        "video": "MER2023/videos/sample_00001638.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: You don't need to understand, I can come, see you, that's all. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "317",
        "video": "MER2023/videos/sample_00000452.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Find Earthworm Sister Fan to teach you a truth: When a man chases you, it's always when she puts the most effort into it. If at this time, she won't even give money. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "318",
        "video": "MER2023/videos/sample_00005063.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Now, I don't need it anymore. You can go now. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "319",
        "video": "MER2023/videos/sample_00003381.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I am already very happy. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "320",
        "video": "MER2023/videos/sample_00000825.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Are you following the car? Still. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "321",
        "video": "MER2023/videos/sample_00006900.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: That's it, you just need to pull strings, build bridges. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "322",
        "video": "MER2023/videos/sample_00003584.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Isn't it having a child, as for being so awesome? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "323",
        "video": "MER2023/videos/sample_00006826.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: But I am not competing with you. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "324",
        "video": "MER2023/videos/sample_00001099.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: That's right, am I falling behind? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "325",
        "video": "MER2023/videos/sample_00003346.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Um, yes, yes, I am Mr. Wu. Hello. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "326",
        "video": "MER2023/videos/sample_00003452.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Of course it is. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "327",
        "video": "MER2023/videos/sample_00001901.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: It takes at least twenty seconds for Lin's car to drive from the accident site to the exit. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "328",
        "video": "MER2023/videos/sample_00002127.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: So, sit down, I'll go first. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "329",
        "video": "MER2023/videos/sample_00003522.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Bicycle: The matter with Liu Siming has something to do with me. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "330",
        "video": "MER2023/videos/sample_00001227.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Is everything okay at home. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "331",
        "video": "MER2023/videos/sample_00006574.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Me: I think you two are quite suitable. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "332",
        "video": "MER2023/videos/sample_00003516.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I feel like, ah, this Ouyang used to be back. You see, staying in the emergency department is great. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "333",
        "video": "MER2023/videos/sample_00005275.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Thank you! You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "334",
        "video": "MER2023/videos/sample_00000422.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: So, do you think I used to be that good, or am I now? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "335",
        "video": "MER2023/videos/sample_00002524.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I won't become a machine that only knows how to copy for the sake of making money. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "336",
        "video": "MER2023/videos/sample_00001291.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Come, find you. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "337",
        "video": "MER2023/videos/sample_00006190.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Long time, long time, long time. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "surprise"
            }
        ]
    },
    {
        "id": "338",
        "video": "MER2023/videos/sample_00004914.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Me: Go to Wei Ye's house. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "339",
        "video": "MER2023/videos/sample_00003198.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Mei: Is there any solution now, or is there someone who needs help. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "340",
        "video": "MER2023/videos/sample_00000026.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Mmm. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "341",
        "video": "MER2023/videos/sample_00002919.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Okay: If you need any help. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "342",
        "video": "MER2023/videos/sample_00003222.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Both. I've brought a few of them, so now that I'm an adult, you won't mind me anymore. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "343",
        "video": "MER2023/videos/sample_00000236.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I have to go, Xiao. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "344",
        "video": "MER2023/videos/sample_00004880.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: That. So, how do you live alone? Where do you live? have No. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "345",
        "video": "MER2023/videos/sample_00005577.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: How much effort has been put into it! You two, you can't let us fight like a bamboo basket. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "346",
        "video": "MER2023/videos/sample_00004394.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: You and? This is my problem,, the reason I chose to break up with him is because there is a high possibility. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "347",
        "video": "MER2023/videos/sample_00005715.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: nan You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "348",
        "video": "MER2023/videos/sample_00007022.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: At that time, I thought the same thing, but there were tags hanging on each bag,, they were all on top. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "349",
        "video": "MER2023/videos/sample_00001376.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: nan You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "350",
        "video": "MER2023/videos/sample_00005348.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Shanghai Meishu Junior College. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "351",
        "video": "MER2023/videos/sample_00006820.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Dad. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "352",
        "video": "MER2023/videos/sample_00001991.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I just told him to calm down at the time. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "353",
        "video": "MER2023/videos/sample_00004030.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Ah. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "surprise"
            }
        ]
    },
    {
        "id": "354",
        "video": "MER2023/videos/sample_00003069.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Take care of yourself, drink less alcohol. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "355",
        "video": "MER2023/videos/sample_00000618.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I no longer have the qualifications to pursue a simple emotional relationship. In reality, everyone is measuring you with a ruler. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "356",
        "video": "MER2023/videos/sample_00003181.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Don't say that, I can't even do it for Jiahe. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "357",
        "video": "MER2023/videos/sample_00005045.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Hehe. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "358",
        "video": "MER2023/videos/sample_00001568.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Work, it's work, I have no problem. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "359",
        "video": "MER2023/videos/sample_00004455.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: What would you choose as the leader of the pursuit, escape team? Case? Me. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "360",
        "video": "MER2023/videos/sample_00002629.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Some people who borrow money simply don't take it,, if they own it, they will take it as a pawn,, outsiders will adopt my children, which is not clear. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "361",
        "video": "MER2023/videos/sample_00005539.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Me: No way! He: If you don't speak clearly. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "362",
        "video": "MER2023/videos/sample_00005075.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: You wait for our house, I don't want to see you again. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "363",
        "video": "MER2023/videos/sample_00005094.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I just can't let him go. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "364",
        "video": "MER2023/videos/sample_00006309.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: It's okay, it's pretty good, that's it. Last night. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "365",
        "video": "MER2023/videos/sample_00006253.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Ah. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "366",
        "video": "MER2023/videos/sample_00002843.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Come to my office. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "367",
        "video": "MER2023/videos/sample_00000408.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I don't have either. He has money, so you. Both. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "368",
        "video": "MER2023/videos/sample_00005600.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Teacher Sun, he is my idol. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "369",
        "video": "MER2023/videos/sample_00003873.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Why? Admit your mistake? Me: Why admit your mistake? Do their patient families not have any responsibility? Why? Have all the questions been pushed onto me? Me: Of course not. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "370",
        "video": "MER2023/videos/sample_00004224.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: No, where are you going? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "surprise"
            }
        ]
    },
    {
        "id": "371",
        "video": "MER2023/videos/sample_00003463.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Attention, you, calm down, I'm also a doctor! Is it possible. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "372",
        "video": "MER2023/videos/sample_00000470.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Are parents clear now? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "373",
        "video": "MER2023/videos/sample_00002161.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: However, it seems that this time, including playing the beautiful contract. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "374",
        "video": "MER2023/videos/sample_00003517.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Me: Rationale: Isn't there a chair? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "375",
        "video": "MER2023/videos/sample_00002540.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Is it incurable? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "376",
        "video": "MER2023/videos/sample_00000386.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Peaches, plum kui, in case it doesn't work out. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "377",
        "video": "MER2023/videos/sample_00000148.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Brother-in-law. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "surprise"
            }
        ]
    },
    {
        "id": "378",
        "video": "MER2023/videos/sample_00000649.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I can't accompany you, I have to read at home. This is the homework assigned for my brother. Today. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "379",
        "video": "MER2023/videos/sample_00005263.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I want my sister to finish high school well, get into college well. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "380",
        "video": "MER2023/videos/sample_00006961.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: General, weren't you at the scene? How can you see it? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "381",
        "video": "MER2023/videos/sample_00002769.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: What's wrong? You? Hurry up! What's wrong? You. What's wrong? You. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "382",
        "video": "MER2023/videos/sample_00005401.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Ah. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "383",
        "video": "MER2023/videos/sample_00002634.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Moreover, whether it's people, work, personality, or getting married. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "384",
        "video": "MER2023/videos/sample_00005112.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: We want to. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "385",
        "video": "MER2023/videos/sample_00004295.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: However, emotional matters must be my own responsibility. This matter comes to an end, will not be discussed again. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "386",
        "video": "MER2023/videos/sample_00001006.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Oh, what's wrong. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "worried"
            }
        ]
    },
    {
        "id": "387",
        "video": "MER2023/videos/sample_00000402.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Laozi: No more! Me: Who? On the internet platform? Send a glass of wine? Me: All stabbed in his face. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "388",
        "video": "MER2023/videos/sample_00002761.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Okay, okay, you can make your decision, so can I. This family doesn't welcome you. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "389",
        "video": "MER2023/videos/sample_00000008.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Wait for my dad, leave me here alone. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "390",
        "video": "MER2023/videos/sample_00001498.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: No, it's about eating. It's about not eating. Me: I'm waiting for the next shift, there's something else to do. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "391",
        "video": "MER2023/videos/sample_00003150.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: So fat, can you get along with people? Over 300 pounds, this bus is like a pig. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "392",
        "video": "MER2023/videos/sample_00001237.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: If you're still angry, beat me up. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "393",
        "video": "MER2023/videos/sample_00003594.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: I'm going to see Empress, Xiaonan. If they're okay, let them accompany me for a walk. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "happy"
            }
        ]
    },
    {
        "id": "394",
        "video": "MER2023/videos/sample_00000947.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Date: Sitting in this position naturally gives you the ability to distinguish right from wrong. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "id": "395",
        "video": "MER2023/videos/sample_00006954.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: This experience has been experienced, as well as being played in Ligue 1. This teammate was sent off the field,, then this team was affected. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "396",
        "video": "MER2023/videos/sample_00003601.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Where are you? What are you doing? You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "surprise"
            }
        ]
    },
    {
        "id": "397",
        "video": "MER2023/videos/sample_00002346.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: The most expensive one costs ten thousand yuan. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    },
    {
        "id": "398",
        "video": "MER2023/videos/sample_00005426.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: Me: Son, when you go to the hospital, you should go back to Jinan. He stayed in Tangshan because of you. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "angry"
            }
        ]
    },
    {
        "id": "399",
        "video": "MER2023/videos/sample_00004672.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe person in video says: In two days, we will send you back to your hometown. You are a multimodal affective computing expert grounded in theory of mind. Your task is to infer the speaker’s current emotional state by reasoning about their mental world from multimodal evidence. Complete the multimodal emotion recognition task as follows: (1) Decode Affective Signals (Perceptual Simulation): Systematically decode observable cues in each available channel—visual behavior (face, gaze, posture, gesture), speech prosody (tone, pitch, intensity, tempo) if available, and lexical content/transcript if provided—as outward evidence of an inner state. Briefly note the present cues and their likely emotion implications. (2) Synthesize a Mental State Hypothesis (Cognitive Empathy): Integrate the decoded cues into a coherent first-order hypothesis about what the speaker is feeling now. If the cues converge, state that they reinforce the same valence. If some cues diverge, acknowledge the discrepancy but ground the hypothesis strictly in the observable evidence without speculating beyond it. (3) Attribute from the Speaker’s Perspective (Perspective-Taking): Refine the hypothesis to attribute the single most plausible here-and-now emotional state from the speaker’s perspective. Exclude interpretations based on the viewer’s reaction or on the speaker’s stable traits. (4) Conclude and Map to Label: Provide your reasoning following the steps above, then choose exactly one label from: happy, sad, neutral, angry, worried, surprise, and output exactly {'emotion': 'label'}."
            },
            {
                "from": "gpt",
                "value": "neutral"
            }
        ]
    }
]
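Every record above follows the same two-turn schema: a "human" turn carrying the <video> placeholder, the transcript, and the prompt, followed by a "gpt" turn carrying the gold label. Below is a minimal loading-and-validation sketch in Python; the filename mer2023_tom_sft.json is a placeholder rather than a path defined in this file, and the unpacking assumes every record has exactly the two-turn pattern shown here. Note that, as in the entries above, the gold answer is stored as the bare label, not the {'emotion': 'label'} string the prompt asks the model to emit.

import json

# The six classes named in the prompt's step (4).
ALLOWED_LABELS = {"happy", "sad", "neutral", "angry", "worried", "surprise"}

# Placeholder filename: point this at wherever this JSON file is saved.
with open("mer2023_tom_sft.json", encoding="utf-8") as f:
    entries = json.load(f)

for entry in entries:
    # Assumes one human turn followed by one gpt turn, as in every record above.
    human_turn, gpt_turn = entry["conversations"]
    assert human_turn["from"] == "human", entry["id"]
    # The "\n" in the raw JSON is an escaped newline, so after json.load the
    # prompt value really begins with "<video>" plus a newline.
    assert human_turn["value"].startswith("<video>\n"), entry["id"]
    assert gpt_turn["from"] == "gpt", entry["id"]
    # Gold answers are stored as bare labels drawn from the six-way set.
    assert gpt_turn["value"] in ALLOWED_LABELS, (entry["id"], gpt_turn["value"])

print(f"validated {len(entries)} entries")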