Reading the Room: Learning Group States Beyond Pooled Individual Signals

Navid Salami Pargoo; Kumar Akash; Teruhisa Misu; Zahra Zahedi; Jorge Ortiz; Zhaobo Zheng

20 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: Multimodal Representation Learning, Group Dynamics, Emergent States, Social Signal Processing

Abstract: The fundamental challenge in modeling group dynamics is that collective states arise from interdependent processes that violate the standard assumption of independent observations. We study this in a triadic Collaborative Problem-Solving (CPS) task where five constructs (group synchrony, group confidence, group interaction phase, individual engagement, individual leadership) are annotated as directional trends (increase / stable / decrease) over short windows. We hypothesize that group-level states are not reliably recovered from pooled individual features due to the aggregation fallacy. To test this, we introduce Syntality, a benchmark with participant-indexed multimodal streams and paired individual and group trend labels, and SyntalNet, an architecture that satisfies three minimal requirements: (i) permutation-equivariant cross-participant fusion, (ii) mask-aware intra-modality fusion, and (iii) low-rank cross-modal interactions. On Syntality, SyntalNet consistently outperforms additive baselines, improving group-level macro-F1 from 0.37–0.49 to 0.62 while also achieving strong individual-level performance (e.g., 0.64 balanced accuracy / 0.85 AUROC / 0.63 F1 on engagement; 0.60 / 0.78 / 0.58 on leadership). Under 5-fold cross-validation, these gains are statistically significant. In a leave-one-group-out (LOGO) setting, zero-shot macro-F1 on held-out teams recovers to 0.47 when fine-tuning on 25% of the target group, and to 0.58 when freezing the encoder and adapting only classifier heads. Ablation studies confirm that explicit cross-participant fusion, intra-modality fusion, and low-rank cross-modal fusion each contribute to robustness under various corruption scenarios. Critically, our results provide empirical evidence that pooled individual signals yield performance statistically equivalent to constant predictors on group states in this triadic CPS setting, highlighting the necessity of explicit cross-participant modeling.
We will release our dataset with processed features, code repository, and models upon acceptance.
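
The aggregation fallacy named in the abstract can be illustrated with a toy sketch. The code below is not SyntalNet and uses no Syntality data: the two triads, the signal values, and the helper names (`pooled_features`, `pairwise_synchrony`) are hypothetical, and mean pairwise Pearson correlation is assumed here only as a simple stand-in for an explicit cross-participant feature. It shows two groups whose pooled per-participant summaries are identical while their cross-participant structure differs.

```python
from itertools import combinations, permutations
from statistics import mean, pvariance

# Hypothetical toy triads: one scalar signal per participant over a short window.
# In-sync group: all three participants move together.
GROUP_IN_SYNC = [[0.0, 1.0, 0.0, 1.0], [0.0, 1.0, 0.0, 1.0], [0.0, 1.0, 0.0, 1.0]]
# Out-of-sync group: the second participant moves in anti-phase.
GROUP_OUT_OF_SYNC = [[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]


def pooled_features(group):
    """Summarize each participant alone (mean, variance), then average over participants."""
    summaries = [(mean(signal), pvariance(signal)) for signal in group]
    return tuple(mean(column) for column in zip(*summaries))


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def pairwise_synchrony(group):
    """Mean correlation over all participant pairs: an explicit cross-participant feature."""
    return mean(pearson(x, y) for x, y in combinations(group, 2))


pooled_in_sync = pooled_features(GROUP_IN_SYNC)
pooled_out_of_sync = pooled_features(GROUP_OUT_OF_SYNC)
sync_in_sync = pairwise_synchrony(GROUP_IN_SYNC)
sync_out_of_sync = pairwise_synchrony(GROUP_OUT_OF_SYNC)

# Both features ignore participant order, so relabeling participants changes nothing.
order_invariant = all(
    pooled_features(list(p)) == pooled_out_of_sync
    and abs(pairwise_synchrony(list(p)) - sync_out_of_sync) < 1e-9
    for p in permutations(GROUP_OUT_OF_SYNC)
)

print("pooled:", pooled_in_sync, pooled_out_of_sync)
print("pairwise synchrony:", sync_in_sync, sync_out_of_sync)
```

In this toy case the pooled summaries coincide, so any classifier that sees only them must give both triads the same group label, whereas the pairwise feature separates them. This illustrates the hypothesis only; it says nothing about how the reported results were obtained.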

Supplementary Material: zip

Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning

Submission Number: 22771
