Keywords: multimodal fusion, representation learning, conditional gating, pre-stroke anticipation, intention prediction
Abstract: Predicting the future in dynamic environments requires reasoning about the intentions of agents from rich, multimodal data. We introduce a novel machine learning problem, pre-intervention anticipation: forecasting outcomes before an action is completed by fusing contextual cues with ongoing sensor data. To address this, we propose ConFu, a general neural architecture featuring two key innovations: (1) a conditional gating mechanism that dynamically modulates primary features (e.g., trajectory) based on secondary context (e.g., intention cues), and (2) a cross-fusion strategy for systematic multi-stage integration of heterogeneous modalities. On a real-world badminton dataset comprising 13,582 strokes, ConFu achieves a prediction accuracy of 92.6% with a mean absolute error of 0.20 meters, significantly outperforming existing methods by 7.8-10.5% in accuracy. ConFu also provides immediate tactical feedback, reducing decision time by 85% compared to trajectory-based approaches; this time advantage is particularly valuable for practical applications such as enabling badminton robots to compute interception strategies. Our work establishes a foundation for intention-aware prediction, with broader implications for robotics, autonomous systems, and human-AI interaction. Code will be released for reproducibility (https://anonymous.4open.science/r/AI-Sport18-BFE9/README.md).
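The conditional gating idea described in the abstract can be illustrated with a minimal sketch: a sigmoid gate computed from the secondary context (e.g., intention cues) elementwise modulates the primary feature stream (e.g., a trajectory embedding). This is a generic gating pattern written for illustration; the function name `conditional_gate` and the weight shapes are assumptions, not the paper's actual implementation.

```python
import numpy as np

def conditional_gate(primary, context, W, b):
    """Modulate primary features with a gate derived from secondary context.

    primary: (batch, d_p) primary-modality features, e.g. trajectory embedding
    context: (batch, d_c) secondary-context features, e.g. intention cues
    W, b:    projection from context space to a per-feature gate

    The sigmoid keeps each gate value in (0, 1), so context can only
    attenuate or pass through primary features, never amplify them.
    """
    gate = 1.0 / (1.0 + np.exp(-(context @ W + b)))  # (batch, d_p), in (0, 1)
    return primary * gate

# Toy usage with random features (shapes are illustrative only)
rng = np.random.default_rng(0)
primary = rng.normal(size=(4, 8))
context = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 8))
b = np.zeros(8)
gated = conditional_gate(primary, context, W, b)
```

Because the gate lies strictly in (0, 1), the gated output never exceeds the primary features in magnitude; the context decides, per feature, how much of the primary signal to let through.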
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 20040