T3Set: A Multimodal Dataset with Targeted Suggestions for LLM-based Virtual Coach in Table Tennis Training
Abstract: Coaching is critical for learning table tennis skills, yet amateur players often lack access to professional coaches due to high costs and a shortage of coaches. While recent multimodal large language models show promise as virtual coaches, most existing approaches rely solely on video analysis, which is not comprehensive: many important kinematic details in table tennis (e.g., strength, acceleration) cannot be captured by video and can only be tracked with sensors. To address this gap, we present T3Set (Table Tennis Training Set), a multimodal dataset that synchronizes inertial measurement unit (IMU) data from sensors mounted on 32 players' rackets with video recordings. The sensor data has 16 dimensions and a sampling rate of 100 Hz. The dataset covers 7 fundamental techniques across 380 training rounds, totaling 8,655 annotated strokes with 8,395 targeted suggestions from coaches. The key features of T3Set are (1) temporal alignment among sensor, video, and text data, and (2) high-quality targeted suggestions consistent with a predefined suggestion taxonomy. Based on T3Set, we propose a novel two-stage framework that integrates motion perception with generative reasoning to act as a virtual coach. Our method quantitatively outperforms baseline methods. The dataset, code, and documentation are available at https://github.com/jima-cs/T3Set.
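The temporal alignment between 100 Hz IMU samples and video frames described above can be sketched as follows. This is a minimal illustration only: the field layout, frame rate, and function names are assumptions, not the actual T3Set schema (which is documented in the repository).

```python
import numpy as np

# Hypothetical sketch of frame-to-sample alignment.
# The 100 Hz IMU rate and 16 dimensions come from the abstract;
# the 30 fps video rate is an assumption for illustration.
IMU_RATE_HZ = 100
VIDEO_FPS = 30

def imu_index_for_frame(frame_idx: int) -> int:
    """Map a video frame index to the nearest IMU sample index."""
    t = frame_idx / VIDEO_FPS          # frame timestamp in seconds
    return round(t * IMU_RATE_HZ)      # nearest 100 Hz sample

def align_stroke(imu: np.ndarray, start_frame: int, end_frame: int) -> np.ndarray:
    """Slice the 16-dim IMU stream for a stroke annotated by video frames."""
    lo = imu_index_for_frame(start_frame)
    hi = imu_index_for_frame(end_frame)
    return imu[lo:hi + 1]              # shape: (samples, 16)

imu_stream = np.zeros((IMU_RATE_HZ * 60, 16))   # one minute of 16-dim IMU data
segment = align_stroke(imu_stream, start_frame=30, end_frame=60)
print(segment.shape)                   # (101, 16): 30 frames at 30 fps span ~1 s of IMU data
```

Timestamp-based indexing like this keeps the three modalities (sensor, video, text annotations) addressable by a single clock, which is what makes per-stroke supervision possible.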
External IDs: doi:10.1145/3711896.3737407