Exploring Fine-Grained Human Motion Video Captioning

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission · Readers: Everyone
Abstract: Fine-grained human motion descriptions are crucial for fitness training and health management, which brings the problem of fine-grained human motion video-to-text generation into focus. Previous video captioning models, including LLM-driven ones, fall short of capturing the fine-grained semantics of videos, and the descriptions they generate are brief and lack fine detail about human motion. Existing methods trained on short, coarse-grained ground-truth descriptions therefore leave room for improvement, especially since datasets with fine-grained long-text annotations are scarce. In this paper, we construct a fine-grained motion video captioning dataset named BoFit (Body Fitness Training), composed of fitness training videos paired with human motion descriptions at step granularity in time and at body-trunk granularity in space. We also implement a state-of-the-art baseline named PoseGPT, built on the 3D human pose estimation model MotionBERT: it extracts angular representations of the videos and encodes them into prompts, which LLMs then use to generate fine-grained descriptions of human motion. Results show that PoseGPT outperforms previous methods across comprehensive metrics. We hope this dataset serves as a useful evaluation set for visio-linguistic models and drives further progress in this field.
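
The following is a minimal sketch of the angle-to-prompt step the abstract describes: per-frame 3D joint positions (such as a MotionBERT-style estimator would output) are converted into joint angles and serialized into a prompt for an LLM. The joint indices, angle triples, frame subsampling rate, and function names are illustrative assumptions for a common 17-joint (Human3.6M-style) skeleton, not the paper's actual interface.

```python
# Sketch: turn 3D pose estimates into an angular-representation prompt.
# Assumes a (T, 17, 3) keypoint array in Human3.6M joint order; the
# helper names and prompt wording here are hypothetical.
import numpy as np

def joint_angle(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
    """Angle in degrees at joint b, formed by segments b->a and b->c."""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# (parent, joint, child) index triples; indices assume Human3.6M ordering.
ANGLE_TRIPLES = {
    "left elbow":  (11, 12, 13),   # shoulder -> elbow -> wrist
    "right elbow": (14, 15, 16),
    "left knee":   (4, 5, 6),      # hip -> knee -> ankle
    "right knee":  (1, 2, 3),
}

def pose_to_prompt(keypoints_3d: np.ndarray, stride: int = 10) -> str:
    """Serialize subsampled per-frame joint angles into an LLM prompt."""
    lines = []
    for t in range(0, keypoints_3d.shape[0], stride):
        frame = keypoints_3d[t]
        angles = ", ".join(
            f"{name}: {joint_angle(frame[a], frame[b], frame[c]):.0f} deg"
            for name, (a, b, c) in ANGLE_TRIPLES.items()
        )
        lines.append(f"frame {t}: {angles}")
    return ("Describe the exercise step by step, per body trunk, "
            "given these joint angles:\n" + "\n".join(lines))
```

The resulting string would be passed to an LLM as the conditioning prompt; richer variants could add more triples (hips, shoulders, spine) or encode angular velocities across frames.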
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Approaches to low-resource settings, Data resources
Languages Studied: English