OctoNet: A Large-Scale Multi-Modal Dataset for Human Activity Understanding Grounded in Motion-Captured 3D Pose Labels
Keywords: multi-modal dataset, human activity understanding, human pose estimation, non-intrusive, wireless sensing
TL;DR: A multimodal human activity and pose dataset covering diverse sensing modalities, including acoustic, RF-based, vision-based, inertial, and physiological sensors
Abstract: We introduce OctoNet, a large-scale, multi-modal, multi-view human activity dataset designed to advance human activity understanding and multi-modal learning. OctoNet comprises 12 heterogeneous modalities (including RGB, depth, and thermal cameras, infrared arrays, audio, millimeter-wave radar, Wi-Fi, IMU, and more) recorded from 41 participants under multi-view sensor setups, yielding over 67.72M synchronized frames. The data encompass 62 daily activities spanning structured routines, freestyle behaviors, human-environment interactions, healthcare tasks, and more. Critically, all modalities are annotated with high-fidelity 3D pose labels captured via a professional motion-capture system, allowing precise alignment and rich supervision across sensors and views. OctoNet is one of the most comprehensive datasets of its kind, enabling a wide range of learning tasks such as human activity recognition, 3D pose estimation, multi-modal fusion, cross-modal supervision, and sensor foundation models. Extensive experiments with various baselines demonstrate the sensing capabilities of the individual modalities. OctoNet offers a unique and unified testbed for developing and benchmarking generalizable, robust models for human-centric perceptual AI.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/hku-aiot/OctoNet
Code URL: https://github.com/aiot-lab/OctoNet/tree/main
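A minimal sketch of how one might fetch the dataset files from the Hugging Face Hub linked above, using the standard `huggingface_hub` client. This is only an illustrative access example, not the authors' official loading pipeline; the repository ID is taken from the Dataset URL, and the actual directory layout and recommended loaders should be checked in the Code URL above.

```python
# Illustrative sketch: download the OctoNet dataset repository from the Hugging Face Hub.
# Assumes the repo at the Dataset URL is publicly accessible; the on-disk layout
# (per-modality folders, annotation files) is not specified here and should be
# verified against the official documentation in the Code URL.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="hku-aiot/OctoNet",   # repository ID from the Dataset URL
    repo_type="dataset",          # it is hosted as a dataset repo
)
print("Dataset files downloaded to:", local_dir)
```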
Supplementary Material: pdf
Primary Area: Other
Submission Number: 894