# EgoDex

EgoDex is a large-scale dataset and benchmark for egocentric dexterous manipulation collected on Apple Vision Pro. 

The dataset contains 829 hours of 30 Hz 1080p egocentric video with paired 3D pose annotations for the head, upper body, 
and hands, as well as natural language annotations. It consists entirely of active tabletop manipulation across 194 diverse tasks. 

As part of the supplementary material, we provide sample sequences from the dataset. Because of the double-blind review process 
and institutional requirements that make maintaining anonymity technically difficult, full data access and data loading code 
will be available for download upon request.

## Dataset Structure 

Each episode has a paired HDF5 file and MP4 file (e.g., `0.hdf5` and `0.mp4`).
The pose annotations at each frame of the MP4 file are contained in the corresponding HDF5 file. 

Each HDF5 file has the structure below, where `N` is the number of frames. 

```
camera
└──intrinsic            # 3 x 3 camera intrinsics. always the same in every file.

transforms              # all joint transforms, all below have shape N x 4 x 4.
└──camera               
└──leftHand             
└──rightHand            
└──leftIndexFingerTip
└──leftIndexFingerKnuckle
└──(64 more joints...)

confidences             # (optional) scalar joint confidences, all below have shape N.
└──leftHand
└──rightHand
└──(66 more joints...)
```

If the corresponding MP4 file is `T` seconds long, then `N = 30 * T`. The first transform of each joint corresponds to the first 
frame of the video. The file contains skeletal SE(3) pose data for all joints. Note that all transforms (including the camera extrinsics, 
`transforms/camera`) are expressed in the *ARKit origin frame*: a stationary frame on the ground set at the beginning of a recording 
session. Since this depends on device initialization, this world frame is not necessarily consistent across episodes 
(though it is stationary during an episode).
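
Because the world frame can differ between episodes, it is often more convenient to work with joint poses relative to the camera. The sketch below shows one way to do that with NumPy, assuming both inputs are `N x 4 x 4` arrays read from `transforms/camera` and from a joint such as `transforms/leftHand`. The function name `to_camera_frame` is hypothetical and not part of any released loading code.

```python
import numpy as np


def to_camera_frame(T_world_camera, T_world_joint):
    """Re-express joint poses relative to the camera at each frame.

    Both inputs have shape (N, 4, 4) and are expressed in the same world frame.
    Returns an (N, 4, 4) array where result[i] = inv(T_world_camera[i]) @ T_world_joint[i].
    """
    T_world_camera = np.asarray(T_world_camera, dtype=np.float64)
    T_world_joint = np.asarray(T_world_joint, dtype=np.float64)
    # np.linalg.inv and @ both operate on the last two axes, so this is batched over N.
    return np.linalg.inv(T_world_camera) @ T_world_joint
```

The result no longer depends on where the ARKit origin was placed, so camera-relative poses from different episodes can be compared directly.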

Language metadata annotations are stored as HDF5 file attributes. In Python, if `f` is the open HDF5 file, you can access the description with 
`f.attrs['llm_description']`. See `f.attrs.keys()` for the full list of available attributes.
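
As a minimal sketch, the attributes can be copied into a plain dictionary for inspection. The helper names `read_language_metadata` and `load_episode_metadata` are hypothetical. The bytes-to-string decoding is a precaution, since HDF5 string attributes may be returned as either `bytes` or `str` depending on how they were written; it is an assumption, not a statement about how these files are encoded.

```python
def read_language_metadata(attrs):
    """Copy HDF5 attributes into a plain dict, decoding bytes values to str."""
    metadata = {}
    for key in attrs.keys():
        value = attrs[key]
        if isinstance(value, bytes):
            value = value.decode("utf-8")
        metadata[key] = value
    return metadata


def load_episode_metadata(hdf5_path):
    """Open an episode file and return all of its attributes as a dict."""
    import h5py  # imported here so read_language_metadata works without h5py

    with h5py.File(hdf5_path, "r") as f:
        return read_language_metadata(f.attrs)
```

Calling `load_episode_metadata("0.hdf5")["llm_description"]` would then return the language description for that episode.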

## Sample Loading HDF5

Loading the HDF5 file is straightforward with a Python package such as `h5py`.

```python
import h5py

hdf5_file = "0.hdf5"  # path to an episode file
frame_id = 1          # 0-based frame index, so this is the second video frame

with h5py.File(hdf5_file, "r") as f:
    # Camera extrinsics at frame_id: SE(3) transform of the camera in the world frame
    cam_ext = f["/transforms/camera"][frame_id]
    # 3 x 3 camera intrinsics
    cam_int = f["/camera/intrinsic"][:]
    # SE(3) transform of leftHand in the world frame at frame_id
    T_world_leftHand_at_frame_id = f["/transforms/leftHand"][frame_id]
    # All SE(3) transforms of leftHand in the world frame, shape N x 4 x 4
    T_world_leftHand_all_frames = f["/transforms/leftHand"][:]
```
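
The intrinsics and extrinsics can be combined to overlay a joint on the video frame. The sketch below is illustrative only: it assumes an undistorted pinhole model in which the camera looks along its +z axis, which this README does not state. If the camera frame uses a different axis convention, the camera-frame point needs an axis flip before projection. The function name `project_to_pixel` is hypothetical.

```python
import numpy as np


def project_to_pixel(K, T_world_camera, p_world):
    """Project a 3D world point to pixel coordinates (u, v).

    K is the 3 x 3 intrinsics matrix, T_world_camera is the 4 x 4 camera pose,
    and p_world is a length-3 position, e.g. the translation T[:3, 3] of a joint.
    Assumes a pinhole camera with +z pointing forward (an assumption).
    Returns None if the point is not in front of the camera.
    """
    p_h = np.append(np.asarray(p_world, dtype=np.float64), 1.0)
    # Move the point from the world frame into the camera frame.
    p_cam = (np.linalg.inv(T_world_camera) @ p_h)[:3]
    if p_cam[2] <= 0:
        return None
    uvw = np.asarray(K, dtype=np.float64) @ p_cam
    return uvw[:2] / uvw[2]
```

With the variables loaded above, `project_to_pixel(cam_int, cam_ext, T_world_leftHand_at_frame_id[:3, 3])` would give the left hand's pixel location under these assumptions.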