Abstract: Following the success of deep convolutional networks, state-of-the-art methods for 3d human pose estimation have focused on deep end-to-end systems that predict 3d joint locations given raw image pixels. Despite their excellent performance, it is often not easy to understand whether their remaining error stems from a limited 2d pose (visual) understanding, or from a failure to map 2d poses into 3-dimensional positions. With the goal of understanding these sources of error, we set out to build a system that, given 2d joint locations, predicts 3d positions. Much to our surprise, we have found that, with current technology, “lifting” ground truth 2d joint locations to 3d space is a task that can be solved with a remarkably low error rate: a relatively simple deep feed-forward network outperforms the best reported result by about 30% on Human3.6M, the largest publicly available 3d pose estimation benchmark. Furthermore, training our system on the output of an off-the-shelf state-of-the-art 2d detector (i.e., using images as input) yields state-of-the-art results – this includes an array of systems that have been trained end-to-end specifically for this task. Our results indicate that a large portion of the error of modern deep 3d pose estimation systems stems from their visual analysis, and suggest directions to further advance the state of the art in 3d human pose estimation.
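To make the “lifting” setup concrete, below is a minimal PyTorch sketch of a feed-forward network that maps 2d joint locations to 3d positions. The layer width, the single residual block, and the use of batch normalization and dropout are illustrative assumptions rather than the paper's exact architecture; the name LiftingNetwork and its parameters are hypothetical.

```python
import torch
import torch.nn as nn

class LiftingNetwork(nn.Module):
    """Feed-forward network mapping 2d joint locations to 3d positions.

    Architectural details (hidden width, one residual block, batch norm,
    dropout) are illustrative assumptions, not the paper's exact design.
    """
    def __init__(self, num_joints=16, hidden=1024, dropout=0.5):
        super().__init__()
        # Input: (x, y) per joint, flattened to a vector of length 2 * num_joints.
        self.inp = nn.Linear(num_joints * 2, hidden)
        self.block = nn.Sequential(
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden),
            nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden),
            nn.ReLU(), nn.Dropout(dropout),
        )
        # Output: (x, y, z) per joint.
        self.out = nn.Linear(hidden, num_joints * 3)

    def forward(self, pose_2d):
        h = self.inp(pose_2d)
        h = h + self.block(h)  # residual connection around the hidden block
        return self.out(h)

# Usage: lift a batch of 2d poses (e.g., from a 2d detector) to 3d.
model = LiftingNetwork()
pose_2d = torch.randn(8, 16 * 2)   # batch of 8 poses, 16 joints each
pose_3d = model(pose_2d)           # shape: (8, 48)
```

The 2d input could come either from ground truth annotations (to measure lifting error in isolation) or from an off-the-shelf 2d detector (to build a full image-to-3d pipeline), matching the two settings the abstract describes.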