Above-Screen Fingertip Tracking and Hand Representation for Precise Touch Input with a Phone in Virtual Reality

Published: 13 May 2024, Last Modified: 28 May 2024 · GI 2024 · CC BY 4.0
Letter of Changes: We thank the reviewers for their comments and have revised our paper accordingly. We tried to shorten the paper without removing important information or sacrificing clarity, but adding the many details requested by the reviewers, along with the author information, unfortunately more than offset those efforts.

## Figures

We moved Figure 10 to a different location to improve clarity. We also reformatted the study result figures so that content is not cut off. For the application illustrations, we created a single Figure 11 with subfigures showing the different applications.

## Moving content from the participant ratings analysis to the discussion

We moved the last two sentences of that section, which speculate about how speed and accuracy could be improved, to the discussion. The rest mostly summarises participant comments and their interpretation, which we feel belongs in that section.

## Per-task time analysis in the user study

As requested by R2, we analysed tapping and tracing trials separately and report mean completion times for each trial type instead of total task completion time. The updated Figure 9 reflects these changes.

## Study result reporting

As requested by R2, we report time and error values with their standard deviations and concrete comparative qualifiers in the study results, as well as standard deviations for the RMSEs in the tracking-precision evaluation and for the end-to-end latency.

## Reason for not using touchscreen coordinates for the touch point

We briefly explain that choice at the end of Section 4.3, and we added a justification earlier, in the section about determining the touch point (4.2).

## Influence of skin colour

One of the contributors of training data, who was also a study participant, is dark-skinned, and we did not observe notable differences from the other (mostly Asian) participants. Skin colour mostly matters for the BASNet segmentation of the phalanges, but as with almost all machine learning techniques involving images of humans, ensuring that such systems work well for a diverse population depends on the diversity of the training data.

## Comparison with 2D overlay + marker

This would indeed be an interesting comparison, and there are no doubt many ways to improve the visual feedback of hands on the screen that could be compared with 3D hands. However, our study already included six conditions, and for the Phonetroller baseline we used the technique that was evaluated in that paper.

## Context missing in the introduction for hand representations optimised for precise touch input

We replaced that description with one of the specific hand models we use, i.e. cursor, stick fingers, and hands with (virtual) markers.

## Claim that the technique is self-supervised, based on two phones

We have tested it with a third phone and also softened that claim, but we note that images can always be preprocessed to align more closely with the images our BASNet segmenter expects. We further trained our BASNet model with various image augmentations to increase its robustness, so we do not think our initial claim was exaggerated.

## Claims that participants were affected by weight and latency

Participants did casually mention that the phone was heavy (we added that to the sentence). They did not explicitly mention latency, but responsiveness is an obvious factor influencing interaction performance, as demonstrated by prior work, which we cite.

## Mentioning the number of labelled images in the abstract and the conclusion

This is an implementation detail that does not fit well in a summary, but we mention the order of magnitude of labelled images (a few hundred) in the abstract and the conclusion to give an idea of the labelling requirements.

## Clarifications in subsection 3.2.2

- We used the full name of the distal interphalangeal joint instead of the acronym.
- We reformulated the sentences about the polynomial curve, the AND operation for the filter, and missed tap trials.
- We added short descriptions for m_proj, m_est, and Tip_proj.

Regarding data augmentation, this is a standard machine learning technique for improving model robustness with a given set of input images (see the illustrative sketch after this letter). The particular filters we use are standard filters with default parameters (which are sufficient in most cases) and can be changed by the developer. Those parameters are usually not reported, as they are not particularly significant.

## Was tapping/tracing triggered by screen events?

Touch events are captured by the phone's capacitive sensor, so yes, tapping and tracing are triggered by screen events.

## Definitions of sense of control for participants

We added those definitions.

## Additional ANOVAs after interaction effects

An interaction effect only shows that the effect of one independent variable depends on the level of the other. It does not provide information about simple main effects, which require fixing those levels for isolated analyses of the IVs.

## Unrelated footnote on page 2

We fixed that issue.
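To illustrate the kind of standard data augmentation discussed above, here is a minimal, hypothetical sketch of an image augmentation pipeline in PyTorch/torchvision. The specific transforms and parameter values are assumptions for illustration only, not the exact filters or defaults used for the paper's BASNet training.

```python
# Hypothetical sketch of standard image augmentation for training a
# segmentation model such as BASNet; transforms and parameters are
# illustrative, not the pipeline from the paper.
import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                 # mirror left/right
    T.ColorJitter(brightness=0.2, contrast=0.2,
                  saturation=0.2, hue=0.05),       # lighting / skin-tone variation
    T.GaussianBlur(kernel_size=5),                 # simulate slight defocus
    T.RandomAffine(degrees=10, translate=(0.05, 0.05),
                   scale=(0.9, 1.1)),              # small camera/pose variations
    T.ToTensor(),
])

# Usage: apply to each training image (a PIL.Image) before feeding the network.
# Note that for segmentation, geometric transforms (flip, affine) must be
# applied identically to the ground-truth mask; only photometric transforms
# can be applied to the image alone.
# from PIL import Image
# x = augment(Image.open("frame_0001.png"))
```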
Keywords: Virtual reality, hand pose estimation, mobile phone interaction
TL;DR: Deep learning technique to track thumb and index finger tips manipulating a mobile phone in VR using the phone's camera and two mirrors, with a study comparing different hand representations.
Abstract: Interacting with the touchscreen of a mobile phone in virtual reality (VR) is challenging because users cannot see their fingers when aiming for targets. We propose using two mirrors reflecting the front camera of the phone and a purpose-built deep neural network to infer the 3D position of fingertips above the screen. Network training is self-supervised after only a few hundred initial labelled images and does not require any external sensor. The inferred fingertip positions can be used to control different hand models and objects in VR. Controlled experiments evaluate tracking performance for single-finger touch input, and compare several 3D hand representations with a flat 2D overlay used in previous work. The results confirm the suitability of our fingertip tracker to aid precise tapping of small targets on the phone screen and provide insights about the effect of various hand representations on control and presence. Finally, we provide several application examples showing how 3D fingertip input can complement and extend phone-based touch interaction in VR.
Supplementary Material: zip
Video: zip
Submission Number: 52