Abstract: In the current era, communication on mobile devices is becoming more personalized with the evolution of touch-based input methods. While writing on touch-responsive devices, searching for an emoji that captures the true intent is cumbersome. Existing solutions address this by considering either the text or only the stroke-based drawing to predict the appropriate emoji; relying on a single input fails to leverage the full context. Moreover, while the user is digitally writing, it is challenging for the model to identify whether the intention is to write text or to draw an emoji, and the model's memory footprint and latency play an essential role in providing a seamless writing experience. In this paper, we investigate the effectiveness of combining text and drawing as inputs to the model. We present SAMNet, a multimodal deep neural network that jointly learns text and image features, where image features are extracted from the stroke-based drawing and text features from the previously written context. We also demonstrate the optimal way to fuse features from both modalities. The paper focuses on improving user experience and providing low latency on edge devices. We trained our model on a carefully crafted dataset of 63 emoji classes and evaluated its performance. We achieve a worst-case on-device inference time of 60 ms and 76.74% top-3 prediction accuracy with a model size of 3.5 MB. Compared with the closest matching application, DigitalInk, SAMNet provides a 13.95% improvement in top-3 prediction accuracy.
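The abstract does not specify how the two modalities are fused; the sketch below is only an illustration of one common option (late fusion by concatenation of per-modality embeddings followed by a classifier over the 63 emoji classes mentioned above). The class name FusionHead and the embedding sizes img_dim and txt_dim are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Hypothetical late-fusion head: concatenates a stroke-drawing embedding
    and a text-context embedding, then classifies over 63 emoji classes."""
    def __init__(self, img_dim=128, txt_dim=128, num_classes=63):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, img_feat, txt_feat):
        # img_feat: features from the stroke-based drawing encoder (assumed)
        # txt_feat: features from the previously written text encoder (assumed)
        fused = torch.cat([img_feat, txt_feat], dim=-1)
        return self.classifier(fused)

# Example: a batch of 4 samples with 128-d embeddings from each modality
logits = FusionHead()(torch.randn(4, 128), torch.randn(4, 128))
print(logits.shape)  # torch.Size([4, 63])
```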