Abstract: Understanding contextual information composed of both text and images is valuable for multimedia information processing. However, capturing such contexts is not trivial, especially when dealing with real datasets. Existing solutions such as ontologies (e.g., WordNet) focus mainly on individual terms and do not support identifying a group of terms that describes a specific context available at runtime. To our knowledge, there are very few solutions for integrating contextual information from both images and text. Furthermore, existing solutions do not scale because their tasks are computationally intensive, and they are prone to data sparsity. In this paper, we propose a semantic framework called VisContext, which is based on a contextual model that combines images and text. The VisContext framework is built on a scalable pipeline composed of the following primary components: (i) Natural Language Processing (NLP); (ii) feature extraction using Term Frequency-Inverse Document Frequency (TF-IDF); (iii) feature association using unsupervised learning algorithms, including K-Means clustering (KM) and Expectation-Maximization (EM); and (iv) validation of visual context models using supervised learning algorithms (Naïve Bayes, Decision Trees, Random Forests). The proposed VisContext framework has been implemented with Spark MLlib and CoreNLP. We evaluated the effectiveness of the framework in visual understanding with three large datasets (IAPR, Flick3k, SBU) containing more than one million images and their annotations. We report results on the discovery of contextual associations between terms and images, image context visualization, and context-based image classification.
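To make pipeline steps (i) to (iii) concrete, the following is a minimal sketch in plain Python, not the authors' Spark MLlib and CoreNLP implementation. It covers only tokenization, TF-IDF, and K-Means; the EM and supervised validation steps are omitted. The four captions, the function names (`tokenize`, `tfidf`, `kmeans`, `top_terms`), the whitespace tokenizer, the smoothed IDF form, and the deterministic centroid initialization are all assumptions made for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Stand-in for the NLP step (assumption): lowercase, whitespace split.
    return [w for w in text.lower().split() if w.isalpha()]

def tfidf(docs):
    """Return (vocabulary, dense TF-IDF vectors) for tokenized documents."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    # Smoothed IDF (assumed form): terms present in every document get weight 0.
    idf = {t: math.log((n + 1) / (df[t] + 1)) for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=10):
    # Assumption: the first k vectors seed the centroids so the run is reproducible.
    centroids = [list(v) for v in vectors[:k]]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

def top_terms(vocab, centroid, n=2):
    # The highest-weighted terms of a centroid act as the cluster's "context".
    ranked = sorted(zip(vocab, centroid), key=lambda p: (-p[1], p[0]))
    return [term for term, _ in ranked[:n]]

# Hypothetical image captions, invented for this sketch.
captions = [
    "a dog runs on the beach",
    "a car drives on the road",
    "a dog plays on the beach",
    "a red car parked on the road",
]

docs = [tokenize(c) for c in captions]
vocab, vectors = tfidf(docs)
assignments, centroids = kmeans(vectors, k=2)
contexts = [top_terms(vocab, c) for c in centroids]

for cluster_id, terms in enumerate(contexts):
    print(cluster_id, terms)
```

Each cluster's top-weighted terms form a group of co-occurring words, which is the kind of term group the abstract calls a context. In a distributed setting the same steps would map onto library feature-extraction and clustering stages rather than these hand-written loops.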