Abstract: For decades, we are interested in detecting objects and classifying them into a fixed vocabulary of lexicon. With the maturity of these low-level vision solutions, we are hunger for a higher-level representation of the visual data, so as to extract visual knowledge rather than merely bags of visual entities, allowing machines to reason about human-level decision-making and even manipulate the visual data at the pixel-level. In this tutorial, we will introduce a various of machine learning techniques for modeling visual relationships (e.g., subject-predicate-object triplet detection) and contextual generative models (e.g., generating photo-realistic images using conditional generative adversarial networks). In particular, we plan to start from fundamental theories on object detection, relationship detection, generative adversarial networks, to more advanced topics on referring expression visual grounding, pose guided person image generation, and context based image inpainting.
Loading