Keywords: image manipulation, predictive learning, relational network, cognitive learning, image generation
Abstract: This paper studies whether a perceptual visual system can exhibit human-like cognitive capabilities by training a computational model to predict the outcome of an action from a language instruction. The aim is to ground action words so that an AI can generate a synthetic image depicting the effect a given action has on a given object in the scene. This work combines an image encoder, a language encoder, a relational network, and an image generator to ground action words and then visualize the effect an action would have on a simulated scene. The focus of this work is learning meaningful shared image and text representations for relational learning and object manipulation.
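The abstract describes a four-module pipeline: an image encoder, a language encoder, a relational network that fuses the two representations, and an image generator that decodes the predicted scene. The following is a minimal sketch of that dataflow only; all dimensions, weights, and function names are illustrative assumptions, not the paper's actual architecture, and simple linear maps stand in for the trained neural modules.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (assumptions for illustration, not from the paper)
IMG_DIM, TXT_DIM, HID = 64, 32, 48

# Randomly initialized linear maps standing in for the trained modules
W_img = rng.standard_normal((IMG_DIM, HID)) * 0.1   # image encoder
W_txt = rng.standard_normal((TXT_DIM, HID)) * 0.1   # language (instruction) encoder
W_rel = rng.standard_normal((2 * HID, HID)) * 0.1   # relational network (fusion)
W_gen = rng.standard_normal((HID, IMG_DIM)) * 0.1   # image generator (decoder)

def predict_effect(image_feat, instr_feat):
    """Predict post-action image features from a scene and an instruction."""
    z_img = np.tanh(image_feat @ W_img)                       # encode the scene
    z_txt = np.tanh(instr_feat @ W_txt)                       # encode the action words
    z_rel = np.tanh(np.concatenate([z_img, z_txt]) @ W_rel)   # relate image and text
    return np.tanh(z_rel @ W_gen)                             # decode predicted scene

out = predict_effect(rng.standard_normal(IMG_DIM), rng.standard_normal(TXT_DIM))
print(out.shape)  # (64,)
```

In the paper's setting the output would be a full generated image rather than a feature vector; the sketch only shows how the encoded image and instruction are fused before decoding.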