Abstract: Generative Adversarial Networks (GANs) can produce images of surprising complexity and realism but are generally structured to sample from a single latent source ignoring the explicit spatial interaction between multiple entities that could be present in a scene. Capturing such complex interactions between different objects in the world, including their relative scaling, spatial layout, occlusion, or viewpoint transformation is a challenging problem. In this work, we compose a pair of objects in a conditional GAN framework using a novel self-consistent composition-by-decomposition network. Given object images from two distinct distributions, our model can generate a realistic composite image from their joint distribution following the texture and shape of the input objects. Our results reveal that the learned model captures potential interactions between the two object domains, and can output their realistic composed scene at test time.