- Abstract: In recent years, deep neural networks have been used mainly as discriminators in multimodal learning. Training them requires large amounts of labeled data, but such data is difficult to obtain because labeling inputs is labor-intensive. Semi-supervised learning, which improves discriminator performance by using unlabeled data, is therefore important. Among semi-supervised learning methods, those based on deep generative models such as variational autoencoders (VAEs) can be trained end-to-end and are known to achieve high accuracy. In this paper, we propose SS-HMVAE, a novel model for semi-supervised multimodal learning based on multimodal VAEs. Furthermore, to handle unimodal inputs at test time, we propose an extended model, SS-HMVAE-kl, which builds on existing studies on imputing missing values. Experiments confirm that the proposed model outperforms conventional unimodal and multimodal semi-supervised learning methods.