XFlow: Cross-modal Dataflow Neural Networks for Audiovisual Classification

04 Feb 2018 · ICLR 2018 Workshop Submission
Abstract: We propose two multimodal deep learning architectures that allow for cross-modal dataflow (XFlow) between several feature extractors, deriving more interpretable features and obtaining a better representation than through unimodal learning. These models can usefully exploit correlations between audio and visual data, which have different dimensionalities and are nontrivially exchangeable. Our work improves on existing multimodal research in two essential ways: (1) it presents a novel method for performing cross-modal dataflow, which could easily be generalised to other kinds of data, and (2) it extends previously proposed cross-connections, which only transfer information between streams that process compatible data. We also illustrate some of the representations learned by the connections and present Digits, a new dataset consisting of three audiovisual data types. Both architectures outperformed their baselines and achieved state-of-the-art results on AVletters and CUAVE.
TL;DR: Two novel multimodal deep learning architectures with cross-modal dataflow in the feature extraction phase. State-of-the-art results on three audiovisual classification benchmarks.
Keywords: machine learning, deep learning, multimodal, audiovisual, classification, cross-modal, cross-connections
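To make the idea of a cross-connection between streams of incompatible dimensionality concrete, below is a minimal PyTorch sketch of two hypothetical modules: one that projects a 2D visual feature map into a 1D audio-stream feature vector, and one that lifts a 1D audio feature vector into a 2D map. The class names, pooling/projection choices, and dimensions are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class Conv2DTo1DCrossConnection(nn.Module):
    """Hypothetical cross-connection: collapses a 2D visual feature map
    (B, C, H, W) into a vector compatible with a 1D audio stream."""
    def __init__(self, in_channels, out_features):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # collapse spatial dimensions
        self.proj = nn.Linear(in_channels, out_features)

    def forward(self, x):                     # x: (B, C, H, W)
        z = self.pool(x).flatten(1)           # -> (B, C)
        return self.proj(z)                   # -> (B, out_features)

class Dense1DTo2DCrossConnection(nn.Module):
    """Hypothetical inverse: lifts a 1D audio feature vector (B, F)
    to a 2D map that can be concatenated with visual feature maps."""
    def __init__(self, in_features, out_channels, height, width):
        super().__init__()
        self.shape = (out_channels, height, width)
        self.proj = nn.Linear(in_features, out_channels * height * width)

    def forward(self, x):                     # x: (B, in_features)
        return self.proj(x).view(-1, *self.shape)

# Example: exchange information between the two streams mid-network.
visual = torch.randn(8, 64, 14, 14)           # conv feature map (B, C, H, W)
audio = torch.randn(8, 128)                   # dense feature vector (B, F)

to_audio = Conv2DTo1DCrossConnection(64, 128)
to_visual = Dense1DTo2DCrossConnection(128, 64, 14, 14)

audio_aug = audio + to_audio(visual)          # inject visual info into audio stream
visual_aug = torch.cat([visual, to_visual(audio)], dim=1)  # concat along channels
```

The key design point this sketch illustrates is that, unlike cross-connections between compatible streams, exchanging audio and visual features requires an explicit dimensionality-changing transformation in each direction.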