Abstract: A key objective in multiview learning is to model the information common to multiple parallel views of a class of objects or events in order to improve downstream tasks such as classification and clustering. In this context, two open research challenges remain: scalability (how can we incorporate information from hundreds of views per event into a model?) and view-agnosticism (how can we learn robust multiview representations without knowledge of how the views are acquired?). In this work, we study a neural method based on multiview correlation that captures the information shared across a large number of views by subsampling them in a view-agnostic manner during training. We analyze the error of this bootstrapped multiview correlation objective using matrix concentration theory to provide an upper bound on the number of views to subsample for a given embedding dimension. Our experiments on a diverse set of audio and visual tasks (multi-channel acoustic activity classification, spoken word recognition, 3D object classification, and pose-invariant face recognition) demonstrate the robustness of view bootstrapping in modeling a large number of views. Results and analysis underscore the applicability of our method to a view-agnostic learning setting.