Efficient and Distributed Algorithms for Large-Scale Generalized Canonical Correlations Analysis

Xiao Fu, Kejun Huang, Evangelos E. Papalexakis, Hyun Ah Song, Partha Pratim Talukdar, Nicholas D. Sidiropoulos, Christos Faloutsos, Tom M. Mitchell

2016 (modified: 04 Mar 2025), ICDM 2016

Abstract: Generalized canonical correlation analysis (GCCA) aims at extracting common structure from multiple 'views', i.e., high-dimensional matrices representing the same objects in different feature domains; it extends classical two-view CCA. Existing (G)CCA algorithms have serious scalability issues, since they involve square root factorization of the correlation matrices of the views. The memory and computational complexity of this step grow quadratically and cubically, respectively, in the problem dimension (the number of samples / features). To circumvent such difficulties, we propose a GCCA algorithm whose memory and computational costs scale linearly in the problem dimension and the number of nonzero data elements, respectively. Consequently, the proposed algorithm can easily handle very large sparse views whose sample and feature dimensions both exceed 100,000, whereas current approaches can only handle thousands of features / samples. Our second contribution is a distributed algorithm for GCCA, which computes the canonical components of different views in parallel and can thus further reduce the runtime significantly (by ≥ 30% in experiments) if multiple cores are available. Judiciously designed synthetic experiments and real-data experiments on a multilingual dataset showcase the effectiveness of the proposed algorithms.
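
To make the scalability bottleneck concrete, the following is a minimal sketch of classical two-view CCA computed by whitening, i.e., the kind of baseline the abstract describes, not the proposed GCCA algorithm. The function names (`inv_sqrt`, `classical_cca`), the ridge term `reg`, and the synthetic data are assumptions made for illustration only. The d-by-d correlation matrices are formed densely (memory quadratic in the feature dimension d) and their inverse square roots are obtained by eigendecomposition (time cubic in d), which is the step that stops scaling for large views.

```python
import numpy as np


def inv_sqrt(C, eps=1e-8):
    """Inverse square root of a symmetric PSD matrix via eigendecomposition.

    This costs O(d^3) time and O(d^2) memory for a d-by-d matrix.
    """
    w, V = np.linalg.eigh(C)
    w = np.maximum(w, eps)          # guard against tiny/negative eigenvalues
    return (V / np.sqrt(w)) @ V.T   # V diag(w^{-1/2}) V^T


def classical_cca(X, Y, k, reg=1e-6):
    """Two-view CCA by whitening (illustrative baseline, hypothetical helper).

    X: (n, dx) and Y: (n, dy) hold the same n objects in two feature domains.
    Returns projection matrices A (dx, k), B (dy, k) and the top-k
    canonical correlations.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]

    # Dense correlation matrices: quadratic memory in the feature dimension.
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    # Square-root factorization step: cubic time in the feature dimension.
    Wx = inv_sqrt(Cxx)
    Wy = inv_sqrt(Cyy)

    # Singular values of the whitened cross-correlation are the
    # canonical correlations.
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy, full_matrices=False)
    A = Wx @ U[:, :k]
    B = Wy @ Vt[:k].T
    return A, B, s[:k]


# Small synthetic example: two views driven by a shared 2-D latent signal.
rng = np.random.default_rng(0)
n = 500
z = rng.standard_normal((n, 2))
X = z @ rng.standard_normal((2, 6)) + 0.1 * rng.standard_normal((n, 6))
Y = z @ rng.standard_normal((2, 5)) + 0.1 * rng.standard_normal((n, 5))

A, B, corrs = classical_cca(X, Y, k=2)
print("top canonical correlations:", corrs)
```

In this sketch the views have only a handful of features, so the dense factorization is cheap. With feature dimensions above 100,000, as in the setting the abstract targets, a single dense correlation matrix would already hold more than 10^10 entries, which is why an approach that touches only the nonzero data elements is attractive.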
