Chernoff Information in Community Detection
Abstract: In network inference applications, it is desirable to detect community structure, i.e., cluster vertices into potential blocks. Beyond adjacency matrices, many real-world networks also involve vertex covariates that may carry information about underlying block structure. Since accurate inference on random networks depends on exploiting all available signal, we need scalable algorithms that can incorporate both network connectivity data and additional insight from vertex covariates. In addition, it can be prohibitively expensive to observe the entire graph in many real applications, especially for large graphs. Thus it becomes essential to identify vertices that have the most impact on block structure and only check whether there are edges between them given a limited budget.
To assess the effects of vertex covariates on block recovery, we consider two model-based spectral algorithms. The first algorithm uses only the adjacency matrix, and directly estimates the block assignments. The second algorithm incorporates both the adjacency matrix and the vertex covariates into the estimation of block assignments. We employ Chernoff information to analytically compare the algorithms’ performance and derive the information-theoretic Chernoff ratio for certain models of interest. Analytic results and simulations suggest that the second algorithm is often preferred: one can better estimate the induced block assignments by first estimating the effect of vertex covariates. In addition, real data experiments also indicate that the second algorithm has the advantage of revealing underlying block structure while considering observed vertex heterogeneity in real applications.
Moreover, we propose a dynamic network sampling scheme to optimize block recovery for the stochastic blockmodel in the case where it is prohibitively expensive to observe the entire graph. Theoretically, we justify our proposed Chernoff-optimal dynamic sampling scheme via the Chernoff information. Practically, we evaluate the performance of our method on several real datasets from different domains. Both theoretical and practical results suggest that our method can identify the vertices that have the most impact on block structure, so that one need only check for edges between them, saving significant resources while still recovering the block structure.