An Efficient Search-and-Score Algorithm for Ancestral Graphs using Multivariate Information Scores for Complex Non-linear and Categorical Data

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC-ND 4.0
TL;DR: We propose an efficient search-and-score algorithm for ancestral graphs based on multivariate information scores.
Abstract: We propose a greedy search-and-score algorithm for ancestral graphs, which include directed as well as bidirected edges, the latter originating from unobserved latent variables. The normalized likelihood score of an ancestral graph is estimated in terms of multivariate information over relevant "$ac$-connected subsets" of vertices, $\boldsymbol{C}$, that are connected through collider paths confined to the ancestor set of $\boldsymbol{C}$. For computational efficiency, the proposed two-step algorithm relies on local information scores limited to the close surrounding vertices of each node (step 1) and each edge (step 2). Although this computational strategy is restricted to information contributions from $ac$-connected subsets containing up to two-collider paths, it is shown to outperform state-of-the-art causal discovery methods on challenging benchmark datasets.
Lay Summary: The likelihood function is a fundamental concept in machine learning, quantifying how "likely" a given model explains observed data. Consequently, selecting the model that maximizes likelihood provides the most plausible explanation for the data, when no prior information about possible models is available. Typically, identifying the best explanatory model involves maximizing likelihood across a set of candidate models. For directed acyclic graph (DAG) models—structures where variables are represented as nodes connected by directed edges without forming cycles—the global likelihood function conveniently decomposes into local likelihood terms, each involving one observed variable and its parent nodes. However, in practice, not all relevant variables may be observed in the dataset. This paper addresses this limitation by extending the likelihood formulation to handle DAGs with unobserved variables. Such hidden variables introduce edges with two arrowheads, indicating an unobserved common cause between observed variables. We show that the likelihood for these generalized "ancestral graphs" similarly decomposes into local contributions involving specific subsets of observed variables, and we propose an estimation of these local likelihood contributions directly from observed data. We also introduce an efficient search-and-score algorithm, which does not assume simple linear relations between variables (unlike most other state-of-the-art methods), thereby providing a causal discovery method with hidden variables for the complex non-linear and categorical data that are common in real-world applications.
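As a rough illustration of the multivariate information quantities underlying the score (this is a minimal sketch for discrete data, not the paper's implementation; the linked repository contains the actual method), the interaction information over a set of variables can be estimated from empirical joint entropies by inclusion-exclusion. All function names below are illustrative:

```python
from itertools import combinations
from collections import Counter
import math

def entropy(data, cols):
    """Empirical Shannon entropy (in bits) of the joint distribution
    over the given columns of a list-of-tuples dataset."""
    n = len(data)
    counts = Counter(tuple(row[c] for c in cols) for row in data)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def interaction_information(data, cols):
    """Multivariate (interaction) information over `cols`, computed by
    inclusion-exclusion over joint entropies of all non-empty subsets.
    With this sign convention it reduces to the usual mutual
    information I(X;Y) = H(X) + H(Y) - H(X,Y) for two variables."""
    total = 0.0
    for r in range(1, len(cols) + 1):
        for sub in combinations(cols, r):
            total += (-1) ** (r + 1) * entropy(data, sub)
    return total

# Two perfectly correlated binary variables share 1 bit of information;
# two independent uniform binary variables share none.
copied = [(0, 0), (1, 1), (0, 0), (1, 1)]
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(interaction_information(copied, (0, 1)))       # 1.0
print(interaction_information(independent, (0, 1)))  # 0.0
```

Sign conventions for interaction information beyond two variables differ across the literature; the paper's normalized likelihood score additionally restricts these contributions to $ac$-connected subsets, which this sketch does not attempt to capture.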
Link To Code: https://github.com/miicTeam/miicsearchscore
Primary Area: Probabilistic Methods->Structure Learning
Keywords: causal discovery, search-and-score structure learning, latent variable, multivariate information
Submission Number: 10345