Keywords: Interpretability, Generative Flow Networks, GFlowNets, Structured generative models, Molecular design, Drug discovery, Reaction graph modeling, Gradient-based saliency, Counterfactual analysis, Sparse autoencoders, Concept attribution, Latent representations, Trustworthy AI, Scientific machine learning, Medicinal chemistry
Abstract: Deep generative models are increasingly applied to molecular design in drug discovery, where they explore vast
chemical spaces while respecting synthesizability constraints. SynFlowNet [1], a hierarchical Generative Flow Network
(GFlowNet), addresses this challenge by constructing molecules through sequential reaction templates and building
blocks. Yet, the internal representations and decision policies of such models remain opaque, limiting interpretability
and trust. From an ML perspective, hierarchical GFlowNets are an emerging class of structured generative models,
and their interpretability remains largely unexplored. Bridging this gap advances transparency in generative ML while
creating methods that extend beyond the chemistry domain.
We introduce a unified interpretability framework for reaction graph GFlowNets that adapts modern ML analysis tools
to scientific generative models. Our approach integrates two complementary perspectives. First, gradient-based
saliency with counterfactual analysis: we compute gradients of action log-probabilities and map them into atom-level
heatmaps. To move beyond correlation, we perturb chemical motifs with SMARTS-based masking and quantify
probability shifts, yielding both attribution maps and causal evidence for which substructures drive decisions. Second,
concept attribution in latent space: we train sparse autoencoders (SAEs) and linear probes on SynFlowNet
embeddings to uncover interpretable factors and motifs encoded by the model.
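As a minimal sketch of the sparse-autoencoder component, the following trains a single-hidden-layer SAE with an L1 sparsity penalty on toy Gaussian vectors standing in for SynFlowNet embeddings; the embedding dimension, latent width, penalty, and learning rate are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: 256 "embeddings" of dimension 32 (real inputs would come
# from SynFlowNet's policy network; all sizes here are assumptions).
X = rng.normal(size=(256, 32))

d_in, d_hidden = X.shape[1], 128   # overcomplete latent, assumed width
l1, lr = 1e-3, 1e-2                # sparsity penalty and step size, assumed

W_enc = rng.normal(scale=0.1, size=(d_in, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_in))
b_dec = np.zeros(d_in)

def forward(X):
    z = np.maximum(X @ W_enc + b_enc, 0.0)   # ReLU code (sparse features)
    return z, z @ W_dec + b_dec              # code, reconstruction

_, X_hat0 = forward(X)
mse0 = np.mean((X_hat0 - X) ** 2)            # reconstruction error pre-training

# Plain gradient descent on  mean ||X_hat - X||^2 + l1 * mean |z|
for step in range(500):
    z, X_hat = forward(X)
    g_Xhat = 2 * (X_hat - X) / len(X)
    g_Wdec = z.T @ g_Xhat
    g_bdec = g_Xhat.sum(0)
    g_z = (g_Xhat @ W_dec.T + l1 * np.sign(z) / len(X)) * (z > 0)
    g_Wenc = X.T @ g_z
    g_benc = g_z.sum(0)
    for p, g in [(W_enc, g_Wenc), (b_enc, g_benc),
                 (W_dec, g_Wdec), (b_dec, g_bdec)]:
        p -= lr * g

z, X_hat = forward(X)
mse = np.mean((X_hat - X) ** 2)
sparsity = np.mean(z > 0)          # fraction of active latent units
```

After training, individual latent units can be probed (e.g. with linear regression against properties such as size or lipophilicity) to test whether they align with interpretable chemical axes.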
We find that SynFlowNet, when trained with QED (drug-likeness) as the reward, does not encode it in a single
latent dimension. Instead, sparse autoencoders disentangle QED into interpretable axes such as size, polarity, and
lipophilicity, which are more linearly predictable than QED itself. Linear probes accurately detect chemically
meaningful motifs (e.g., functional groups, rings, halogens), showing that domain concepts are directly recoverable.
Counterfactual analyses further improve QED optimization by identifying and altering reward-critical substructures.
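The counterfactual probability-shift computation described above can be sketched as follows, using a toy linear-softmax "policy" over atom features and a hand-picked atom index set standing in for a SMARTS match; the real pipeline would use SynFlowNet's policy network and RDKit substructure matching, so every name and size here is an assumption:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins: a molecule as an (atoms x features) matrix and a
# linear-softmax policy over 5 candidate actions.
atom_feats = rng.normal(size=(12, 8))       # 12 atoms, 8 features (assumed)
W = rng.normal(size=(8, 5))                 # policy weights (assumed)

def action_log_probs(feats):
    logits = feats.mean(axis=0) @ W         # mean-pool atoms, score actions
    logits -= logits.max()                  # numerically stable log-softmax
    return logits - np.log(np.exp(logits).sum())

motif_atoms = [3, 4, 5]                     # atoms a SMARTS pattern would match

masked = atom_feats.copy()
masked[motif_atoms] = 0.0                   # "delete" the motif's features

base = action_log_probs(atom_feats)
cf = action_log_probs(masked)
shift = base - cf                           # per-action log-probability shift
```

A large |shift| for an action indicates the masked substructure is causally implicated in that decision, which is the evidence the attribution heatmaps alone cannot provide.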
By combining saliency, counterfactuals, and concept attribution, our framework offers the first toolkit for interpreting
GFlowNets in molecular design. This not only demonstrates how interpretability methods from vision and language
can be extended to structured generative models in ML, but also provides actionable insights for medicinal chemists,
helping bridge model behavior to chemical reasoning and accelerating drug discovery.
[1] Miruna Cretu et al., SynFlowNet: Design of Diverse and Novel Molecules with Synthesis Constraints, arXiv preprint, 2024.
Submission Number: 177