%\documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{amssymb}
\usepackage{bbm}
\usepackage{balance}
\usepackage{algorithm}
\usepackage[noend]{algpseudocode}   
\usepackage{pgfplots}
\usepackage{amsfonts}
\usepackage{subcaption}
\usepackage{float}
\usepackage{enumitem}
\let\proof\relax
\let\endproof\relax
\usepackage{amsthm} %http://ctan.org/pkg/amsthm
\newtheorem{theorem}{Theorem}
\newtheoremstyle{exampstyle}
  {\topsep} % Space above
  {\topsep} % Space below
  {} % Body font
  {} % Indent amount
  {\bfseries} % Theorem head font
  {.} % Punctuation after theorem head
  {.5em} % Space after theorem head
  {} % Theorem head spec (can be left empty, meaning `normal')
\theoremstyle{exampstyle} \newtheorem{example}{Example}
\theoremstyle{exampstyle} \newtheorem{remark}{Remark}
\theoremstyle{exampstyle} \newtheorem{definition}{Definition}
\theoremstyle{exampstyle} \newtheorem{lemma}{Lemma}
\theoremstyle{exampstyle} \newtheorem*{lemma*}{Lemma}
\renewcommand{\qedsymbol}{}

% LINQS
\usepackage{macros}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Learning Explainable Templated Graphical Models}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<vembar@ucsc.edu>?Subject=Learning Explainable Templated Graphical Models}{Varun Embar *\thanks{Equal contribution}}{}}
\author[1]{Sriram Srinivasan *}
\author[1]{Lise Getoor}
% Add affiliations after the authors
\affil[1]{%
    Dept. of Computer Science and Engineering \\
    University of California, Santa Cruz \\
    USA
}

  \begin{document}
\maketitle

\begin{abstract}
     Templated graphical models (TGMs) encode model structure using rules that capture recurring relationships between multiple random variables.
    While the rules in TGMs are interpretable, it is not clear how they can be used
    to generate explanations for the individual predictions of the model.
    Further, learning these rules from data comes with high computational costs: it typically requires an expensive combinatorial search over the space of rules and repeated optimization over rule weights.
    In this work, we propose a new structure learning algorithm,    \emph{Explainable Structured Model Search (\SL)}, 
   that learns a templated graphical model and an explanation framework for its predictions.
    \SL~uses a novel search procedure to efficiently search the space of models and discover models that trade off predictive accuracy and explainability.
    We introduce the notion of \textit{relational stability} and prove that our proposed explanation framework is stable.
    Further, our proposed piecewise pseudolikelihood (PPLL) objective does not require re-optimizing the rule weights across models during each iteration of the search.
    In our empirical evaluation on three real-world datasets, we show that our proposed approach not only discovers models that are explainable, but also significantly outperforms existing state-of-the-art structure learning approaches.
\end{abstract}


\section{Introduction}
\label{sec:introduction}
Templated graphical models (TGMs), a class of probabilistic graphical models that are represented by parameterized potential functions, often use rules or probabilistic constraints to define the model.
The templates encode the probabilistic dependencies between random variables (RVs) and are instantiated many times within the model \citep{koller:book09}.
TGMs have been successfully applied in many domains including computational biology~\citep{segal:bio01}, knowledge base completion~\citep{jiang:icdm12}, text mining~\citep{beltagy:acl14}, and computer vision~\citep{aditya:uai18}.
Learning the components of these models (rules and constraints) directly from the data is known
as \textit{structure learning} \citep{kok:icml09, khot:icdm11, mihalkova:icml07}.
However, it poses several computational challenges.
First, the model space is potentially infinite and, even when restricted to be finite, results in a large combinatorial search.
Second, approaches that iteratively grow a set of rules require many costly rounds of parameter estimation.
Finally, scoring the model often involves computing the model likelihood, which is typically intractable to evaluate exactly.

In addition to predictive performance, there is a growing interest in generating explanations
\citep{wang:chi19,adadi:ieee18,zhao:uai21,watson:uai21}.
Models that provide explanations lead to increased user trust and have also been shown to be more persuasive \citep{tintarev:icdew07,  ribeiro:kdd16, alvarez:nip18, doshi:17, zhang:sigir14, wang:sigir18}.
Explanations can also help isolate and identify incorrect assumptions and biases learned by the model.
While TGMs are more interpretable than other large graphical models, generating explanations for individual predictions that satisfy certain desired properties is still challenging.
Further, not all rules that are included in the model can be explained to the end user.
When learning a model from the data, there may be a need to trade off accuracy and end-user explainability.

In this work, we propose a novel approach, \emph{explainable structured model search} (\SL), that learns an explainable templated graphical model automatically from data.
Our proposed approach leverages probabilistic soft logic (PSL)~\citep{bach:jmlr17}, a TGM defined using a set of weighted first-order logic rules.
Unlike other TGMs that use Boolean logic, PSL uses \L{}ukasiewicz logic, a continuous relaxation of Boolean logic, and can incorporate real-valued data such as similarity metrics and confidence scores.
Our \SL~approach searches the model space effectively using meta templates that capture common rule patterns.
Our proposed structure learning approach utilizes an efficient weight learning strategy that minimizes the need to re-optimize rule weights across models during the search.
We also introduce an effective learning objective for PSL that assigns importance weights for rules and eliminates non-informative rules.
Our approach uses an explainability score that biases the search to learn explainable models.
After learning a model, we generate explanations for each of the predictions using these rules.
The use of human-interpretable rules ensures our explanations satisfy the property of \textit{explicitness} \citep{alvarez:nip18}. 
The continuous nature of inferred values allows us to identify the ``true'' explanations, a property called \textit{faithfulness}.
Further, we extend the property of \textit{stability} for the relational setting and show that the proposed explanation strategy is stable.

The main contributions of our work include:
1) We propose a novel structured search approach that efficiently discovers a templated graphical model using meta templates that best capture the statistical dependencies in the data; 2) We introduce an efficient weight learning strategy based on a piecewise pseudolikelihood objective that allows parallelization and requires weights for a meta template to be learned only once across models; 3) Using an explainability parameter, our learning approach generates models that trade off accuracy and end-user explainability of their predictions; 4) We propose a new Fisher score-based ranking algorithm that identifies the best explanation for a prediction and theoretically show that this is stable; and 
5) We empirically show that the discovered models using our proposed approach outperform models generated using state-of-the-art methods.

\section{Related Work}
\label{sec:related}
Our approach builds on a large body of existing work. Here, we give a brief overview of structure learning in templated graphical models and related work in explainability.

\textbf{Structure Learning:} 
Many algorithms have been proposed to learn Markov Logic Networks \citep{richardson:mlj06}, a class of discrete TGMs.
Bottom-up approaches generate informative clauses by using relational paths to capture patterns and motifs in the data \citep{mihalkova:icml07,kok:icml09,kok:icml10}.
Most recently, MLN structure learning has been viewed from the perspectives of moralizing learned Bayesian networks \citep{khosravi:aaai10} and functional gradient boosting \citep{khot:icdm11}. These methods improve scalability while maintaining predictive performance.
Structure learning methods specific to a task of interest use inductive logic programming \citep{muggleton:ilp91} to generate clauses which are pruned with L1-regularized learning \citep{huynh:icml08,huynh:ecml11} or perform iterative local search \citep{biba:ilp08} to refine rules.
For PSL, a reinforcement learning based approach has been proposed \citep{zhang:ijcai19}.
Our approach builds on these approaches and the concept of meta templates \citep{rocktaschel:nips17,wang:ijcnlp15,weber:acl19} to learn a model and also generate explanations.
	
\textbf{Explainability:} 
Explainable AI (XAI) is a fast-growing area of research \citep{ehsan:chi21,arrieta:if20,gade:kdd19}.
Explainable models can be broadly classified into model-intrinsic methods and model-agnostic methods. 
Model-intrinsic approaches such as \citet{catherine:recsys16,kouki:iui19,al:jmlr20} use interpretable models that are easy to explain.
Model-agnostic or post-hoc explanations such as \citet{ribeiro:kdd16, peake:kdd18,yang:icdm18} consider the model as a black box and generate explanations from the output.
Our proposed approach is a model-intrinsic method that learns an interpretable PSL model.
Several gradient-based and perturbation-based explanations have been proposed by \citet{bach:plos15,zeiler:eccv14,shrikumar:icml17,wolf:aaai2019} for deep learning models. 
\citet{sundararajan:icml17} proposed the notion of integrated gradients that satisfy the properties of sensitivity and implementation invariance.
In this work, we propose a similar approach for templated graphical models.

\section{Background}
\label{sec:background}
Probabilistic soft logic is a TGM that defines a hinge-loss Markov random field (HL-MRF) \citep{bach:jmlr17}. 
The templates are weighted logical rules that encode statistical dependencies and structural constraints.
HL-MRFs support modeling of multi-relational data and use a continuous relaxation of discrete logic to generate continuous RVs in the range $[0,1]$.
This allows PSL to incorporate information such as similarity measures.
This also makes inference of unobserved RVs efficient and scalable, which is crucial for large-scale probabilistic reasoning.
PSL has been used successfully in several domains including natural language processing \citep{beltagy:acl14}, social media analysis \citep{johnson:coling16, ebrahimi:emnlp16} and recommender systems \citep{kouki:recsys15}.
As an example, consider the following rule present in a typical recommender system. 
\begin{align*}
    w: \pslpred{SimItem}(\pslarg{I_1}, \pslarg{I_2}) \land \pslpred{Rating}(\pslarg{U}, \pslarg{I_1}) \implies \pslpred{Rating}(\pslarg{U}, \pslarg{I_2}) 
\end{align*}
The rule suggests similar items are rated similarly. 
Here, $\pslpred{SimItem}$ is a \textbf{predicate} that encodes the similarity between two items $\pslarg{I_1}$ and $\pslarg{I_2}$, the predicate $\pslpred{Rating}$ encodes the rating assigned to the item by the user $\pslarg{U}$ and $w$ denotes the \textbf{weight} of the rule which determines its importance.
The \textbf{variables} $\pslarg{I_1}, \pslarg{I_2}, \pslarg{U}$ range over the \textbf{constants} in a domain.
The number of arguments of a predicate is called the \textbf{arity} of the predicate.
The predicate together with the list of variables is called an \textbf{atom}.
The set of predicates under consideration is denoted by $\mathbf{P}$.
Given a set of users \{Alice, Bob\} and a set of movies \{Legend, Taps\}, PSL generates \textbf{ground rules} by substituting variables in the rules with constants. An example of a ground rule is as follows: 
\begin{align*}
\begin{split}
    w: \pslpred{SimItem}& (Legend, Taps)\land 
    \pslpred{Rating}(Alice, Legend) \\ &\implies \pslpred{Rating}(Alice, Taps) 
\end{split}
\end{align*}
The atoms in a ground rule are called \textbf{ground atoms} (e.g. \pslpred{SimItem}(Legend, Taps)).
A \textbf{PSL Model} (denoted by $\mathbf{M}$) is a set of weighted rules $\{r_1, r_2, \ldots, r_n \}$.  
Using the model $\mathbf{M}$ and a set of ground atoms, PSL generates an HL-MRF.
PSL associates an RV with each ground atom.
RVs with observed values are called \textbf{observed RVs} ($\mathbf{X}$) and those with unobserved values are called \textbf{unobserved RVs} ($\mathbf{Y}$).
These unobserved RVs correspond to the target predicates whose values we wish to infer.
For the ground rule mentioned above, let $X_1, Y_1, Y_2$ be the RVs associated with the ground atoms $\pslpred{SimItem}(Legend, Taps)$, $\pslpred{Rating}(Alice, Legend)$, $\pslpred{Rating}(Alice, Taps)$.
Then each ground rule is mapped to a hinge-loss potential $\phi$ using \L{}ukasiewicz logic.
For the ground rule mentioned above, the hinge-loss potential is given by $\phi(\mathbf{Y}, \mathbf{X}) = \max \{X_{1} + Y_{1} - Y_{2} - 1, 0\}^p$.
In this work we consider $p=2$, which results in squared hinge-loss potentials.
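\begin{example}
As an illustration with hypothetical values (chosen for this example only, not taken from any dataset), suppose $X_1 = 0.9$, $Y_1 = 0.8$, and $Y_2 = 0.5$.
With $p=2$, the potential of the ground rule above evaluates to
\begin{align*}
\phi(\mathbf{Y}, \mathbf{X}) = \max \{0.9 + 0.8 - 0.5 - 1, 0\}^2 = 0.2^2 = 0.04 .
\end{align*}
If instead $Y_2 = 0.7$, then $X_1 + Y_1 - Y_2 - 1 = 0$ and the potential is zero, so the ground rule incurs no penalty.
\end{example}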

Given the set of observed and unobserved RVs $\mathbf{X,Y}$, and the set of potentials $\mathbf{\Phi}$, PSL defines a conditional probability distribution over $\mathbf{Y}$ given $\mathbf{X}$ as follows:
\begin{equation}
\label{eq:hlmrf}
\begin{split}
P(\mathbf{Y}|\mathbf{X}) = & \frac{1}{Z(\mathbf{X})} \exp(-E(\mathbf{Y},\mathbf{X})) \\
\text{where}\ E(\mathbf{Y},\mathbf{X}) = &\sum_j \mathbf{w}_j \mathbf{\Phi}_j(\mathbf{Y}, \mathbf{X})\\
Z(\mathbf{X}) = \int_\mathbf{Y} & \exp(-E(\mathbf{Y},\mathbf{X}))
\end{split}
\end{equation}
Here, $j$ iterates over all the ground rules, and $\mathbf{w}$ gives the rule weights.
The function $E$ is called the \textbf{energy function}.

\section{Explainable Templated Graphical Models}
\label{sec:problem_definition}
Explanations are human-understandable artifacts that provide qualitative understanding of the relationship between the data, the model's internal state, and the predictions \citep{ribeiro:kdd16, wolf:aaai2019}.
Explanations can either be generated a posteriori, where the model is viewed as a black box, or  generated by the model internally along with its predictions.
A good explanation must satisfy three properties: \textit{explicitness}, \textit{faithfulness} and \textit{stability} \citep{alvarez:nip18}.
Explicitness means that the generated explanation is interpretable by the user.
Faithfulness means that the generated explanation is relevant to the prediction.
Finally, stability means that the generated explanation does not change drastically for small changes in the input features.
The predictions in a TGM depend on the ground rules present in the model.
Since these ground rules are human-interpretable, they can be used as explanations.

In a non-relational setting, an explanation is typically a function of the input features.
In the relational setting, the generated explanations depend on other observed and unobserved RVs. 
A stable explanation should not change drastically when the values of other RVs change.
We refer to this as \textit{relational stability}.
We formally define this by extending the framework in  \citet{wolf:aaai2019} to a relational setting.

Let $M$ be a model that predicts the values for the unobserved RVs $\mathbf{Y}$ given the observed RVs $\mathbf{X}$, denoted by $M(\mathbf{X},\mathbf{Y})$. 
For example, in PSL, the model infers the values of $\mathbf{Y}$ by identifying the mode of the distribution, i.e., $M(\mathbf{X},\mathbf{Y}) = \arg \max_{\mathbf{Y}} P(\mathbf{Y}|\mathbf{X})$.
Let $\mathbf{G_i}$ denote the set of possible explanations for an RV $\mathbf{Y_i}$.
\begin{definition}
\textbf{Explaining function:} An explaining function, denoted by $f$, produces an importance score of an explanation in $\mathbf{G_i}$ for the inferred value of $\mathbf{Y_i}$. 
\end{definition}

\begin{definition}
\textbf{Relational Stability}: Let $M$ be a model and $f$ be an explaining function. Let $\mathbf{X,Y}$ be the set of observed and unobserved RVs and $\mathbf{G_i}$ be the set of possible explanations for the RV $\mathbf{Y_i}$.
We say that $f$ is stable with respect to $M$, if for any two $\mathbf{X_1}, \mathbf{X_2}$ that differ in a single RV $X_k$ by at most $\epsilon$, $\exists \delta \in \mathbbm{R}$ such that: 
 \begin{equation}
	 \begin{split}
	 \forall \mathbf{i}\forall g \in \mathbf{G_i}, &|f(\mathbf{X_1}, M(\mathbf{X_1}, \mathbf{Y}), g) - f(\mathbf{X_2}, M(\mathbf{X_2}, \mathbf{Y}), g)| \\
		 & \leq \delta
	 \end{split}
 \end{equation}
\end{definition}
The above definition states that the explaining function's score for every explanation, across all predictions, does not vary much when the value of one of the observed RVs is changed by a small amount. 

Having defined relational stability, we now define the task of learning explainable templated graphical models.
\begin{definition} \textbf{Learning explainable templated graphical models:}
Given a set of predicates $\mathbf{P}$ along with a target predicate $P_T \in \mathbf{P}$ that we need to infer, the task of learning an explainable templated graphical model involves two subtasks:
1) The {\bf structure learning} subtask involves discovering a templated model $\mathbf{M}$ that is then used to infer the values of $\mathbf{Y}$ that belong to the predicate $P_T$, and 2) The {\bf explanation} subtask involves generating and ranking the explanations for each of the inferred values of $\mathbf{Y}$ using an explaining function $f$ that satisfies the three properties of explicitness, faithfulness and relational stability.
\end{definition}

\section{Learning Explainable Templated Graphical Models}
\label{sec:sl}
Learning an explainable TGM directly from data poses three main challenges.
First, even after restricting the rule length and the size of the model, it involves a combinatorial search and the possible set of models is very large.
Second, the search over the space of models involves estimating the weights of the rules many times, which is costly.
Finally, not all predicates may be interpretable by the end-user.

To overcome these challenges, we introduce the notion of a \emph{meta template} and propose a novel likelihood function, piecewise pseudolikelihood (PPLL), to learn the weights of the inferred rules.
We also incorporate an \textit{explainability bias} that learns a more interpretable model.

\subsection{Meta template}
Meta templates guide the search by capturing common statistical relational patterns present in the data across a wide range of domains.
Further, they restrict the search space by ensuring that the domains and ranges of the predicates are taken into consideration.
The concept of a meta template has been proposed for tasks such as predicate learning~\citep{muggleton:mlj15}, information and relation extraction~\citep{wang:acl15}, question answering~\citep{weber:acl19}, and in Neural Theorem Provers~\citep{rocktaschel:nips17}.

\begin{definition}
\textbf{Meta template:} A meta template has slots in place of predicates and encodes the variable bindings between the predicates. Filling the slots with predicates results in a rule.
\end{definition}
Consider the following  meta template that can be used to combine or fuse information from multiple sources: 
$\blank(\pslarg{A}, \pslarg{B}) \implies \pslpred{P_T}(\pslarg{A}, \pslarg{B})$.
Here, $\blank$ is a slot that can be filled by a predicate that has the same domain and range as the target predicate. 
For example, in a hybrid recommender system~\citep{kouki:recsys15}, we can incorporate the outputs of standalone recommender systems such as non-negative matrix factorization ($\pslpred{NMF}$) and collaborative filtering ($\pslpred{CF}$) into our model using this meta template.  
The rule generated by filling the slot with $\pslpred{NMF}$ is given by $\pslpred{NMF}(\pslarg{U}, \pslarg{I}) \implies \pslpred{Rating}(\pslarg{U}, \pslarg{I}) $.

We propose four meta templates that capture a wide variety of useful patterns in relational domains. 
Additional meta templates that generate domain-specific rules can also be incorporated into our approach.

\noindent \textbf{Path Template:} The path template is the most common meta template and can capture relational patterns such as transitivity. 
Each slot in the template must be filled with a predicate of arity two. A path template of size two has the following structure: $\blank(\pslarg{A}, \pslarg{B}) \psland \blank(\pslarg{B}, \pslarg{C}) \implies \pslpred{P_T}(\pslarg{A}, \pslarg{C})$.
For example, the notion of triadic closure used in social network analysis can be generated from the path template and is given by: $\pslpred{Friends}(\pslarg{A}, \pslarg{B}) \psland \pslpred{Friends}(\pslarg{B}, \pslarg{C}) \implies \pslpred{Friends}(\pslarg{A}, \pslarg{C})$.
Similarly, path templates of size three and higher can be defined.
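As a sketch, and assuming that longer paths simply chain one additional slot per extra hop (an assumption made here for illustration, since only the size-two form is stated explicitly), a path template of size three could be written as: $\blank(\pslarg{A}, \pslarg{B}) \psland \blank(\pslarg{B}, \pslarg{C}) \psland \blank(\pslarg{C}, \pslarg{D}) \implies \pslpred{P_T}(\pslarg{A}, \pslarg{D})$.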

\noindent \textbf{Similarity Template:} The similarity template captures the relationship between multiple target instances.
Each slot in the template must be filled with a predicate of arity two. The template has the following structure: $\blank(\pslarg{A}, \pslarg{B}) \psland \blank(\pslarg{C}, \pslarg{A}) \implies \pslpred{P_T}(\pslarg{C}, \pslarg{B})$.
For example, a similarity-based rule of the kind used in collaborative filtering can be generated from this template and is given by: $\pslpred{SimilarItem}(\pslarg{I_1}, \pslarg{I_2}) \psland \pslpred{Rating}(\pslarg{U}, \pslarg{I_1}) \implies \pslpred{Rating}(\pslarg{U}, \pslarg{I_2})$.

\noindent \textbf{Local Template:} The local template can integrate information from multiple sources and has the following three structures: $\blank(\pslarg{A}, \pslarg{B}) \implies \pslpred{P_T}(\pslarg{A}, \pslarg{B}) ; 
\blank(\pslarg{B}) \implies \pslpred{P_T}(\pslarg{A}, \pslarg{B}) ;
\blank(\pslarg{A}) \implies \pslpred{P_T}(\pslarg{A}, \pslarg{B})$.
In addition to our earlier hybrid recommender example, consider the case of fusing multiple classifiers such as $\pslpred{RandomForest}$ and $\pslpred{NeuralNetworks}$ for the task of entity resolution. We could incorporate them into our model by rules such as: 
$\pslpred{RandomForest}(\pslarg{U_1}, \pslarg{U_2}) \implies \pslpred{SamePerson}(\pslarg{U_1}, \pslarg{U_2})$.

\noindent \textbf{Prior Template:} For targets where we have no information, 
we typically want to encode some prior information.
This is captured by the prior template and has the following form: 
$\pslpred{P_T}(\pslarg{A}, \pslarg{B}) = \{0,1\}$.
By setting different weights to these rules, we can vary the prior value for targets in the range $[0,1]$. 

\subsection{Piecewise pseudolikelihood}
In addition to the rules,
we also need to learn the relative weights of these rules in a PSL model.
One approach to weight learning involves optimizing the likelihood function.
However, the partition function $Z$ in the likelihood involves an integration that makes it intractable to compute.
To overcome the intractable likelihood score, pseudo-likelihood \citep{besag:jrss75} is commonly used by weight learning methods.
For HL-MRFs, the pseudo-likelihood approximates the likelihood as:
\begin{equation}
\label{eq:pseudolikelihood}
\begin{split}
P(\mathbf{Y}|\mathbf{X}) = \prod_{Y_i \in \mathbf{Y}} & \frac{1}{Z_i(\mathbf{Y_{-i}}, \mathbf{X})} \exp( -E_i(\mathbf{Y}, \mathbf{X}))  \\
\text{where}\ E_i(\mathbf{Y}, \mathbf{X})& = \sum_{j:Y_i \in \mathbf{\Phi}_j} \mathbf{w}_j \mathbf{\Phi}_j (\mathbf{Y},\mathbf{X}) \\
 Z_i(\mathbf{Y_{-i}}, \mathbf{X}) &= \int_{Y_i} \exp(-E_i( \mathbf{Y},\mathbf{X}))
\end{split}
\end{equation}
The notation $j:Y_i \in \mathbf{\Phi}_j$ selects ground rules where $Y_i$ appears.
However, due to the coupling of the rules, we also need to re-estimate the weights for the same rule in different models. 
Further, the objective function is non-convex and is hard to optimize.

To overcome these challenges, we propose to use an efficient-to-optimize objective function called \textbf{piecewise pseudolikelihood} (PPLL). 
PPLL has two key properties that make weight learning highly scalable: 1) with PPLL, the optimal weight of a rule is independent of other rules in the model; and 2) the PPLL objective is convex and admits an inherently parallelizable gradient-based algorithm for optimization.

PPLL was first proposed for weight learning in conditional random fields (CRFs)~\citep{sutton:icml07}.
For HL-MRFs, PPLL factorizes the joint conditional distribution along both RVs and rules and is defined as:
\begin{align}
\begin{split}
\label{eq:piecewiselikelihood}
P(\mathbf{Y} | \mathbf{X}) = \prod_{r \in M} \prod_{Y_{i} \in \mathbf{Y}} & \frac{1}{Z_{i}^{r}(\mathbf{Y_{-i}},\mathbf{X})}\exp(-E^r_{i}(\mathbf{Y}, \mathbf{X}))  \\
\text{where}\ E_{i}^{r}(\mathbf{Y}, \mathbf{X}) &= \sum_{j:Y_{i} \in \mathbf{\Phi}^{r}_{j}} \mathbf{w}_{j}\mathbf{\Phi}_{j}(\mathbf{Y}, \mathbf{X})\\
Z_{i}^{r}(\mathbf{Y_{-i}}, \mathbf{X}) &= \int_{Y_{i}} \exp(-E_{i}^{r}(\mathbf{Y}, \mathbf{X})) 
\end{split}
\raisetag{30pt}
\end{align}

The notation $j:Y_i \in \mathbf{\Phi}^{r}_j$ selects the ground rules generated from rule $r$ in which $Y_i$ appears.
The key advantage of PPLL over likelihood arises from the factorization of $Z$ into $Z_i^r$, which requires only ground rules corresponding to rule $r$ and variable $Y_i$ for its computation.
Following standard convention, we optimize the log of PPLL, denoted by $l_{ppll}(\mathbf{w})$.

We now show that for the log PPLL objective function, performing weight learning on the entire model containing all rules is equivalent to optimizing the weight for each rule independently. 
\begin{lemma}
\label{theorem:decompose}
Optimizing $l_{ppll}(\textbf{w})$ over the set of weights $\textbf{w}$ is equivalent to optimizing over each $\mathbf{w}_r$ separately.
\end{lemma}
\begin{proof}
By the definition of $l_{ppll}(\mathbf{w})$, and because the terms for a rule $r$ depend on $\mathbf{w}$ only through $\mathbf{w}_r$, the maximization separates across rules:
\begin{align*}
&\arg \max_{\mathbf{w}}  l_{ppll}(\mathbf{w})\\
&= \arg \max_{\mathbf{w}} \sum_{r \in M} \sum_{Y_i \in \mathbf{Y}} -E^r_{i}(\mathbf{Y}, \mathbf{X}) - \log Z_{i}^{r}(\mathbf{Y_{-i}},\mathbf{X}) \\
				&= \Big( \arg \max_{\mathbf{w}_r} \sum_{Y_i \in \mathbf{Y}} -E^r_{i}(\mathbf{Y}, \mathbf{X}) - \log Z_{i}^{r}(\mathbf{Y_{-i}},\mathbf{X}) \Big)_{r \in M}
			   \qedhere
\end{align*} 
 \end{proof}

We optimize $l_{ppll}(\mathbf{w})$ using a projected gradient descent algorithm.
The gradient for a rule weight $\mathbf{w}_r$ turns out to be the difference between the observed and expected hinge-loss potentials, summed over the corresponding ground rules $\mathbf{\Phi}^r$. 
We can compute observed penalties once and cache their values. Unlike the gradients for likelihood, each expectation term in the PPLL gradient considers a single rule. 
Thus, when evaluating gradients for weight updates, we use multi-threading to compute the expectation terms in parallel. 
The dual advantages of parallelizing and requiring weight learning only once for a rule make PPLL highly scalable.
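
As a sketch of where this gradient comes from, assume that all ground rules generated from rule $r$ share the single weight $\mathbf{w}_r$, and write $\Phi^{r}_{i}(\mathbf{Y}, \mathbf{X}) = \sum_{j:Y_{i} \in \mathbf{\Phi}^{r}_{j}} \mathbf{\Phi}_{j}(\mathbf{Y}, \mathbf{X})$, a shorthand introduced only for this illustration, so that $E^{r}_{i} = \mathbf{w}_r \Phi^{r}_{i}$.
Since $\partial \log Z^{r}_{i} / \partial \mathbf{w}_r = -\mathbb{E}[\Phi^{r}_{i}]$, differentiating the terms of $l_{ppll}(\mathbf{w})$ that involve $\mathbf{w}_r$ gives
\begin{align*}
\frac{\partial\, l_{ppll}(\mathbf{w})}{\partial \mathbf{w}_r} = \sum_{Y_i \in \mathbf{Y}} \Big( \mathbb{E}\big[\Phi^{r}_{i}(\mathbf{Y}, \mathbf{X})\big] - \Phi^{r}_{i}(\mathbf{Y}, \mathbf{X}) \Big),
\end{align*}
where the expectation is taken over $Y_i$ alone, under the density proportional to $\exp(-\mathbf{w}_r \Phi^{r}_{i}(\mathbf{Y}, \mathbf{X}))$ with $\mathbf{Y_{-i}}$ and $\mathbf{X}$ held at their training values.
This is the expected minus the observed potential for the log PPLL being maximized, or equivalently the observed minus the expected potential for its negative.
Under these assumptions each expectation is a one-dimensional integral that involves only rule $r$, which is why the terms can be evaluated independently.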

\subsection{Explainability Bias}
Having introduced the key components of our structure search, we next turn to
explainability. 
Some predicates are explainable and others are not.
As an example, in a recommender system, rules containing predicates such as $\pslpred{SimUser_{Cosine}}$ can be explained using sentences such as ``\textit{User $U_1$ who is similar to you liked this item $I$}''. 
Other predicates, such as those derived from latent factor recommendation approaches, may be harder to explain to the end-user.
We partition the predicates into explainable and non-explainable predicates.
Because explainability can be subjective, our approach is flexible, and partitions can be tuned to what seems natural at either the domain level, or even for a particular user.
Given a partition, we formally define end-user explainability of a rule as:
\begin{definition} [$\alpha$-explainable]
A rule $r$ is $\alpha$-explainable if the proportion of explainable predicates in the body of the rule is at least $\alpha$.
\end{definition}
Therefore, if a rule has no end-user explainable predicates in the body then it is a non-explainable (0-explainable) rule and if every predicate in the body of a rule is end-user explainable then it is a fully explainable (1-explainable) rule.
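\begin{example}
As a hypothetical illustration, suppose the partition marks $\pslpred{Rating}$ as explainable and a latent-factor similarity predicate $\pslpred{LatentSim}$ (a name assumed only for this example) as non-explainable.
The body of the rule $\pslpred{LatentSim}(\pslarg{U_1}, \pslarg{U_2}) \psland \pslpred{Rating}(\pslarg{U_1}, \pslarg{I}) \implies \pslpred{Rating}(\pslarg{U_2}, \pslarg{I})$ contains two predicates, one of which is explainable, so the proportion of explainable predicates is $1/2$.
The rule is therefore $\alpha$-explainable for any $\alpha < 1/2$, but not for any $\alpha > 1/2$.
\end{example}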

In applications where providing meaningful explanations to the end user is important, we may prefer models with many $\alpha$-explainable rules. A model with many $\alpha$-explainable rules can result in a greater number of predictions that are explainable.
However, this might result in a loss of predictive accuracy.
To address this trade-off at model discovery time, we introduce an explainability bias parameter $\gamma \in [0,1]$, which is the minimum proportion of rules in a model that are explainable, and tune it based on the application's needs.

\subsection{Explainable Structured Model Search}
\algoref{algo:search} outlines our proposed \SL~algorithm.
For each rule in the model, we first sample a template.
We then sample predicates for each slot in the template. 
We add all $\alpha$-explainable rules to the model, and with probability $1-\gamma$, we add a non-$\alpha$-explainable rule to the model.
A value of 1 for $\gamma$ and $\alpha$ ensures every rule in the model only contains predicates that are explainable and hence all predictions can be explained.
This ensures that the generated explanations satisfy the property of \textit{explicitness}.
Once all the rules in the model are sampled, we learn the relative importance of these rules by performing weight learning using PPLL.
We then evaluate the performance of the model $V(M)$ on the training data.
We repeat this process $N$ times and return the best performing model as the final model.

\begin{algorithm}[!t]
	\caption{\textbf{Explainable Structured Model Search (\SL)}}
	\label{algo:search}
	\algrenewcommand\algorithmicrequire{\textbf{Input:}}
	\algrenewcommand\algorithmicensure{\textbf{Output:}}
	\begin{algorithmic}
		\Require $T$: Rule templates; $L_{M}$: Max rules; $N$: Max iterations; \\ $P$: Set of predicates; $\gamma$: Explainability parameter
		\Ensure $M^{*}$: Explainable model
    	\State $score_{best} \gets -\infty$
    	\For {$i \gets 1$ \textbf{to} $N$}
    	    \State $l_M \gets 0$
		    \State $M \gets \emptyset$
    		\While{$l_{M} < L_{M}$}
    		    \State $r \gets \text{Generate Rule}(T, P, \gamma)$
    		    \State $M \gets M \cup \{r\}$
    			\State $l_{M} \gets l_{M} + 1$
    		\EndWhile
    		\State $\mathbf{w} \gets \argmax_{\mathbf{w}} l_{ppll}(\mathbf{w})$
			\If{ $V(M) > score_{best}$}
				\State $M^{*} \gets M$
				\State $score_{best} \gets V(M)$
			\EndIf
		\EndFor
	    \State \Return $M^{*}$
	\end{algorithmic}
\end{algorithm}
\begin{algorithm}[!t]
	\caption{\textbf{Generate Rule(T, P, $\gamma$)}}
	\label{algo:rule_gen}
	\algrenewcommand\algorithmicrequire{\textbf{Input:}}
	\algrenewcommand\algorithmicensure{\textbf{Output:}}
	\begin{algorithmic}
		\Require $T$: Rule templates; $P$: Set of predicates; $\gamma$: Explainability parameter
		\Ensure $r$: a rule
		\State RuleFlag $\gets$ False
		\While{RuleFlag is False}
		    \State $t \sim Unif(T)$
		    \For {Slot $s$ in $t$}
		        \State Sample $p \in P$ that satisfies domain and range constraints of the variables.
		        \State $r(s) \gets p$
		    \EndFor
		    \If {$r$ is $\alpha$-explainable}
		        \State RuleFlag $\gets$ True
		    \Else 
		        \State $g \sim Unif([0,1])$
		        \If {$g \geq \gamma$}
		            \State RuleFlag $\gets$ True
		        \EndIf
		    \EndIf
		 \EndWhile
		 \State \Return $r$
	\end{algorithmic}
\end{algorithm}

\section{Generating Explanations}
\label{sec:explanation}
We now describe our approach to generating explanations for the PSL model's predictions on new data, after a model has been learned using the \SL~approach.
The unobserved values are inferred by maximizing the likelihood of the graphical model.
The values of the unobserved target RVs $\mathbf{Y}$ depend on all the ground rules in which they appear.
We can either display these ground rules directly to the user or use a translation system that takes a ground rule as input and renders it as natural-language sentences or pictorially, as described by \citet{kouki:iui19}.
Thus, the set of explanations for a target RV $\mathbf{Y_i}$ (denoted by $\mathbf{G_i}$) is given by $\{\phi: \phi \in \mathbf{\Phi} \land \mathbf{Y_i} \in \phi\}$.

However, this set is usually large, and not all explanations are equally important.
To ensure \textit{faithfulness}, we measure the importance of each ground rule to the inferred value and display the most important rule to the user.
We define an explaining function $f$ to score the importance of ground rules.
\begin{definition}
The explaining function $f: (\mathbf{X}, \mathbf{Y}, \phi) \rightarrow \mathbbm{R}$ scores the importance of a ground rule $\phi \in \mathbf{G_i}$ with respect to a RV $\mathbf{Y_i}$. 
It is given by the norm of the first partial derivative of the weighted ground rule with respect to $\mathbf{Y_i}$, evaluated at the inferred value $y$, i.e., $f(\mathbf{X}, \mathbf{Y},\phi) = \norm{\frac{w\partial \phi(\mathbf{X,Y})}{\partial \mathbf{Y_i}}|_{y}}$.
\end{definition}
Unlike other gradient-based approaches such as integrated gradients \citep{sundararajan:icml17}, where it can be challenging to prove stability, we show in the next subsection that our approach is stable as defined in \secref{sec:problem_definition}.
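
As a worked illustration of the explaining function, consider a single ground rule under assumptions made here only for exposition: the rule is relaxed with the \L{}ukasiewicz t-norm commonly used in PSL, its potential is a squared hinge, its two body atoms have fixed values $x_1$ and $x_2$, and its head atom is the target RV with inferred value $y$. Under these assumptions,
\[
\phi = \max\{0,\, x_1 + x_2 - 1 - y\}^2, \qquad \frac{\partial \phi}{\partial y} = -2\max\{0,\, x_1 + x_2 - 1 - y\}.
\]
With illustrative values $w = 2$, $x_1 = 0.9$, $x_2 = 0.8$, and $y = 0.5$, the hinge equals $0.2$ and the score is $f = \norm{2 \cdot (-0.4)} = 0.8$. If instead $y = 0.9$, the hinge is inactive and the score is $0$, so a ground rule that is already satisfied receives the lowest possible score.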

\subsection{Stability of the explanation function}
We first observe that the energy function $E$ is a summation of squared hinges and hence $E$ is convex.
Further, the prior template described in \secref{sec:sl} acts as a regularizer on $\mathbf{Y}$, and hence $E$ is strongly convex. This was also noted by \citet{london:jmlr16}.

We state two lemmas showing that the change in the optimal energy function is bounded when the value of one of the observed RVs ($\mathbf{X}$) is changed (\lemmaref{lemma:energy}), and that this in turn bounds the change in the values of the unobserved RVs $\mathbf{Y}$ (\lemmaref{lemma:energy_bound}).
The proofs for these lemmas are given in the supplementary material.
\begin{lemma}
For a graphical model $G$ with a set of potentials $\mathbf{\Phi}$, let $Q_i$ denote the number of potentials that involve $\mathbf{X_i}$, and let $Q_G \triangleq \max_{i} Q_i$. Let $\norm{\mathbf{w}} < R$. Let $\mathbf{X}, \mathbf{X'} \in \mathcal{X}$  differ at a single coordinate $i$ by at most $\epsilon$. Then, for $\dot{\mathbf{Y}} \triangleq \argmin_{\mathbf{Y}} E(\mathbf{Y}, \mathbf{X})$ and $\dot{\mathbf{Y'}} \triangleq \argmin_{\mathbf{Y}} E(\mathbf{Y}, \mathbf{X'})$,
$ \norm{E(\dot{\mathbf{Y'}}, \mathbf{X}) - E(\dot{\mathbf{Y'}}, \mathbf{X'})} \leq \epsilon R\sqrt{Q_G} $.
\label{lemma:energy}
\end{lemma}

\begin{lemma}
Let $E: (\mathcal{Y},\mathcal{X}) \rightarrow \mathbb{R}$ be $\kappa$-strongly convex, and let  $\dot{\mathbf{Y}} \triangleq \argmin_{\mathbf{Y}} E(\mathbf{Y}, \mathbf{X})$ and $\dot{\mathbf{Y'}} \triangleq \argmin_{\mathbf{Y}} E(\mathbf{Y}, \mathbf{X'})$, where $\mathbf{X}, \mathbf{X'} \in \mathcal{X}$  differ at a single RV $\mathbf{X_i}$. Then,
$ \norm{\dot{\mathbf{Y'}} - \dot{\mathbf{Y}}}^2 \leq \frac{2}{\kappa}\norm{E(\dot{\mathbf{Y'}}, \mathbf{X}) - E(\dot{\mathbf{Y'}}, \mathbf{X'})} $.
\label{lemma:energy_bound}
\end{lemma}
We now state a lemma showing that the change in the explaining function score $f(\mathbf{X}, \mathbf{Y}, \phi)$ for a ground rule $\phi \in \mathbf{G_i}$ is bounded.
\begin{lemma}
\label{lemma:fisher}
For an explanation $\phi \in \mathbf{G_i}$, let the explaining function $f$ be defined as $f(\mathbf{X}, \mathbf{Y}, \phi) = \norm{\frac{w\partial \phi(\mathbf{X,Y})}{\partial \mathbf{Y_i}}|_{y}}$. Let $\mathbf{X}, \mathbf{X'} \in \mathcal{X}$ differ at a single RV $\mathbf{X_i}$ by at most $\epsilon$. Let $\norm{\mathbf{Y}- \mathbf{Y'}} < B$ for any two $\mathbf{Y}, \mathbf{Y'} \in \mathcal{Y}$ and $\norm{\mathbf{w}} < R$. Then, $\norm{f(\mathbf{X}, \mathbf{Y}, \phi) - f(\mathbf{X'}, \mathbf{Y'}, \phi)} \leq 2R(\epsilon+B)$.
\end{lemma}

We now prove that the explaining function $f$ is stable. 
\begin{theorem}
The explaining function $f$ is stable with respect to $M(\mathbf{X}, \mathbf{Y})$.
\end{theorem}
\begin{proof}
From \lemmaref{lemma:energy} and \lemmaref{lemma:energy_bound}, for any $\mathbf{X}, \mathbf{X'} \in \mathcal{X}$ that differ in a single RV $\mathbf{X_i}$ by at most $\epsilon$, we have $\norm{\dot{\mathbf{Y'}} - \dot{\mathbf{Y}}} \leq \sqrt{\frac{2}{\kappa}R\epsilon \sqrt{Q_G}}$.

Applying \lemmaref{lemma:fisher} with $B = \sqrt{\frac{2}{\kappa}R\epsilon \sqrt{Q_G}}$, we have $\norm{f(\mathbf{X}, \dot{\mathbf{Y}}, \phi) - f(\mathbf{X'}, \dot{\mathbf{Y'}}, \phi)} \leq 2R(\epsilon + \sqrt{\frac{2}{\kappa}R\epsilon \sqrt{Q_G}})$.
 \qedhere
\end{proof}
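
To make the bound concrete, consider the following constants, which are chosen purely for illustration and are not taken from our experiments: $\kappa = 1$, $R = 1$, $Q_G = 4$, and $\epsilon = 0.01$. Then
\[
\sqrt{\tfrac{2}{\kappa} R \epsilon \sqrt{Q_G}} = \sqrt{2 \cdot 0.01 \cdot 2} = 0.2,
\qquad
2R\Bigl(\epsilon + \sqrt{\tfrac{2}{\kappa} R \epsilon \sqrt{Q_G}}\Bigr) = 2(0.01 + 0.2) = 0.42 .
\]
More generally, the right-hand side of the bound tends to $0$ as $\epsilon \to 0$ for fixed $\kappa$, $R$, and $Q_G$.
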
\section{Experimental Evaluation}
\begin{table*}[!t]
\centering
\footnotesize
\begin{tabular}{|c|c|c|c|c|c|} 
\hline
 & \cora & \multicolumn{2}{c|}{\yelp~} & \multicolumn{2}{c|}{\lastfm~}  \\ 
\hline
 & AUPR & MAE & MSE & MAE & MSE  \\ 
\hline
\BOOST & 0.700 (0.163) & \textbf{0.196} (0.008) & 0.079 (0.007)  & 0.279 (0.058) & 0.11 (0.044) \\
\hline
$\BOOST_{PPLL}$ & 0.651 (0.186) & 0.212 (0.012) & 0.092 (0.013)  & 0.257 (0.046) & 0.0941 (0.058) \\
\hline
\PRA & 0.622 (0.169)&0.2005 (0.0004) & 0.086 (0.0004) & 0.186 (0.001) & 0.048 (0.0004) \\ 
\hline
\SL & 0.684 (0.148) & \textbf{0.193} (0.003) & \textbf{0.065} (0.008) & \textbf{0.177} (0.070) & \textbf{0.043} (0.0004) \\ 
\hline
\end{tabular}
\caption{\textbf{Metrics:} Our \SL~approach significantly outperforms other approaches on recommendation datasets and is comparable to \BOOST~on \cora. Numbers in bold are statistically significant with $p<0.05$.}
\label{tab:res_mse}
\end{table*}

\label{sec:experiments}
We investigate the following research questions empirically:
RQ1) What is the predictive accuracy of models discovered by \SL?
RQ2) What is the impact of the explainability parameter $\gamma$ on end-user explainability?
RQ3) How well can the predictions be explained?

\textbf{Datasets:} We evaluate the predictive accuracy of the discovered models on an entity resolution dataset and two recommendation datasets. Further, for the recommendation datasets, we evaluate the generated explanations. More details  are given in the supplementary material. \\
\textbf{\cora:} This is an entity resolution dataset containing 10 predicates such as the title, venue, author, words in the title, and authors that refer to the same entity.
The task is to predict publication pairs that refer to the same entity.\\
\textbf{\yelp~:} This is a restaurant recommendation dataset containing 34,454 users, 3,605 restaurants,  8,512 friendship links and 99,049 observed ratings.\\
\textbf{\lastfm~:} This is a music artist recommendation dataset containing  1,892 users, 17,632 music artists, 12,717 friendship links and  92,834 observed ratings.\\
Both recommendation datasets contain a total of 21 relations, such as user and item similarities and the output of external classifiers such as non-negative matrix factorization (NMF).
For both datasets, the task is to predict the unobserved ratings. 
We classified the relations that encode similarity between users or items as explainable
and other relations, such as the output of latent factor models (e.g., NMF), as non-explainable;
in the end, 15 of the 21 relations were classified as explainable to the end user.
To prevent the generation of a quadratic number of user-item pairs, we perform \textit{blocking}.
Blocking restricts the rating pairs by identifying the \textit{important pairs} using a simple heuristic.
We use the splits from \citet{kouki:recsys15}.

\textbf{Approaches:}
We evaluate by comparing the following structure learning methods:\\
\noindent \textbf{\BOOST \citep{khot:icdm11}}: This is a state-of-the-art structure learning approach for MLNs.
It uses Friedman's functional gradient boosting algorithm to generate a series of relational regression problems, which in turn are used to generate the rules in the model.
We use the code of \citet{khot:icdm11} with the recursion flag set to True. %\footnote{https://starling.utdallas.edu/software/boostsrl}.
\BOOST\ uses Boolean logic, so we round the values of the ground atoms to $1$ if the value is greater than $0.5$, and to $0$ otherwise. We learn 10 trees and combine the rules across the trees to generate a PSL model. We use the same weights learned by the \BOOST~approach.
Since PSL only allows positive weights, we truncate negative weights to $0$.
In addition, we also evaluate a model with the weights learned using the PPLL objective ($\BOOST_{PPLL}$). Here, we consider all rules discovered by \BOOST, including rules with negative weights. \\
\noindent \textbf{\PRA \citep{gardner2015efficient}}: \PRA~is a relational path finding algorithm that identifies paths that connect unobserved pairs by performing random walks.
We use the code of \citet{gardner2015efficient} to identify paths of length up to three.
We then convert these paths to PSL rules. 
We learn the rule weights using our proposed PPLL weight learning method. We consider all rules, including rules with negative weights. \\
\noindent {\textbf{\SL}}\footnote{Code, model, and data available at https://github.com/linqs/embar-uai22}: Our proposed approach that performs a structured search to learn an explainable PSL model.
We use the rule templates described in \secref{sec:sl}.
We set the maximum number of rules in a model to $15$, the maximum number of iterations to $100$, and $\gamma=0$.

\subsection{Predictive performance of \SL~}
We evaluate [RQ1] by comparing the predictive accuracies of \BOOST, $\BOOST_{PPLL}$, \PRA, and \SL. We compute the positive class AUPR for the \cora~dataset. For the recommendation datasets, we compute the mean squared error (MSE) and mean absolute error (MAE) after rescaling the ratings to $[0,1]$.
\tabref{tab:res_mse} shows the mean and standard deviation of the metrics computed across the 5 folds.
We perform a paired t-test to measure significance; the numbers in bold are statistically significant with $p<0.05$.
First, we observe that the \SL~ approach outperforms both versions of \BOOST~ and \PRA~ on the recommendation datasets.
On the entity resolution dataset, it outperforms \PRA~ and is comparable to \BOOST.
\PRA~ can only discover rules that are paths and this limitation hurts the performance of the model.
We next observe that the \BOOST~model performs better than the $\BOOST_{PPLL}$ model on \cora~and \yelp, but not on \lastfm.
The \BOOST~method did not learn any collective rules such as
$\pslpred{Rating}(\pslarg{A}, \pslarg{B}) \land \pslpred{SimItem}(\pslarg{B}, \pslarg{C})  \implies \pslpred{Rating}(\pslarg{A}, \pslarg{C})$.
These content-based rules are important for recommender system performance.
As a result, \SL~performs better than \BOOST~on the recommendation datasets.
The learned models for all approaches are given in the supplementary material. 

\SL~discovered social rules such as $\pslpred{Friends}(\pslarg{U_1}, \pslarg{U_2}) \land \pslpred{Rating}(\pslarg{U_1}, \pslarg{I_1}) \implies \pslpred{Rating}(\pslarg{U_2}, \pslarg{I_1})$ and similarity rules such as $\pslpred{SimItem_{Pearson}}(\pslarg{I_1}, \pslarg{I_2}) \land \pslpred{Rating}(\pslarg{U_1}, \pslarg{I_1}) \implies \pslpred{Rating}(\pslarg{U_1}, \pslarg{I_2})$. Further, the model incorporates external systems such as Bayesian Probabilistic Matrix Factorization (BPMF) with rules such as $\pslpred{BPMF}(\pslarg{U_1}, \pslarg{I_1})  \implies \pslpred{Rating}(\pslarg{U_1}, \pslarg{I_1})$.

\subsection{Trade-off between predictive accuracy and explainability}
We evaluate [RQ2] by investigating the impact of the explainability parameter $\gamma$ on a model's predictive accuracy and end-user explainability.
For each prediction, we generate a ranked list of the ground rules in $\mathbf{G_i}$ and compute the \emph{mean explainable precision} (MEP@K) \citep{abdollahi:www16}, which represents the fraction of ratings that are explainable.
MEP@K is defined as $\frac{1}{|\mathbf{Y}|}\sum_{i=1}^{|\mathbf{Y}|}\mathcal{E}_{K}(\mathbf{G_i})$, where $\mathcal{E}_{K}(\mathbf{G_i})$ is $1$ if at least one of the top-$K$ ranked rules in $\mathbf{G_i}$ is explainable and $0$ otherwise.
We consider a rule to be explainable if it contains at least one explainable predicate $(\alpha = 0.25)$.
We varied $\gamma$ from 0 to 1 and computed the MEP@1 and MSE of all generated models.
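
As a toy illustration of MEP@1 (the numbers below are hypothetical and not drawn from either dataset), suppose there are four target ratings and the top-ranked ground rule is explainable for three of them. Then $\mathcal{E}_1(\mathbf{G_i})$ equals $1$ for three ratings and $0$ for the fourth, so
\[
\text{MEP@1} = \frac{1}{4}(1 + 1 + 1 + 0) = 0.75 .
\]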

\figref{fig:fid_vs_gamma} shows the change in MSE and MEP@1 as we vary $\gamma$ for the \lastfm~dataset.
We observe that, not surprisingly, we generate models with more explainable rules as we increase $\gamma$.
However, the MSE also increases slightly.
This is because the model then excludes non-explainable rules, such as those based on latent factor models, that have high predictive accuracy.
We found a similar pattern on the \yelp~dataset.
\begin{figure}[!t]
    \centering
    	    \resizebox{0.32\textwidth}{!}{\input{explainability_vs_gamma.tex}}
    	    \caption{\textbf{MEP vs MSE for \lastfm:}   As $\gamma$ increases, the models become more explainable and have a slightly higher MSE.}
    	    \label{fig:fid_vs_gamma}
\end{figure}
\begin{figure}[!t]
    \centering
            \resizebox{0.30\textwidth}{!}{\input{lastfm_explainability.tex}}
    	    \caption{\textbf{MEP@K for \lastfm:} MEP increases for all approaches as we increase $K$. \SL~with $\gamma > 0.7$ outperforms \BOOST~and \PRA.}
	        \label{fig:fidelity}
\end{figure}

\subsubsection{Analysis of Explanations}

We evaluate [RQ3] by analyzing the MEP for all models at $K \in \{1,2,3\}$.
\figref{fig:fidelity} shows the MEP@K for various approaches.
As we increase the value of $K$, the MEP value increases for all approaches.
For \SL, we get an MEP of 1 for $\gamma > 0.7$ for all $K$.
\PRA~has an MEP close to $0.9$ due to the large number of rules in the model.
\BOOST~starts with an MEP close to $0.5$ at $K=1$, but its MEP increases rapidly as we increase $K$.

As a concrete example, we examine the most important explanation identified by our approach for a rating in the \lastfm~dataset.
For the pair $(User12, Artist5)$, \SL~identified the most important rule as $\pslpred{MF}(\pslarg{User12}, \pslarg{Artist5}) \pslthen \pslpred{Rating}(\pslarg{User12}, \pslarg{Artist5})$ when $\gamma$ was set to 0.
This is a non-explainable rule.
When we set $\gamma=1$, the most important ground rule became $\pslpred{Rating}(\pslarg{User12}, \pslarg{Artist29}) \psland
\pslpred{SimItem_{jaccard}}(\pslarg{Artist29}, \pslarg{Artist5}) \pslthen \pslpred{Rating}(\pslarg{User12}, \pslarg{Artist5})$.
This rule is explainable.

\section{Conclusion}
\label{sec:conclusion}
We proposed an efficient approach to learning explainable templated graphical models that trades off between performance and explainability.  Our explanation framework satisfies the properties of explicitness, faithfulness and stability and our search algorithm integrates efficient structure and weight learning.
We show that we can learn more explainable models than existing state-of-the-art approaches without compromising much on accuracy.
Our work suggests several future directions. Latent predicates are crucial for improving model performance, and we plan to extend our approach to handle them.
In addition, we could incorporate  end-user preferences  into the explanation ranking.

\section{Acknowledgements}
This work was partially supported by National Science Foundation grants CCF-1740850 and CCF-2023495 and by an unrestricted gift from Google.
\bibliography{references}

\end{document}
