# Theoretical Proofs for KARMA Framework

## Proof of Theorem 1: Convergence of Causal Structure Learning

**Theorem 1:** Under assumptions of causal sufficiency, faithfulness, and sufficient data, the knowledge-constrained PC algorithm converges to the true causal graph (or its Markov equivalence class) with probability at least $1-\delta$, where $P(\mathcal{G}_{est} \neq \mathcal{G}_{true}) \leq \delta \leq C \cdot \exp(-c \cdot n)$.

### Proof:

The proof proceeds by establishing that the knowledge-constrained PC algorithm maintains the correctness guarantees of the standard PC algorithm while improving convergence through knowledge constraints.

**Step 1: Establish the foundation from standard PC algorithm theory**

The standard PC algorithm is known to be sound and complete under the assumptions of causal sufficiency, faithfulness, and perfect conditional independence tests. Specifically, if $\mathcal{G}_{true}$ is the true causal DAG and $\mathcal{D}$ represents the data distribution faithful to $\mathcal{G}_{true}$, then the PC algorithm returns a graph $\mathcal{G}_{est}$ such that $\mathcal{G}_{est}$ and $\mathcal{G}_{true}$ are Markov equivalent.

**Step 2: Analyze the effect of knowledge constraints**

Let $\mathcal{K}$ represent the knowledge constraints derived from the knowledge graph. We assume that these constraints are consistent with the true causal structure, i.e., for any constraint $c \in \mathcal{K}$, we have $c(\mathcal{G}_{true}) = \text{true}$.

The knowledge-constrained PC algorithm modifies the conditional independence testing procedure by adjusting the significance threshold based on knowledge consistency:

$$\alpha_{adjusted}(X, Y | S) = \alpha \cdot (1 + \lambda_{kc} \cdot \text{KnowledgeConsistency}(X, Y, S))$$

where $\text{KnowledgeConsistency}(X, Y, S) \in [-1, 1]$ measures how well the potential edge $(X, Y)$ aligns with the knowledge constraints.
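
The adjustment can be sketched in a few lines of Python. This is a minimal illustration rather than the framework's implementation: the function name `adjusted_alpha`, the default value of `lambda_kc`, and the clipping to $(0, 1)$ are assumptions added here so that the result stays a valid significance level.

```python
def adjusted_alpha(alpha: float, consistency: float, lambda_kc: float = 0.5) -> float:
    """Scale the significance level by the knowledge-consistency score.

    `consistency` is expected in [-1, 1]; `lambda_kc` is a hypothetical default.
    """
    if not -1.0 <= consistency <= 1.0:
        raise ValueError("consistency must lie in [-1, 1]")
    raw = alpha * (1.0 + lambda_kc * consistency)
    # Clipping is an added safeguard (assumption): keeps the level inside (0, 1).
    return min(max(raw, 1e-12), 1.0 - 1e-12)


# Knowledge supports the edge: the level rises, so dependence is easier to detect.
print(adjusted_alpha(0.05, 1.0))
# Knowledge contradicts the edge: the level falls, so the edge is removed more readily.
print(adjusted_alpha(0.05, -1.0))
```

With $\lambda_{kc} \in [0, 1)$ the unclipped value is already positive, so the clipping only matters for larger $\lambda_{kc}$.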

**Step 3: Prove that knowledge constraints preserve correctness**

We need to show that the modified algorithm does not introduce false positives or false negatives beyond those that would occur in the standard PC algorithm.

*Case 1: True Independence* 
If $X \perp\kern-5pt\perp Y | S$ in the true distribution, then the conditional independence test should accept the null hypothesis. When knowledge supports the independence, the potential edge $(X, Y)$ conflicts with the constraints, so $\text{KnowledgeConsistency}(X, Y, S) < 0$ and the effective significance level decreases, making the test more conservative. This can only reduce false positives (incorrectly rejecting independence), not increase them.

*Case 2: True Dependence*
If $X \not\perp\kern-5pt\perp Y | S$ in the true distribution, then the conditional independence test should reject the null hypothesis. When knowledge supports the dependence, the effective significance level is increased, making it easier to detect the dependence. Knowledge that contradicts a true dependence would violate the consistency assumption of Step 2, so this case is excluded by assumption; if it occurs in practice, the lowered significance level reduces the power of the test and the edge may be removed incorrectly.

**Step 4: Establish finite sample convergence rate**

For the finite sample case, we use concentration inequalities for conditional independence tests. Let $T_{n}(X, Y | S)$ be the test statistic for conditional independence with $n$ samples. Under the null hypothesis of independence:

$$P(|T_n(X, Y | S)| > t) \leq 2\exp\left(-\frac{nt^2}{2\sigma^2}\right)$$

where $\sigma^2$ is the variance of the test statistic.

The knowledge-constrained algorithm modifies the rejection threshold from $t_{\alpha}$ to $t_{\alpha_{adjusted}}$. Since knowledge constraints are assumed to be correct, this modification can only improve the convergence rate by:

1. Reducing false edge removals when knowledge supports the edge
2. Accelerating correct edge removals when knowledge contradicts the edge

**Step 5: Derive the final convergence bound**

The probability of error in the knowledge-constrained PC algorithm is bounded by:

$$P(\mathcal{G}_{est} \neq \mathcal{G}_{true}) \leq \sum_{X,Y,S} P(\text{incorrect decision on } (X,Y|S))$$

Each individual test error is bounded by the concentration inequality evaluated at the rejection threshold $t_{\alpha}$, which is treated here as a constant that does not shrink with $n$. With $m$ total conditional independence tests performed, and assuming knowledge constraints reduce the error probability by a factor $(1-\kappa)$ where $\kappa > 0$:

$$P(\mathcal{G}_{est} \neq \mathcal{G}_{true}) \leq m \cdot (1-\kappa) \cdot 2\exp\left(-\frac{n t_{\alpha}^2}{2\sigma^2}\right)$$

Setting $C = 2m(1-\kappa)$ and $c = t_{\alpha}^2/(2\sigma^2)$, we obtain:

$$P(\mathcal{G}_{est} \neq \mathcal{G}_{true}) \leq C \cdot \exp(-c \cdot n)$$
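
Read the other way round, the bound gives the sample size needed for a target error level: $C \exp(-c n) \leq \delta$ holds once $n \geq \ln(C/\delta)/c$. The sketch below evaluates both directions. The function names are hypothetical, `threshold` stands for the quantity that is squared in the exponent of $c$, and the numeric inputs are illustrative values, not results from the paper.

```python
import math


def error_bound(n: int, m: int, kappa: float, threshold: float, sigma: float) -> float:
    """Evaluate C * exp(-c * n) with C = 2m(1 - kappa), c = threshold^2 / (2 sigma^2)."""
    C = 2.0 * m * (1.0 - kappa)
    c = threshold ** 2 / (2.0 * sigma ** 2)
    return C * math.exp(-c * n)


def samples_for_delta(delta: float, m: int, kappa: float, threshold: float, sigma: float) -> int:
    """Smallest n with C * exp(-c * n) <= delta, i.e. n >= ln(C / delta) / c."""
    C = 2.0 * m * (1.0 - kappa)
    c = threshold ** 2 / (2.0 * sigma ** 2)
    return max(0, math.ceil(math.log(C / delta) / c))


# Illustrative inputs only: 100 tests, kappa = 0.2, threshold 0.1, sigma = 1.
n_needed = samples_for_delta(0.05, m=100, kappa=0.2, threshold=0.1, sigma=1.0)
print(n_needed, error_bound(n_needed, 100, 0.2, 0.1, 1.0))
```

Because $C$ grows linearly in $m$, the required $n$ grows only logarithmically in the number of tests.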

This completes the proof of Theorem 1. □

## Proof of Theorem 2: Policy Invariance for Knowledge-Based Shaping

**Theorem 2:** If $R_{knowledge}$ is designed as a potential-based shaping function derived from the knowledge graph, i.e., $R_{knowledge}(s, a, s') = \gamma \Phi(s') - \Phi(s)$ for some potential function $\Phi: \mathcal{S} \rightarrow \mathbb{R}$, and the knowledge is consistent with the true optimal policy, then it does not alter the optimal policy in the original MDP.

### Proof:

This theorem extends the classical result on potential-based reward shaping to the knowledge-augmented setting.

**Step 1: Define the augmented MDP**

Let $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$ be the original MDP and $\mathcal{M}' = (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}', \gamma)$ be the augmented MDP where:

$$\mathcal{R}'(s, a, s') = \mathcal{R}(s, a, s') + R_{knowledge}(s, a, s') = \mathcal{R}(s, a, s') + \gamma \Phi(s') - \Phi(s)$$

**Step 2: Relate the value functions**

For any policy $\pi$, the value function in the augmented MDP is:

$$V'_\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t \mathcal{R}'(s_t, a_t, s_{t+1}) \mid s_0 = s\right]$$

Substituting the definition of $\mathcal{R}'$:

$$V'_\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t (\mathcal{R}(s_t, a_t, s_{t+1}) + \gamma \Phi(s_{t+1}) - \Phi(s_t)) \mid s_0 = s\right]$$

**Step 3: Separate the terms**

$$V'_\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t \mathcal{R}(s_t, a_t, s_{t+1}) \mid s_0 = s\right] + \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t (\gamma \Phi(s_{t+1}) - \Phi(s_t)) \mid s_0 = s\right]$$

The first term is simply $V_\pi(s)$, the value function in the original MDP.

**Step 4: Evaluate the telescoping sum**

For the second term, we have a telescoping series:

$$\mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t (\gamma \Phi(s_{t+1}) - \Phi(s_t)) \mid s_0 = s\right]$$

$$= \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} (\gamma^{t+1} \Phi(s_{t+1}) - \gamma^t \Phi(s_t)) \mid s_0 = s\right]$$

$$= \mathbb{E}_\pi\left[\lim_{T \to \infty} (\gamma^{T+1} \Phi(s_{T+1}) - \Phi(s_0)) \mid s_0 = s\right]$$

**Step 5: Apply the boundedness assumption**

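Assuming that $\Phi$ is bounded and $\gamma < 1$, we have $\lim_{T \to \infty} \gamma^{T+1} \Phi(s_{T+1}) = 0$ along every trajectory, and boundedness allows the limit to be exchanged with the expectation (dominated convergence). Therefore: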

$$\mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t (\gamma \Phi(s_{t+1}) - \Phi(s_t)) \mid s_0 = s\right] = -\Phi(s)$$
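
The telescoping step can be checked numerically on a single trajectory. The sketch below draws arbitrary bounded potential values (an assumption made purely for illustration) and compares the discounted sum with its closed form $\gamma^{T+1}\Phi(s_{T+1}) - \Phi(s_0)$.

```python
import random


def shaping_return(phis, gamma):
    """Discounted sum of gamma * Phi(s_{t+1}) - Phi(s_t) along one trajectory.

    phis[t] holds Phi(s_t) for t = 0, ..., T + 1.
    """
    return sum(
        gamma ** t * (gamma * phis[t + 1] - phis[t])
        for t in range(len(phis) - 1)
    )


random.seed(0)
gamma = 0.9
phis = [random.uniform(-1.0, 1.0) for _ in range(501)]  # bounded potentials, T = 499

total = shaping_return(phis, gamma)
T = len(phis) - 2
closed_form = gamma ** (T + 1) * phis[-1] - phis[0]

print(abs(total - closed_form))  # gap between the sum and its telescoped form
print(abs(total + phis[0]))      # distance from the limiting value -Phi(s_0)
```

Both gaps should be at the level of floating-point rounding, since $\gamma^{T+1}$ is negligible for a horizon of this length.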

**Step 6: Establish the relationship between value functions**

Combining the results:

$$V'_\pi(s) = V_\pi(s) - \Phi(s)$$

This relationship holds for any policy $\pi$.

**Step 7: Prove policy invariance**

For any two policies $\pi_1$ and $\pi_2$:

$$V'_{\pi_1}(s) - V'_{\pi_2}(s) = (V_{\pi_1}(s) - \Phi(s)) - (V_{\pi_2}(s) - \Phi(s)) = V_{\pi_1}(s) - V_{\pi_2}(s)$$

Therefore, the relative ordering of policies is preserved. If $\pi^*$ is optimal in $\mathcal{M}$ (i.e., $V_{\pi^*}(s) \geq V_\pi(s)$ for all $\pi$ and $s$), then $\pi^*$ is also optimal in $\mathcal{M}'$ (i.e., $V'_{\pi^*}(s) \geq V'_\pi(s)$ for all $\pi$ and $s$).
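
As a sanity check on Steps 6 and 7, the following sketch builds a small random MDP, applies a potential-based shaping term, and compares exact value functions and greedy optimal policies before and after shaping. The MDP, the potential, and the helper names are all hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random transition kernel P[s, a, s'] and reward R[s, a, s'] (illustrative only).
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.normal(size=(n_states, n_actions, n_states))
Phi = rng.normal(size=n_states)  # arbitrary bounded potential

# Shaped reward R'(s, a, s') = R(s, a, s') + gamma * Phi(s') - Phi(s).
R_shaped = R + gamma * Phi[None, None, :] - Phi[:, None, None]


def evaluate(policy, reward):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = r_pi."""
    idx = np.arange(n_states)
    P_pi = P[idx, policy]                            # shape (S, S')
    r_pi = (P_pi * reward[idx, policy]).sum(axis=1)  # expected one-step reward
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)


def greedy_optimal_policy(reward, iters=2000):
    """Value iteration followed by a greedy read-out of the Q-values."""
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = (P * (reward + gamma * V[None, None, :])).sum(axis=2)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)


policy = rng.integers(n_actions, size=n_states)  # an arbitrary deterministic policy
V = evaluate(policy, R)
V_shaped = evaluate(policy, R_shaped)
same_values = np.allclose(V_shaped, V - Phi)
same_policy = np.array_equal(greedy_optimal_policy(R), greedy_optimal_policy(R_shaped))
print(same_values, same_policy)
```

The shaped Q-values differ from the original ones by $-\Phi(s)$, which does not depend on the action, so the argmax over actions is unaffected.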

This completes the proof of Theorem 2. □

## Proof of Theorem 3: Sample Efficiency Improvement

**Theorem 3:** If the causal model correctly identifies direct causes of reward and $R_{causal}$ provides consistent guidance, KARMA can achieve a provable reduction in sample complexity compared to learning with $r$ alone. The sample complexity $N_{KARMA}$ for achieving an $\epsilon$-optimal policy can be bounded by $N_{KARMA} = O((1-\kappa) N_{standard})$, where $\kappa \in [0,1)$ is an efficiency gain factor.

### Proof:

The proof establishes that the additional information provided by knowledge and causal insights reduces the effective exploration space and improves the signal-to-noise ratio in policy learning.

**Step 1: Define the sample complexity framework**

Let $N_{standard}(\epsilon, \delta)$ be the sample complexity required for a standard RL algorithm to find an $\epsilon$-optimal policy with probability at least $1-\delta$. We want to show that KARMA requires $N_{KARMA}(\epsilon, \delta) = O((1-\kappa) N_{standard}(\epsilon, \delta))$ samples.

**Step 2: Analyze the effect of knowledge-based reward shaping**

The knowledge-based component $R_{knowledge}$ provides additional structure to the reward function. If the knowledge is accurate, it effectively reduces the variance of the reward signal by providing consistent guidance toward good states and actions.

Let $\sigma^2_r$ be the variance of the original reward signal and $\sigma^2_{r'}$ be the variance of the augmented reward signal. Under the assumption that knowledge provides consistent guidance:

$$\sigma^2_{r'} \leq (1-\kappa_K) \sigma^2_r$$

where $\kappa_K > 0$ represents the variance reduction factor from knowledge.

**Step 3: Analyze the effect of causal reward adjustment**

The causal component $R_{causal}$ provides information about the true causal effect of actions on rewards. This reduces the confounding effects of spurious correlations and improves the signal quality.

Let $\text{SNR}_r$ be the signal-to-noise ratio of the original reward and $\text{SNR}_{r'}$ be the signal-to-noise ratio of the augmented reward. The causal adjustment improves the signal-to-noise ratio:

$$\text{SNR}_{r'} \geq (1+\kappa_C) \text{SNR}_r$$

where $\kappa_C > 0$ represents the signal improvement factor from causal insights.

**Step 4: Apply concentration inequalities**

For policy gradient methods, the sample complexity is typically bounded by:

$$N_{standard}(\epsilon, \delta) = O\left(\frac{\sigma^2_r}{\epsilon^2 \cdot \text{SNR}_r} \log\frac{1}{\delta}\right)$$

With the improvements from KARMA:

$$N_{KARMA}(\epsilon, \delta) = O\left(\frac{(1-\kappa_K)\sigma^2_r}{\epsilon^2 \cdot (1+\kappa_C)\text{SNR}_r} \log\frac{1}{\delta}\right)$$

**Step 5: Derive the efficiency gain factor**

The ratio of the two bounds, taking the same hidden constant in both, is:

$$\frac{N_{KARMA}(\epsilon, \delta)}{N_{standard}(\epsilon, \delta)} = \frac{(1-\kappa_K)}{(1+\kappa_C)}$$

Setting $\kappa = 1 - \frac{(1-\kappa_K)}{(1+\kappa_C)} = \frac{\kappa_K + \kappa_C}{1+\kappa_C}$, we obtain:

$$N_{KARMA}(\epsilon, \delta) = (1-\kappa) N_{standard}(\epsilon, \delta)$$
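
For a concrete feel, hypothetical values $\kappa_K = 0.2$ and $\kappa_C = 0.25$ give $\kappa = 1 - 0.8/1.25 = 0.36$. The short sketch below computes the same quantity; the function name and the input values are illustrative assumptions, not measurements.

```python
def efficiency_gain(kappa_k: float, kappa_c: float) -> float:
    """kappa = 1 - (1 - kappa_K) / (1 + kappa_C)."""
    if not (0.0 <= kappa_k < 1.0) or kappa_c < 0.0:
        raise ValueError("expected 0 <= kappa_K < 1 and kappa_C >= 0")
    return 1.0 - (1.0 - kappa_k) / (1.0 + kappa_c)


kappa = efficiency_gain(kappa_k=0.2, kappa_c=0.25)
print(kappa)                # efficiency gain factor
print((1.0 - kappa) * 1e6)  # KARMA samples if the baseline needed 1e6 (hypothetical)
```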

**Step 6: Establish the bounds on $\kappa$**

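Since $\kappa_K, \kappa_C > 0$, the closed form $\kappa = \frac{\kappa_K + \kappa_C}{1+\kappa_C}$ gives $\kappa > 0$. The upper bound $\kappa < 1$ is equivalent to $\kappa_K < 1$, which holds whenever the augmented reward retains some variance; this matches the intuition that knowledge and causal insights, while helpful, cannot reduce the sample complexity to zero (some exploration is always necessary).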

The efficiency gain factor $\kappa$ depends on:
- The quality and relevance of the domain knowledge ($\kappa_K$)
- The accuracy of the learned causal model ($\kappa_C$)
- The alignment between knowledge/causal insights and the true optimal policy

This completes the proof of Theorem 3. □

## Proof of Theorem 4: Convergence of KARMA-RL

**Theorem 4:** Under assumptions of a stationary environment, sufficiently accurate causal model $\mathcal{C}$, and appropriate dynamic weighting, the RL agent learning with KARMA-adjusted reward converges to a policy $\pi_K$ such that $||V^{\pi_K} - V^{\pi*}||_{\infty} \leq \epsilon_K + \epsilon_C$.

### Proof:

This proof establishes that KARMA maintains the convergence guarantees of the underlying RL algorithm while bounding the error introduced by imperfect knowledge and causal modeling.

**Step 1: Decompose the total error**

The total error in the learned policy can be decomposed into three components:
1. Approximation error from the base RL algorithm: $\epsilon_{RL}$
2. Error from imperfect knowledge: $\epsilon_K$
3. Error from imperfect causal modeling: $\epsilon_C$

**Step 2: Establish convergence of the base algorithm**

Assume the base RL algorithm (e.g., PPO, SAC) converges to an $\epsilon_{RL}$-optimal policy in the augmented MDP with reward $\mathcal{R}'$. This is a standard assumption for most modern RL algorithms under appropriate conditions.

**Step 3: Bound the knowledge error**

The knowledge error arises from inaccuracies in the domain knowledge or its representation. Let $\Phi_{true}$ be the true potential function that would provide optimal guidance, and $\Phi_{est}$ be the estimated potential function from the knowledge graph.

The knowledge-based reward error is:
$$|R_{knowledge}(s,a,s') - R_{true\_knowledge}(s,a,s')| = |\gamma(\Phi_{est}(s') - \Phi_{true}(s')) - (\Phi_{est}(s) - \Phi_{true}(s))|$$

$$\leq \gamma|\Phi_{est}(s') - \Phi_{true}(s')| + |\Phi_{est}(s) - \Phi_{true}(s)| \leq (1+\gamma)||\Phi_{est} - \Phi_{true}||_{\infty}$$

If the knowledge representation error is bounded by $\delta_K$, then:
$$||R_{knowledge} - R_{true\_knowledge}||_{\infty} \leq (1+\gamma)\delta_K$$
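
The bound can be checked numerically by perturbing a potential entrywise by at most $\delta_K$ and measuring the worst-case change in the shaping term. The potentials below are random placeholders, and the helper name `shaping` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, gamma, delta_k = 50, 0.95, 0.1

phi_true = rng.normal(size=n_states)
# Perturb each entry by at most delta_k, so ||Phi_est - Phi_true||_inf <= delta_k.
phi_est = phi_true + rng.uniform(-delta_k, delta_k, size=n_states)


def shaping(phi):
    """Matrix F[s, s'] = gamma * phi[s'] - phi[s]; the term does not depend on the action."""
    return gamma * phi[None, :] - phi[:, None]


worst_gap = np.abs(shaping(phi_est) - shaping(phi_true)).max()
print(worst_gap, (1 + gamma) * delta_k)  # observed gap vs. the (1 + gamma) * delta_K bound
```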

**Step 4: Bound the causal error**

The causal error arises from inaccuracies in the learned causal model. Let $\mathcal{C}_{true}$ be the true causal structure and $\mathcal{C}_{est}$ be the estimated structure.

The causal reward error depends on:
1. Structural errors in the causal graph
2. Functional form errors in the SCM
3. Estimation errors in the counterfactual computation

Under the assumption that the causal model is sufficiently accurate, we can bound:
$$||R_{causal} - R_{true\_causal}||_{\infty} \leq \delta_C$$

where $\delta_C$ depends on the quality of causal discovery and SCM estimation.

**Step 5: Apply the dynamic weighting analysis**

The KARMA reward is:
$$\mathcal{R}'(s,a,s') = \mathcal{R}(s,a,s') + w_K(t) R_{knowledge}(s,a,s') + w_C(t) R_{causal}(s,a,s')$$

The total reward error is bounded by:
$$||\mathcal{R}' - \mathcal{R}_{ideal}||_{\infty} \leq w_K(t) \cdot (1+\gamma)\delta_K + w_C(t) \cdot \delta_C$$

**Step 6: Establish the convergence bound**

Using the performance difference lemma for MDPs, the value function error is bounded by:

$$||V^{\pi_K} - V^{\pi*}||_{\infty} \leq \frac{1}{1-\gamma} \left( \epsilon_{RL} + w_K(\infty) \cdot (1+\gamma)\delta_K + w_C(\infty) \cdot \delta_C \right)$$

**Step 7: Apply the dynamic weighting assumptions**

Under appropriate dynamic weighting where:
- $w_K(t) \to w_K^*$ as $t \to \infty$
- $w_C(t) \to w_C^*$ as $t \to \infty$
- The weights are chosen to minimize the total error

Setting $\epsilon_K = \frac{w_K^* \cdot (1+\gamma)\delta_K}{1-\gamma}$ and $\epsilon_C = \frac{w_C^* \cdot \delta_C}{1-\gamma}$, and assuming the base RL algorithm converges exactly ($\epsilon_{RL} \to 0$), we obtain:

$$||V^{\pi_K} - V^{\pi*}||_{\infty} \leq \epsilon_K + \epsilon_C$$
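
The definitions of $\epsilon_K$ and $\epsilon_C$ translate directly into code. The sketch below uses a hypothetical function name and illustrative inputs to show how the $1/(1-\gamma)$ factor amplifies the reward-level errors.

```python
def karma_value_gap_bound(w_k, w_c, delta_k, delta_c, gamma):
    """Return (eps_K, eps_C, eps_K + eps_C) from the definitions in Step 7."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    eps_k = w_k * (1.0 + gamma) * delta_k / (1.0 - gamma)
    eps_c = w_c * delta_c / (1.0 - gamma)
    return eps_k, eps_c, eps_k + eps_c


# Illustrative limiting weights and error levels (not values from the paper).
eps_k, eps_c, total = karma_value_gap_bound(
    w_k=0.5, w_c=0.5, delta_k=0.01, delta_c=0.02, gamma=0.9
)
print(eps_k, eps_c, total)
```

With $\gamma = 0.9$ every reward-level error is multiplied by $1/(1-\gamma) = 10$, so smaller limiting weights $w_K^*$ and $w_C^*$ tighten the bound proportionally.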

**Step 8: Perfect knowledge and causal model case**

When both the knowledge and causal model are perfect ($\delta_K = \delta_C = 0$), the bound of Step 6 becomes:
$$||V^{\pi_K} - V^{\pi*}||_{\infty} \leq \frac{\epsilon_{RL}}{1-\gamma}$$

This means KARMA converges to the same solution as the base RL algorithm would in the ideal case, confirming that perfect knowledge and causal understanding do not hurt performance.

This completes the proof of Theorem 4. □

## Additional Theoretical Results

### Lemma 1: Stability of Knowledge Graph Embeddings

**Lemma 1:** Under mild regularity conditions, the knowledge graph embeddings learned by the TransE model are stable with respect to small perturbations in the knowledge graph structure.

**Proof Sketch:** The proof follows from the Lipschitz continuity of the TransE objective function and the stability of gradient-based optimization methods. Small changes in the knowledge graph (addition or removal of a few edges) result in bounded changes in the learned embeddings.
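
One ingredient of this argument can be illustrated directly. The TransE dissimilarity $\|h + r - t\|_2$ of a single triple is 1-Lipschitz in each embedding by the reverse triangle inequality, so perturbing the head embedding changes it by at most the norm of the perturbation. The sketch below checks this on random vectors; it illustrates only the per-triple distance, not the stability of the full training procedure, and the dimension and perturbation scale are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 16


def transe_distance(h, r, t):
    """TransE dissimilarity ||h + r - t||_2 for one (head, relation, tail) triple."""
    return np.linalg.norm(h + r - t)


h, r, t = (rng.normal(size=dim) for _ in range(3))
perturbation = 1e-2 * rng.normal(size=dim)

change = abs(transe_distance(h + perturbation, r, t) - transe_distance(h, r, t))
print(change, np.linalg.norm(perturbation))  # observed change vs. its upper bound
```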

### Lemma 2: Identifiability of Causal Effects

**Lemma 2:** Under the assumptions of causal sufficiency and the availability of interventional data (through the RL agent's actions), the causal effects used in the reward adjustment are identifiable.

**Proof Sketch:** The proof uses the theory of causal identifiability from Pearl's causal hierarchy. The RL setting naturally provides interventional data through the agent's action choices, which enables the identification of causal effects that would not be identifiable from observational data alone.

### Lemma 3: Robustness to Knowledge Inconsistencies

**Lemma 3:** KARMA maintains bounded performance degradation even when the domain knowledge contains inconsistencies or errors, provided the error rate is below a threshold.

**Proof Sketch:** The proof shows that the dynamic weighting mechanism can adapt to reduce the influence of inconsistent knowledge over time, and the causal learning component can partially compensate for knowledge errors by learning from data.

These results characterize the convergence, stability, and robustness properties of the KARMA framework under the stated assumptions and provide theoretical support for the benefits observed in experiments.

