# Detailed Algorithms and Pseudocode for KARMA Framework

## Algorithm 1: Knowledge Graph Construction and Embedding

The knowledge representation module forms the foundation of KARMA by converting domain knowledge into structured, learnable representations. This algorithm details the process of constructing knowledge graphs from various sources and learning embeddings that can be integrated with reinforcement learning agents.

### Input Processing and Knowledge Graph Construction

Knowledge graph construction begins by aggregating domain knowledge from multiple sources, including expert rules, ontologies, and textual descriptions. The algorithm processes these heterogeneous sources and merges them into a unified graph representation.

```
Algorithm 1.1: Knowledge Graph Construction
Input: Domain knowledge sources K_sources = {rules, ontologies, text}
Output: Knowledge graph G = (E, R_kg, T)

1: Initialize empty sets E ← ∅, R_kg ← ∅, T ← ∅
2: for each source s in K_sources do
3:    entities_s ← extract_entities(s)
4:    relations_s ← extract_relations(s)
5:    triples_s ← extract_triples(s)
6:    E ← E ∪ entities_s
7:    R_kg ← R_kg ∪ relations_s
8:    T ← T ∪ triples_s
9: end for
10: G ← construct_graph(E, R_kg, T)
11: G ← resolve_conflicts(G)  // Handle conflicting information
12: G ← validate_consistency(G)  // Ensure logical consistency
13: return G
```

The entity extraction process employs named entity recognition for textual sources and structured parsing for formal ontologies. Relation extraction utilizes dependency parsing and semantic role labeling to identify relationships between entities. The conflict resolution step addresses inconsistencies between different knowledge sources using confidence scores and source reliability metrics.
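The conflict resolution step can be illustrated with a minimal sketch. It assumes that each extracted triple carries an extractor confidence, that each source has a reliability weight, and that two triples conflict when they share a head and relation but disagree on the tail. That functional-relation rule is a simplification introduced here, and the names `Triple`, `resolve_conflicts`, and the example entities are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str
    source: str        # which knowledge source produced the triple
    confidence: float  # extractor confidence in [0, 1]


def resolve_conflicts(triples, source_reliability):
    """Keep, for every (head, relation) pair, the triple with the highest
    confidence * source-reliability score.

    Assumes relations are functional (one tail per head and relation),
    which is an illustrative simplification.
    """
    best = {}
    for t in triples:
        score = t.confidence * source_reliability.get(t.source, 0.0)
        key = (t.head, t.relation)
        if key not in best or score > best[key][0]:
            best[key] = (score, t)
    return [t for _, t in best.values()]


triples = [
    Triple("valve_A", "controls", "pump_1", source="rules", confidence=0.9),
    Triple("valve_A", "controls", "pump_2", source="text", confidence=0.95),
    Triple("pump_1", "feeds", "tank_3", source="ontologies", confidence=0.8),
]
reliability = {"rules": 1.0, "ontologies": 0.9, "text": 0.6}
resolved = resolve_conflicts(triples, reliability)
```

In this toy input the expert rule wins over the text-derived triple because its combined score (0.9) exceeds that of the text source (0.57), even though the text extractor reported a higher raw confidence.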

### Knowledge Graph Embedding Learning

Once the knowledge graph is constructed, the algorithm learns dense vector representations for entities and relations using the TransE embedding model, which represents each relation as a translation from the head entity vector to the tail entity vector.

```
Algorithm 1.2: Knowledge Graph Embedding Learning
Input: Knowledge graph G = (E, R_kg, T), embedding dimension d
Output: Entity embeddings {e_i ∈ R^d}, relation embeddings {r_j ∈ R^d}

1: Initialize entity embeddings E_emb ← random_normal(|E|, d)
2: Initialize relation embeddings R_emb ← random_normal(|R_kg|, d)
3: Initialize optimizer Adam with learning rate α_kg
4: for epoch = 1 to max_epochs do
5:    for each triple (h, r, t) in T do
6:       // Positive sample
7:       score_pos ← ||E_emb[h] + R_emb[r] - E_emb[t]||_2
8:       
9:       // Generate negative samples
10:      (h', r', t') ← corrupt_triple(h, r, t)
11:      score_neg ← ||E_emb[h'] + R_emb[r'] - E_emb[t']||_2
12:      
13:      // Margin-based ranking loss
14:      loss ← max(0, γ_kg + score_pos - score_neg)
15:      
16:      // Update embeddings
17:      update_embeddings(E_emb, R_emb, loss)
18:   end for
19: end for
20: return E_emb, R_emb
```

The negative sampling strategy in line 10 combines random corruption with adversarial sampling to generate challenging negative examples. The corruption process randomly replaces the head entity, the relation, or the tail entity, and rejects any corrupted triple that is already present in the original knowledge graph.
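The scoring, loss, and random-corruption steps can be sketched in plain Python as follows. This is a minimal illustration: it covers only uniform random corruption (the adversarial sampling component is not shown), it omits the gradient update, and the helper names and toy entities are assumptions. The rejection loop also assumes that at least one corrupted triple is absent from the graph.

```python
import math
import random


def transe_score(h, r, t):
    """L2 distance ||h + r - t||_2; lower means more plausible."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def margin_loss(score_pos, score_neg, margin):
    """Margin-based ranking loss: max(0, margin + score_pos - score_neg)."""
    return max(0.0, margin + score_pos - score_neg)


def corrupt_triple(triple, entities, relations, known, rng):
    """Replace head, relation, or tail at random, rejecting known triples."""
    h, r, t = triple
    while True:
        slot = rng.choice(("head", "relation", "tail"))
        if slot == "head":
            candidate = (rng.choice(entities), r, t)
        elif slot == "relation":
            candidate = (h, rng.choice(relations), t)
        else:
            candidate = (h, r, rng.choice(entities))
        if candidate not in known:  # never emit a true triple as a negative
            return candidate


rng = random.Random(0)
entities = ["a", "b", "c"]
relations = ["r1", "r2"]
known = {("a", "r1", "b")}
negative = corrupt_triple(("a", "r1", "b"), entities, relations, known, rng)
```

The loss is zero whenever the negative triple already scores at least `margin` worse than the positive one, so only violating pairs contribute gradient.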

### Knowledge-State Integration

Integrating knowledge graph embeddings with environment states requires a mapping mechanism that identifies the knowledge relevant to each state and computes an attention-weighted representation of it.

```
Algorithm 1.3: Knowledge-State Integration
Input: State s, knowledge graph G, embeddings E_emb, R_emb
Output: Augmented state representation s'

1: // Map state to relevant entities
2: relevant_entities ← ∅
3: for each entity e in E do
4:    similarity ← compute_similarity(s, e)
5:    if similarity > θ_map then
6:       relevant_entities ← relevant_entities ∪ {e}
7:    end if
8: end for

9: // Compute attention weights
10: attention_scores ← ∅
11: for each entity e in relevant_entities do
12:    score ← neural_network(concat(s, E_emb[e]))
13:    attention_scores[e] ← score
14: end for
15: attention_weights ← softmax(attention_scores)

16: // Aggregate knowledge context
17: knowledge_context ← zeros(d)  // zero vector, so s' has a fixed size even when no entity is relevant
18: for each entity e in relevant_entities do
19:    knowledge_context += attention_weights[e] * E_emb[e]
20: end for

21: // Create augmented state
22: s' ← concat(s, knowledge_context)
23: return s'
```

The similarity computation in line 4 utilizes a learned similarity function that maps state features to entity descriptions using semantic embeddings. The attention mechanism ensures that the most relevant knowledge is emphasized while maintaining differentiability for end-to-end learning.
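The attention and aggregation steps (lines 9 to 22) can be sketched as below. The learned scoring network is replaced by a placeholder dot product, and the entities are assumed to have already passed the relevance threshold. Both are assumptions made for illustration, as are the function names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def integrate_knowledge(state, entity_embeddings, score_fn, dim):
    """Return concat(state, attention-weighted sum of entity embeddings).

    entity_embeddings: embeddings of entities already judged relevant.
    score_fn: stands in for the learned scoring network.
    """
    context = [0.0] * dim  # zero vector when no entity is relevant
    if entity_embeddings:
        weights = softmax([score_fn(state, e) for e in entity_embeddings])
        for w, emb in zip(weights, entity_embeddings):
            for i in range(dim):
                context[i] += w * emb[i]
    return list(state) + context


def dot_score(state, emb):
    """Placeholder scorer: dot product of state and embedding."""
    return sum(s * e for s, e in zip(state, emb))


augmented = integrate_knowledge(
    [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], dot_score, dim=2
)
```

Because the softmax weights sum to one, the knowledge context is a convex combination of the relevant embeddings, and the augmented state always has length `len(state) + dim`.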

## Algorithm 2: Causal Structure Learning with Knowledge Constraints

The causal structure learning module discovers causal relationships between state variables, actions, and rewards while incorporating domain knowledge as constraints. This algorithm combines constraint-based causal discovery with knowledge-guided refinement.

### Variable Selection and Preprocessing

Before applying causal discovery algorithms, the system must select relevant variables and preprocess the data to ensure the assumptions of causal discovery methods are satisfied.

```
Algorithm 2.1: Causal Variable Selection and Preprocessing
Input: State trajectories D = {(s_t, a_t, r_t, s_{t+1})}, knowledge graph G
Output: Processed dataset D_causal, variable set V

1: // Extract relevant variables
2: V_state ← select_state_variables(D, G)
3: V_action ← extract_action_variables(D)
4: V_reward ← extract_reward_components(D)
5: V ← V_state ∪ V_action ∪ V_reward

6: // Preprocess data
7: D_processed ← ∅
8: for each trajectory τ in D do
9:    τ_processed ← handle_missing_values(τ)
10:   τ_processed ← normalize_variables(τ_processed)
11:   τ_processed ← discretize_continuous_variables(τ_processed, V)
12:   D_processed ← D_processed ∪ {τ_processed}
13: end for

14: // Apply temporal constraints
15: temporal_constraints ← extract_temporal_ordering(V)
16: D_causal ← apply_temporal_constraints(D_processed, temporal_constraints)

17: return D_causal, V
```

The variable selection process in lines 2-5 leverages domain knowledge to identify variables that are likely to participate in causal relationships. The discretization step handles continuous variables using adaptive binning methods that preserve causal relationships while enabling the use of discrete causal discovery algorithms.
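As one simple example of adaptive binning, the sketch below uses equal-frequency (quantile) bins, which place cut points where the data are dense. The source does not specify which binning method is used, so this choice and the function names are assumptions.

```python
def quantile_bin_edges(values, n_bins):
    """Return n_bins - 1 interior cut points so bins hold roughly equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]


def discretize(value, edges):
    """Map a value to the index of the first bin whose upper edge exceeds it."""
    for idx, edge in enumerate(edges):
        if value < edge:
            return idx
    return len(edges)


samples = [0.1, 0.2, 0.25, 0.3, 5.0, 5.5, 9.0, 40.0]
edges = quantile_bin_edges(samples, n_bins=4)
codes = [discretize(v, edges) for v in samples]
```

With these skewed samples, equal-width bins would put almost every value in the first bin, whereas the quantile edges `[0.25, 5.0, 9.0]` assign two samples to each of the four bins.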

### Knowledge-Constrained PC Algorithm

The core causal discovery process employs a modified PC algorithm that incorporates knowledge constraints to guide the search process and improve accuracy.

```
Algorithm 2.2: Knowledge-Constrained PC Algorithm
Input: Dataset D_causal, variable set V, knowledge graph G, significance level α
Output: Causal graph C = (V, E_causal)

1: // Initialize complete graph
2: C ← complete_graph(V)
3: knowledge_constraints ← extract_causal_constraints(G, V)

4: // Phase 1: Edge removal based on conditional independence
5: for l = 0 to |V| - 2 do
6:    for each edge (X, Y) in C do
7:       for each subset S ⊆ adjacent(X, C) \ {Y} with |S| = l do
8:          // Test conditional independence
9:          p_value ← conditional_independence_test(X, Y, S, D_causal)
10:         
11:         // Apply knowledge constraints
12:         knowledge_score ← evaluate_knowledge_consistency(X, Y, S, knowledge_constraints)
13:         adjusted_p_value ← p_value / (1 + λ_kc * knowledge_score)  // knowledge support lowers the p-value, protecting the edge
14:         
15:         if adjusted_p_value > α then
16:            remove_edge(C, X, Y)
17:            record_separation_set(X, Y, S)
18:            break
19:         end if
20:      end for
21:   end for
22: end for

23: // Phase 2: Edge orientation
24: C ← orient_edges_with_knowledge(C, knowledge_constraints)
25: C ← apply_orientation_rules(C)

26: return C
```

The knowledge consistency evaluation in line 12 computes a score based on how well the potential causal relationship aligns with the domain knowledge encoded in the knowledge graph. Line 13 uses this score to adjust the p-value of the conditional independence test before it is compared with α. The intent is to make the algorithm more conservative about removing edges that are supported by domain knowledge.
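One way to realize the stated intent is sketched below. It assumes that `knowledge_score` lies in [0, 1], with higher values meaning stronger knowledge support for the edge, and it shrinks the p-value as support grows so that supported edges are harder to remove. The scaling rule, the score range, and the function names are assumptions made for illustration.

```python
def adjusted_p_value(p_value, knowledge_score, lambda_kc):
    """Shrink the p-value for edges that domain knowledge supports.

    knowledge_score is assumed to lie in [0, 1]; 0 means no support.
    """
    if not 0.0 <= knowledge_score <= 1.0:
        raise ValueError("knowledge_score must lie in [0, 1]")
    return p_value / (1.0 + lambda_kc * knowledge_score)


def should_remove_edge(p_value, knowledge_score, lambda_kc, alpha):
    """An edge is removed when independence cannot be rejected at level alpha."""
    return adjusted_p_value(p_value, knowledge_score, lambda_kc) > alpha


# Same test result, different knowledge support.
unsupported = should_remove_edge(0.08, 0.0, lambda_kc=2.0, alpha=0.05)
supported = should_remove_edge(0.08, 1.0, lambda_kc=2.0, alpha=0.05)
```

With a raw p-value of 0.08 and α = 0.05, the unsupported edge is removed, while the fully supported edge is kept because its adjusted p-value falls to about 0.027.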

### Causal Graph Refinement and Validation

After the initial causal discovery, the algorithm refines the learned structure using additional knowledge constraints and validates the results through cross-validation and expert review.

```
Algorithm 2.3: Causal Graph Refinement
Input: Initial causal graph C_init, knowledge graph G, dataset D_causal
Output: Refined causal graph C_refined

1: C_refined ← C_init
2: knowledge_edges ← extract_known_causal_edges(G)
3: discovered_edges ← get_edges(C_init)

4: // Add missing knowledge-supported edges
5: for each edge (X, Y) in knowledge_edges do
6:    if (X, Y) not in discovered_edges then
7:       confidence ← estimate_edge_confidence(X, Y, D_causal)
8:       if confidence > θ_confidence then
9:          add_edge(C_refined, X, Y)
10:         set_edge_weight(C_refined, X, Y, confidence)
11:      end if
12:   end if
13: end for

14: // Remove contradictory edges
15: for each edge (X, Y) in discovered_edges do
16:    if contradicts_knowledge(X, Y, G) then
17:       contradiction_score ← compute_contradiction_score(X, Y, G)
18:       if contradiction_score > θ_contradiction then
19:          remove_edge(C_refined, X, Y)
20:       end if
21:    end if
22: end for

23: // Validate graph properties
24: C_refined ← ensure_acyclicity(C_refined)
25: C_refined ← validate_temporal_ordering(C_refined)

26: return C_refined
```

The edge confidence estimation in line 7 uses bootstrap sampling and cross-validation to assess the reliability of potential causal relationships. The contradiction detection mechanism identifies cases where the data-driven discovery conflicts with established domain knowledge, allowing for careful resolution of these conflicts.
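The `ensure_acyclicity` step in line 24 is not specified further. A minimal sketch of one possible policy is to drop the lowest-confidence edge of each directed cycle until none remain. The policy itself, the edge-weight dictionary, and the helper `find_cycle` are assumptions.

```python
def find_cycle(edges):
    """Return a list of edges forming a directed cycle, or None."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
        graph.setdefault(v, [])
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}
    stack = []  # nodes on the current depth-first path

    def visit(u):
        color[u] = GRAY
        stack.append(u)
        for v in graph[u]:
            if color[v] == GRAY:  # back edge closes a cycle
                path = stack[stack.index(v):] + [v]
                return list(zip(path, path[1:]))
            if color[v] == WHITE:
                found = visit(v)
                if found:
                    return found
        stack.pop()
        color[u] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


def ensure_acyclicity(weights):
    """Drop the lowest-weight edge of each cycle until none remain.

    weights: dict mapping (X, Y) -> confidence of the edge X -> Y.
    """
    kept = dict(weights)
    while True:
        cycle = find_cycle(list(kept))
        if cycle is None:
            return kept
        weakest = min(cycle, key=lambda e: kept[e])
        del kept[weakest]


weights = {("a", "b"): 0.9, ("b", "c"): 0.8, ("c", "a"): 0.3, ("c", "d"): 0.7}
dag = ensure_acyclicity(weights)
```

Here the cycle a → b → c → a is broken by removing the edge c → a, which has the lowest confidence of the three.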

## Algorithm 3: Counterfactual-Based Reward Adjustment

The reward adjustment mechanism forms the core of KARMA's dynamic learning capability. This algorithm computes adjusted rewards by combining original environment rewards with knowledge-based and causality-based components.

### Structural Causal Model Construction

Before computing counterfactual rewards, the algorithm constructs a structural causal model from the learned causal graph and estimates the functional relationships between variables.

```
Algorithm 3.1: Structural Causal Model Construction
Input: Causal graph C, dataset D_causal, variable set V
Output: Structural causal model SCM = {f_i, U_i}

1: SCM ← ∅
2: for each variable X_i in V do
3:    parents ← get_parents(C, X_i)
4:    
5:    // Learn functional relationship
6:    if is_continuous(X_i) then
7:       f_i ← train_regression_model(X_i, parents, D_causal)
8:    else
9:       f_i ← train_classification_model(X_i, parents, D_causal)
10:   end if
11:   
12:   // Estimate noise distribution
13:   residuals ← compute_residuals(f_i, X_i, parents, D_causal)
14:   U_i ← fit_noise_distribution(residuals)
15:   
16:   SCM[X_i] ← {function: f_i, noise: U_i}
17: end for

18: // Validate SCM
19: validation_score ← cross_validate_SCM(SCM, D_causal)
20: if validation_score < θ_scm_quality then
21:    SCM ← refine_SCM(SCM, D_causal)
22: end if

23: return SCM
```

The functional relationship learning in lines 6-10 employs ensemble methods that combine linear models, neural networks, and tree-based methods to capture complex relationships while maintaining interpretability. The noise distribution fitting uses maximum likelihood estimation with support for various distribution families.
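For a single continuous variable with one parent, the fit-then-estimate-noise pattern of lines 6 to 14 can be sketched with ordinary least squares and a Gaussian noise model. This is a deliberately simplified stand-in for the ensemble described above, not the source's method, and the function name and toy data are assumptions.

```python
import statistics


def fit_linear_mechanism(parent_values, child_values):
    """Least-squares fit of child = slope * parent + intercept + noise.

    The noise is modeled as zero-mean Gaussian; its standard deviation
    is the maximum likelihood estimate from the residuals.
    """
    mean_x = statistics.fmean(parent_values)
    mean_y = statistics.fmean(child_values)
    sxx = sum((x - mean_x) ** 2 for x in parent_values)
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(parent_values, child_values)
    )
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [
        y - (slope * x + intercept)
        for x, y in zip(parent_values, child_values)
    ]
    return {
        "slope": slope,
        "intercept": intercept,
        "noise_std": statistics.pstdev(residuals),
    }


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated by y = 2x + 1 without noise
mechanism = fit_linear_mechanism(xs, ys)
```

The returned dictionary plays the role of one `SCM[X_i]` entry: a function of the parents plus a fitted noise distribution.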

### Knowledge-Based Reward Component

The knowledge-based reward component evaluates how well the agent's actions align with the structured domain knowledge, providing guidance based on expert understanding of the task.

```
Algorithm 3.2: Knowledge-Based Reward Computation
Input: State s, action a, next state s', knowledge graph G, embeddings E_emb
Output: Knowledge-based reward R_knowledge

1: // Identify relevant knowledge paths
2: entities_s ← map_state_to_entities(s, G)
3: entities_s_prime ← map_state_to_entities(s', G)
4: knowledge_paths ← find_paths(G, entities_s, entities_s_prime)

5: // Compute path-based rewards
6: path_rewards ← ∅
7: for each path p in knowledge_paths do
8:    path_weight ← compute_path_weight(p, E_emb)
9:    action_relevance ← compute_action_relevance(p, a)
10:   path_reward ← path_weight * action_relevance
11:   path_rewards ← path_rewards ∪ {path_reward}
12: end for

13: // Aggregate path rewards
14: if |path_rewards| > 0 then
15:    R_knowledge ← weighted_average(path_rewards)
16: else
17:    R_knowledge ← 0
18: end if

19: // Apply knowledge confidence weighting
20: confidence ← estimate_knowledge_confidence(s, a, s', G)
21: R_knowledge ← R_knowledge * confidence

22: return R_knowledge
```

The path finding algorithm in line 4 uses breadth-first search with depth limits to identify meaningful connections between state-related entities. The action relevance computation evaluates how well the taken action aligns with the knowledge path using learned action embeddings.
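A depth-limited breadth-first search of this kind can be sketched as follows. The adjacency-dictionary representation, the restriction to simple paths, and the toy entities are assumptions made for illustration.

```python
from collections import deque


def find_paths(adjacency, sources, targets, max_depth):
    """Enumerate simple paths of at most max_depth edges from any source
    entity to any target entity, in breadth-first order."""
    targets = set(targets)
    paths = []
    queue = deque([src] for src in sources)
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in targets and len(path) > 1:
            paths.append(path)
        if len(path) - 1 >= max_depth:
            continue  # depth limit reached, do not expand further
        for nxt in adjacency.get(node, ()):
            if nxt not in path:  # keep paths simple (no revisits)
                queue.append(path + [nxt])
    return paths


adjacency = {
    "door": ["key", "room"],
    "key": ["room"],
    "room": ["exit"],
}
paths = find_paths(adjacency, ["door"], ["room"], max_depth=2)
```

Breadth-first order returns shorter paths first, so the direct connection `door → room` precedes the two-hop path through `key`.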

### Counterfactual Reward Computation

The counterfactual reward component estimates what the reward would have been under alternative actions or conditions, providing a causal perspective on the value of the agent's decisions.

```
Algorithm 3.3: Counterfactual Reward Computation
Input: State s, action a, reward r, next state s', SCM, causal graph C
Output: Counterfactual-based reward R_causal

1: // Identify alternative actions
2: alternative_actions ← get_alternative_actions(a, action_space)
3: counterfactual_rewards ← ∅

4: // Compute counterfactual rewards for alternative actions
5: for each action a' in alternative_actions do
6:    // Perform counterfactual intervention
7:    s_cf, r_cf ← counterfactual_intervention(s, a', SCM, C)
8:    counterfactual_rewards[a'] ← r_cf
9: end for

10: // Compute baseline reward
11: baseline_reward ← compute_baseline(s, counterfactual_rewards)

12: // Estimate actual counterfactual reward
13: (_, actual_cf_reward) ← counterfactual_intervention(s, a, SCM, C)  // keep only the reward

14: // Compute causal advantage
15: R_causal ← actual_cf_reward - baseline_reward

16: // Apply temporal discounting for long-term effects
17: temporal_weight ← compute_temporal_weight(s, a, s')
18: R_causal ← R_causal * temporal_weight

19: return R_causal

function counterfactual_intervention(state, action, SCM, C):
20:    // Set intervention on the action variable A
21:    intervened_vars ← {A}
22:    
23:    // Forward simulation through SCM
24:    cf_state ← state.copy()
25:    cf_state[A] ← action
26:    
27:    for each variable X in topological_order(C) do
28:       if X not in intervened_vars then
29:          parents ← get_parents(C, X)
30:          parent_values ← [cf_state[p] for p in parents]
31:          cf_state[X] ← SCM[X].function(parent_values) + sample(SCM[X].noise)
32:       end if
33:    end for
34:    
35:    cf_reward ← extract_reward(cf_state)
36:    return cf_state, cf_reward
```

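The counterfactual intervention function applies an intervention in the sense of Pearl's do-operator: it fixes the intervened variable and forward-simulates the remaining variables through the structural causal model. Because line 31 samples fresh noise rather than inferring it from the observed transition, the estimate is interventional; a counterfactual in the strict sense would additionally require an abduction step. The baseline can be computed in several ways, for example as the average counterfactual reward or as the maximum alternative reward.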
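The forward simulation and the causal advantage can be sketched on a toy model with four variables. The variable names, the mechanisms, the zero-variance Gaussian noise, and the choice of the average alternative reward as baseline are all assumptions made so that the example is small and deterministic.

```python
import random
from graphlib import TopologicalSorter


def simulate_intervention(values, interventions, parents, mechanisms, rng):
    """Forward-simulate an SCM after fixing the intervened variables.

    values: observed assignment, used for root variables without a mechanism.
    interventions: dict of variable -> forced value (the do-operator).
    parents: dict of variable -> list of parent variables.
    mechanisms: dict of variable -> (function of parent values, noise std).
    """
    result = dict(values)
    result.update(interventions)
    # TopologicalSorter expects node -> predecessors, which matches `parents`.
    for var in TopologicalSorter(parents).static_order():
        if var in interventions or var not in mechanisms:
            continue
        fn, noise_std = mechanisms[var]
        inputs = [result[p] for p in parents[var]]
        result[var] = fn(*inputs) + rng.gauss(0.0, noise_std)
    return result


parents = {
    "position": [],
    "action": [],
    "next_position": ["position", "action"],
    "reward": ["next_position"],
}
mechanisms = {
    "next_position": (lambda pos, act: pos + act, 0.0),
    "reward": (lambda nxt: -abs(nxt - 3.0), 0.0),  # goal at position 3
}
observed = {"position": 1.0, "action": 0.0}
rng = random.Random(0)

taken = simulate_intervention(observed, {"action": 2.0}, parents, mechanisms, rng)
alternatives = [0.0, 1.0]  # actions other than the one taken
alt_rewards = [
    simulate_intervention(observed, {"action": a}, parents, mechanisms, rng)["reward"]
    for a in alternatives
]
baseline = sum(alt_rewards) / len(alt_rewards)
advantage = taken["reward"] - baseline
```

The taken action reaches the goal (reward 0), while the alternatives yield rewards of -2 and -1, so the causal advantage against the average baseline is 1.5.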

### Dynamic Reward Combination

The final step combines the original reward with knowledge-based and causal components using dynamic weights that adapt during training.

```
Algorithm 3.4: Dynamic Reward Combination
Input: Original reward r, knowledge reward R_knowledge, causal reward R_causal, 
       training step t, agent performance metrics
Output: Adjusted reward R'

1: // Compute dynamic weights
2: w_K ← compute_knowledge_weight(t, performance_metrics)
3: w_C ← compute_causal_weight(t, performance_metrics)

4: // Normalize weights
5: total_weight ← 1 + w_K + w_C
6: w_original ← 1 / total_weight
7: w_K ← w_K / total_weight
8: w_C ← w_C / total_weight

9: // Combine rewards
10: R' ← w_original * r + w_K * R_knowledge + w_C * R_causal

11: // Apply clipping to prevent extreme values
12: R' ← clip(R', r_min, r_max)

13: // Update weight adaptation parameters
14: update_weight_parameters(w_K, w_C, performance_metrics)

15: return R'

function compute_knowledge_weight(t, metrics):
16:    // Exponential decay with performance modulation
17:    base_weight ← w_K_0 * exp(-λ_K * t)
18:    performance_factor ← 1 + β_K * (metrics.target_performance - metrics.current_performance)
19:    return base_weight * performance_factor

function compute_causal_weight(t, metrics):
20:    // Saturating exponential growth with confidence modulation
21:    base_weight ← w_C_0 * (1 - exp(-λ_C * t))
22:    confidence_factor ← causal_model_confidence(metrics)
23:    return base_weight * confidence_factor
```

The dynamic weight computation adapts the influence of knowledge and causal components based on training progress and model confidence. Early in training, knowledge receives higher weight, while causal insights become more influential as the causal model becomes more reliable.
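The weight schedules and the normalized combination can be sketched directly from the formulas in Algorithm 3.4. The hyperparameter values in the example and the clamp that keeps the knowledge weight non-negative are assumptions added for illustration.

```python
import math


def knowledge_weight(t, w_k0, lambda_k, beta_k, target_perf, current_perf):
    """Exponentially decaying weight, raised when performance lags the target."""
    base = w_k0 * math.exp(-lambda_k * t)
    factor = 1.0 + beta_k * (target_perf - current_perf)
    return max(0.0, base * factor)  # clamp is an assumption of this sketch


def causal_weight(t, w_c0, lambda_c, confidence):
    """Weight that grows from 0 and saturates at w_c0 * confidence."""
    return w_c0 * (1.0 - math.exp(-lambda_c * t)) * confidence


def combine_rewards(r, r_knowledge, r_causal, w_k, w_c, r_min, r_max):
    """Normalized weighted sum of the three reward terms, then clipped."""
    total = 1.0 + w_k + w_c
    combined = (r + w_k * r_knowledge + w_c * r_causal) / total
    return min(r_max, max(r_min, combined))


early_k = knowledge_weight(0, 1.0, 0.01, 0.5, target_perf=1.0, current_perf=1.0)
late_k = knowledge_weight(500, 1.0, 0.01, 0.5, target_perf=1.0, current_perf=1.0)
early_c = causal_weight(0, 1.0, 0.01, confidence=0.8)
late_c = causal_weight(500, 1.0, 0.01, confidence=0.8)
mixed = combine_rewards(1.0, 2.0, 3.0, w_k=1.0, w_c=1.0, r_min=-10.0, r_max=10.0)
```

With these illustrative settings the knowledge weight falls from 1.0 toward 0 over training while the causal weight rises from 0 toward 0.8, matching the schedule described above.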

## Algorithm 4: Complete KARMA Training Loop

The complete KARMA training algorithm integrates all components into a unified learning framework that alternates between policy learning and model updates.

```
Algorithm 4: KARMA Training Loop
Input: Environment env, initial policy π_0, knowledge graph G, hyperparameters
Output: Trained policy π*, learned causal model C*

1: // Initialize components
2: π ← π_0
3: knowledge_embeddings ← train_knowledge_embeddings(G)
4: causal_model ← initialize_causal_model()
5: SCM ← initialize_SCM()
6: experience_buffer ← ∅

7: for episode = 1 to max_episodes do
8:    // Collect trajectory
9:    trajectory ← ∅
10:   state ← env.reset()
11:   
12:   for step = 1 to max_steps do
13:      // Augment state with knowledge
14:      augmented_state ← integrate_knowledge(state, G, knowledge_embeddings)
15:      
16:      // Select action
17:      action ← π.select_action(augmented_state)
18:      
19:      // Execute action
20:      next_state, reward, done ← env.step(action)
21:      
22:      // Compute adjusted reward
23:      if len(experience_buffer) > min_buffer_size then
24:         R_knowledge ← compute_knowledge_reward(state, action, next_state, G, knowledge_embeddings)
25:         R_causal ← compute_causal_reward(state, action, reward, next_state, SCM, causal_model)
26:         adjusted_reward ← combine_rewards(reward, R_knowledge, R_causal, episode)
27:      else
28:         adjusted_reward ← reward
29:      end if
30:      
31:      // Store experience
32:      experience ← (state, action, adjusted_reward, next_state, done)
33:      trajectory ← trajectory ∪ {experience}
34:      experience_buffer ← experience_buffer ∪ {experience}
35:      
36:      state ← next_state
37:      if done then break
38:   end for
39:   
40:   // Update policy
41:   if len(experience_buffer) > batch_size then
42:      batch ← sample_batch(experience_buffer, batch_size)
43:      π ← update_policy(π, batch)
44:   end if
45:   
46:   // Update causal model periodically
47:   if episode % causal_update_frequency == 0 then
48:      causal_data ← extract_causal_data(experience_buffer)
49:      causal_model ← update_causal_model(causal_model, causal_data, G)
50:      SCM ← update_SCM(SCM, causal_model, causal_data)
51:   end if
52:   
53:   // Update knowledge embeddings periodically
54:   if episode % knowledge_update_frequency == 0 then
55:      knowledge_embeddings ← update_knowledge_embeddings(G, experience_buffer)
56:   end if
57: end for

58: return π, causal_model
```

This complete algorithm shows how KARMA integrates knowledge representation, causal learning, and reward adjustment into a single training framework. The periodic updates of the causal model and the knowledge embeddings let the system keep refining its model of the environment while it learns a policy.

The algorithms presented here provide the detailed implementation foundation for the KARMA framework, enabling researchers to reproduce and extend the work while maintaining the theoretical guarantees established in the main paper.

