# Algorithm 4: Complete KARMA Training Loop

The complete KARMA training algorithm integrates all components into a unified learning framework that alternates between policy learning and model updates.

```
Algorithm 4: KARMA Training Loop
Input: Environment env, initial policy π_0, knowledge graph G, hyperparameters
Output: Trained policy π*, learned causal model C*

1: // Initialize components
2: π ← π_0
3: knowledge_embeddings ← train_knowledge_embeddings(G)
4: causal_model ← initialize_causal_model()
5: SCM ← initialize_SCM()
6: experience_buffer ← ∅

7: for episode = 1 to max_episodes do
8:    // Collect trajectory
9:    trajectory ← ∅
10:   state ← env.reset()
11:   
12:   for step = 1 to max_steps do
13:      // Augment state with knowledge
14:      augmented_state ← integrate_knowledge(state, G, knowledge_embeddings)
15:      
16:      // Select action
17:      action ← π.select_action(augmented_state)
18:      
19:      // Execute action
20:      next_state, reward, done ← env.step(action)
21:      
22:      // Compute adjusted reward
23:      if len(experience_buffer) > min_buffer_size then
24:         R_knowledge ← compute_knowledge_reward(state, action, next_state, G)
25:         R_causal ← compute_causal_reward(state, action, reward, next_state, SCM)
26:         adjusted_reward ← combine_rewards(reward, R_knowledge, R_causal, episode)
27:      else
28:         adjusted_reward ← reward
29:      end if
30:      
31:      // Store experience
32:      experience ← (state, action, adjusted_reward, next_state, done)
33:      trajectory ← trajectory ∪ {experience}
34:      experience_buffer ← experience_buffer ∪ {experience}
35:      
36:      state ← next_state
37:      if done then break
38:   end for
39:   
40:   // Update policy
41:   if len(experience_buffer) > batch_size then
42:      batch ← sample_batch(experience_buffer, batch_size)
43:      π ← update_policy(π, batch)
44:   end if
45:   
46:   // Update causal model periodically
47:   if episode % causal_update_frequency == 0 then
48:      causal_data ← extract_causal_data(experience_buffer)
49:      causal_model ← update_causal_model(causal_model, causal_data, G)
50:      SCM ← update_SCM(SCM, causal_model, causal_data)
51:   end if
52:   
53:   // Update knowledge embeddings periodically
54:   if episode % knowledge_update_frequency == 0 then
55:      knowledge_embeddings ← update_knowledge_embeddings(G, experience_buffer)
56:   end if
57: end for

58: return π, causal_model
```

This complete algorithm demonstrates how KARMA integrates knowledge representation, causal learning, and reward adjustment into a cohesive training framework. The periodic updates of causal models and knowledge embeddings ensure that the system continuously improves its understanding of the environment while learning effective policies.

The algorithms presented here provide the detailed implementation foundation for the KARMA framework, enabling researchers to reproduce and extend the work while maintaining the theoretical guarantees established in the main paper.

