# Algorithm 3: Counterfactual-Based Reward Adjustment

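The reward adjustment mechanism is the core of KARMA's dynamic learning capability. The algorithm computes an adjusted reward by combining the original environment reward with a knowledge-based component and a causality-based component. It proceeds in four stages: structural causal model construction, knowledge-based reward computation, counterfactual reward computation, and dynamic reward combination.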

## Structural Causal Model Construction

Before computing counterfactual rewards, the algorithm constructs a structural causal model from the learned causal graph and estimates the functional relationships between variables.

```
Algorithm 3.1: Structural Causal Model Construction
Input: Causal graph C, dataset D_causal, variable set V
Output: Structural causal model SCM = {f_i, U_i}

1: SCM ← ∅
2: for each variable X_i in V do
3:    parents ← get_parents(C, X_i)
4:    
5:    // Learn functional relationship
6:    if is_continuous(X_i) then
7:       f_i ← train_regression_model(X_i, parents, D_causal)
8:    else
9:       f_i ← train_classification_model(X_i, parents, D_causal)
10:   end if
11:   
12:   // Estimate noise distribution
13:   residuals ← compute_residuals(f_i, X_i, parents, D_causal)
14:   U_i ← fit_noise_distribution(residuals)
15:   
16:   SCM[X_i] ← {function: f_i, noise: U_i}
17: end for

18: // Validate SCM
19: validation_score ← cross_validate_SCM(SCM, D_causal)
20: if validation_score < θ_scm_quality then
21:    SCM ← refine_SCM(SCM, D_causal)
22: end if

23: return SCM
```

The functional relationship learning in lines 6-10 employs ensemble methods that combine linear models, neural networks, and tree-based methods to capture complex relationships while maintaining interpretability. The noise distribution fitting uses maximum likelihood estimation with support for various distribution families.
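As a rough illustration of the per-variable loop in Algorithm 3.1, the sketch below fits one continuous variable with a single parent. It is a deliberate simplification: a plain least-squares line stands in for the ensemble described above, and the noise family is assumed to be Gaussian. The function names and the toy data are hypothetical and are not part of KARMA.

```python
import statistics

def fit_linear_mechanism(parent_values, child_values):
    """Least-squares fit of child = slope * parent + intercept (single parent)."""
    mean_x = statistics.fmean(parent_values)
    mean_y = statistics.fmean(child_values)
    sxx = sum((x - mean_x) ** 2 for x in parent_values)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(parent_values, child_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def fit_gaussian_noise(residuals):
    """Maximum-likelihood Gaussian fit: sample mean and population std."""
    return statistics.fmean(residuals), statistics.pstdev(residuals)

def build_scm_entry(parent_values, child_values):
    """One SCM entry: the fitted mechanism f_i plus the fitted noise U_i."""
    slope, intercept = fit_linear_mechanism(parent_values, child_values)
    residuals = [y - (slope * x + intercept)
                 for x, y in zip(parent_values, child_values)]
    noise_mean, noise_std = fit_gaussian_noise(residuals)
    return {"slope": slope, "intercept": intercept,
            "noise_mean": noise_mean, "noise_std": noise_std}

# Toy data (made up for illustration): child is roughly 2 * parent + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
entry = build_scm_entry(xs, ys)
```

With an intercept in the model, the least-squares residuals average to zero, so the fitted noise is characterized by its standard deviation alone. Discrete variables would need a classifier and a categorical noise model instead.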

## Knowledge-Based Reward Component

The knowledge-based reward component evaluates how well the agent's actions align with the structured domain knowledge, providing guidance based on expert understanding of the task.

```
Algorithm 3.2: Knowledge-Based Reward Computation
Input: State s, action a, next state s', knowledge graph G, embeddings E_emb
Output: Knowledge-based reward R_knowledge

1: // Identify relevant knowledge paths
2: entities_s ← map_state_to_entities(s, G)
3: entities_s_prime ← map_state_to_entities(s', G)
4: knowledge_paths ← find_paths(G, entities_s, entities_s_prime)

5: // Compute path-based rewards
6: path_rewards ← ∅
7: for each path p in knowledge_paths do
8:    path_weight ← compute_path_weight(p, E_emb)
9:    action_relevance ← compute_action_relevance(p, a)
10:   path_reward ← path_weight * action_relevance
11:   path_rewards ← path_rewards ∪ {path_reward}
12: end for

13: // Aggregate path rewards
14: if |path_rewards| > 0 then
15:    R_knowledge ← weighted_average(path_rewards)
16: else
17:    R_knowledge ← 0
18: end if

19: // Apply knowledge confidence weighting
20: confidence ← estimate_knowledge_confidence(s, a, s', G)
21: R_knowledge ← R_knowledge * confidence

22: return R_knowledge
```

The path finding algorithm in line 4 uses breadth-first search with depth limits to identify meaningful connections between state-related entities. The action relevance computation evaluates how well the taken action aligns with the knowledge path using learned action embeddings.
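A depth-limited breadth-first search of the kind described above might look like the following sketch. It assumes the knowledge graph is available as a plain adjacency dictionary and that a path is a list of entity names. The function signature, the `max_depth` parameter, and the example entities are hypothetical.

```python
from collections import deque

def find_paths(graph, sources, targets, max_depth=3):
    """Enumerate simple paths from any source entity to any target entity
    that use at most max_depth edges, in breadth-first order."""
    targets = set(targets)
    paths = []
    queue = deque([src] for src in sources)  # each queue item is a partial path
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in targets and len(path) > 1:
            paths.append(path)
            continue  # stop extending once a target entity is reached
        if len(path) - 1 >= max_depth:
            continue  # depth limit reached
        for neighbor in graph.get(node, ()):
            if neighbor not in path:  # keep paths simple (no repeated entities)
                queue.append(path + [neighbor])
    return paths

# Hypothetical toy knowledge graph.
graph = {
    "door_locked": ["has_key"],
    "has_key": ["door_open"],
    "door_open": ["room_entered"],
}
found = find_paths(graph, ["door_locked"], ["door_open"], max_depth=2)
too_shallow = find_paths(graph, ["door_locked"], ["door_open"], max_depth=1)
```

The depth limit bounds the search cost and filters out long, weakly related chains. The choice of limit is a tuning decision that the text above does not specify.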

## Counterfactual Reward Computation

The counterfactual reward component estimates what the reward would have been under alternative actions or conditions, providing a causal perspective on the value of the agent's decisions.

```
Algorithm 3.3: Counterfactual Reward Computation
Input: State s, action a, reward r, next state s', SCM, causal graph C, action_space
Output: Counterfactual-based reward R_causal

1: // Identify alternative actions
2: alternative_actions ← get_alternative_actions(a, action_space)
3: counterfactual_rewards ← ∅

4: // Compute counterfactual rewards for alternative actions
5: for each action a' in alternative_actions do
6:    // Perform counterfactual intervention
7:    s_cf, r_cf ← counterfactual_intervention(s, a', SCM, C)
8:    counterfactual_rewards[a'] ← r_cf
9: end for

10: // Compute baseline reward
11: baseline_reward ← compute_baseline(s, counterfactual_rewards)

12: // Estimate actual counterfactual reward (second return value of the intervention)
13: _, actual_cf_reward ← counterfactual_intervention(s, a, SCM, C)

14: // Compute causal advantage
15: R_causal ← actual_cf_reward - baseline_reward

16: // Apply temporal discounting for long-term effects
17: temporal_weight ← compute_temporal_weight(s, a, s')
18: R_causal ← R_causal * temporal_weight

19: return R_causal

function counterfactual_intervention(state, action, SCM, C):
20:    // Set intervention on the action variable A: do(A = action)
21:    intervened_vars ← {A}
22:    
23:    // Forward simulation through SCM
24:    cf_state ← state.copy()
25:    cf_state[A] ← action
26:    
27:    for each variable X in topological_order(C) do
28:       if X not in intervened_vars then
29:          parents ← get_parents(C, X)
30:          parent_values ← [cf_state[p] for p in parents]
31:          cf_state[X] ← SCM[X].function(parent_values) + sample(SCM[X].noise)
32:       end if
33:    end for
34:    
35:    cf_reward ← extract_reward(cf_state)
36:    return cf_state, cf_reward
```

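The counterfactual intervention function applies Pearl's do-operator: it fixes the action variable to the chosen value, removes that variable's dependence on its parents, and forward-simulates the remaining variables through the structural causal model in topological order. Because the simulation step samples fresh noise rather than inferring the noise from the observed transition, the result is an interventional estimate; a strict counterfactual would first recover the noise terms from the observed (s, a, r, s') tuple (abduction) and reuse them in the simulation. The baseline computation supports several strategies, including the average counterfactual reward and the maximum alternative reward.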
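For comparison, the sketch below shows the three-step counterfactual procedure (abduction, action, prediction) on a toy model with additive noise, followed by a causal advantage against a mean baseline. It is a minimal sketch, not KARMA's implementation: the variable names, the mechanisms, and the observed values are all hypothetical, and root variables (here `s` and `a`) are assumed to have no learned mechanism.

```python
def abduct_noise(observed, mechanisms, parents, order):
    """Abduction: recover each additive noise term from the observation."""
    noise = {}
    for var in order:
        if var in mechanisms:
            parent_values = [observed[p] for p in parents[var]]
            noise[var] = observed[var] - mechanisms[var](*parent_values)
    return noise

def counterfactual(observed, intervention, mechanisms, parents, order):
    """Return the counterfactual world under do(intervention)."""
    noise = abduct_noise(observed, mechanisms, parents, order)
    cf = dict(observed)
    cf.update(intervention)  # action: overwrite the intervened variables
    for var in order:        # prediction: propagate with the recovered noise
        if var in intervention or var not in mechanisms:
            continue
        parent_values = [cf[p] for p in parents[var]]
        cf[var] = mechanisms[var](*parent_values) + noise[var]
    return cf

def causal_advantage(observed, alternatives, mechanisms, parents, order):
    """Observed-action reward minus the mean reward over alternative actions."""
    actual = counterfactual(observed, {"a": observed["a"]},
                            mechanisms, parents, order)["r"]
    alt_rewards = [counterfactual(observed, {"a": alt},
                                  mechanisms, parents, order)["r"]
                   for alt in alternatives]
    return actual - sum(alt_rewards) / len(alt_rewards)

# Hypothetical toy SCM: s_next = s + a + u_s, r = 2 * s_next + u_r.
order = ["s", "a", "s_next", "r"]
parents = {"s_next": ["s", "a"], "r": ["s_next"]}
mechanisms = {"s_next": lambda s, a: s + a, "r": lambda s_next: 2.0 * s_next}
observed = {"s": 1.0, "a": 1.0, "s_next": 2.5, "r": 4.0}

cf_world = counterfactual(observed, {"a": 3.0}, mechanisms, parents, order)
advantage = causal_advantage(observed, [0.0, 3.0], mechanisms, parents, order)
```

Reusing the abducted noise means that re-running the observed action reproduces the observed reward exactly, which gives a simple sanity check on the simulation. With freshly sampled noise, the same check would hold only in expectation.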

## Dynamic Reward Combination

The final step combines the original reward with knowledge-based and causal components using dynamic weights that adapt during training.

```
Algorithm 3.4: Dynamic Reward Combination
Input: Original reward r, knowledge reward R_knowledge, causal reward R_causal,
       training step t, agent performance metrics performance_metrics,
       clipping bounds r_min, r_max
Output: Adjusted reward R'

1: // Compute dynamic weights
2: w_K ← compute_knowledge_weight(t, performance_metrics)
3: w_C ← compute_causal_weight(t, performance_metrics)

4: // Normalize weights
5: total_weight ← 1 + w_K + w_C
6: w_original ← 1 / total_weight
7: w_K ← w_K / total_weight
8: w_C ← w_C / total_weight

9: // Combine rewards
10: R' ← w_original * r + w_K * R_knowledge + w_C * R_causal

11: // Apply clipping to prevent extreme values
12: R' ← clip(R', r_min, r_max)

13: // Update weight adaptation parameters
14: update_weight_parameters(w_K, w_C, performance_metrics)

15: return R'

function compute_knowledge_weight(t, metrics):
16:    // Exponential decay with performance modulation
17:    base_weight ← w_K_0 * exp(-λ_K * t)
18:    performance_factor ← 1 + β_K * (target_performance - current_performance)
19:    return base_weight * performance_factor

function compute_causal_weight(t, metrics):
20:    // Saturating exponential growth with confidence modulation
21:    base_weight ← w_C_0 * (1 - exp(-λ_C * t))
22:    confidence_factor ← causal_model_confidence(metrics)
23:    return base_weight * confidence_factor
```

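The dynamic weight computation adapts the influence of the knowledge and causal components based on training progress and model confidence. The knowledge weight decays exponentially with the training step t and is scaled up while the agent remains below its target performance. The causal weight starts at zero and rises toward w_C_0, scaled by the confidence in the causal model. After normalization the three weights sum to one, so knowledge dominates the adjustment early in training and causal insights become more influential as the causal model becomes more reliable.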
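The weight schedules and the normalized combination from Algorithm 3.4 can be sketched in a few lines. The default hyperparameter values and clipping bounds below are placeholders chosen for illustration, and the clamp at zero in `knowledge_weight` is an added assumption that guards against a negative weight when performance exceeds the target by a wide margin.

```python
import math

def knowledge_weight(t, current_perf, target_perf,
                     w_k0=1.0, lam_k=0.01, beta_k=0.5):
    """Exponential decay in t, modulated by the gap to target performance."""
    base = w_k0 * math.exp(-lam_k * t)
    factor = 1.0 + beta_k * (target_perf - current_perf)
    return max(0.0, base * factor)  # clamp at zero: an assumption, see lead-in

def causal_weight(t, confidence, w_c0=1.0, lam_c=0.01):
    """Saturating exponential growth toward w_c0, scaled by model confidence."""
    return w_c0 * (1.0 - math.exp(-lam_c * t)) * confidence

def combine(r, r_knowledge, r_causal, w_k, w_c, r_min=-1.0, r_max=1.0):
    """Normalized weighted sum of the three reward terms, then clipping."""
    total = 1.0 + w_k + w_c  # the original reward carries unnormalized weight 1
    mixed = (r + w_k * r_knowledge + w_c * r_causal) / total
    return min(r_max, max(r_min, mixed))

# Early in training: knowledge weight is high, causal weight is zero.
w_k_early = knowledge_weight(0, current_perf=0.5, target_perf=0.5)
w_c_early = causal_weight(0, confidence=0.9)

# Late in training: knowledge weight has decayed, causal weight has grown.
w_k_late = knowledge_weight(1000, current_perf=0.5, target_perf=0.5)
w_c_late = causal_weight(1000, confidence=0.9)

adjusted = combine(0.5, 1.0, -1.0, w_k_early, w_c_early)
```

Because the weights are normalized, the adjusted reward before clipping is a convex combination of the three terms whenever both weights are non-negative, so it stays within the range they span.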

