# Comprehensive Causal Analysis Knowledge Base
*Optimized single-file knowledge base for LLM agents*

═══════════════════════════════════════════════════════════════════════════════
# SECTION I: FOUNDATIONAL CAUSAL INFERENCE CONCEPTS
═══════════════════════════════════════════════════════════════════════════════

## Core Falsification Principles & Limitations

### 1. Falsification is One-Sided
• The test can only invalidate a graph by identifying violations of its implied conditional independence relations
• If the graph passes, this does not confirm correctness; it merely indicates that no contradictions were detected with respect to the tested independences

### 2. Inability to Resolve Markov Equivalence
• Many directed acyclic graphs (DAGs) entail the same set of conditional independence constraints
• The test cannot discriminate between such graphs, since they are indistinguishable on the basis of conditional independence relations alone
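
A minimal sketch of why such graphs cannot be told apart, using the standard criterion that two DAGs are Markov equivalent when they share the same skeleton and the same v-structures. The helper names `skeleton`, `v_structures`, and `markov_equivalent` are illustrative and are not part of DoWhy.

```python
import networkx as nx

def skeleton(dag):
    """Undirected edge set of the DAG."""
    return {frozenset(edge) for edge in dag.edges()}

def v_structures(dag):
    """Colliders a -> c <- b where a and b are not adjacent."""
    found = set()
    for child in dag.nodes():
        parents = sorted(dag.predecessors(child))
        for i, a in enumerate(parents):
            for b in parents[i + 1:]:
                if not dag.has_edge(a, b) and not dag.has_edge(b, a):
                    found.add((frozenset((a, b)), child))
    return found

def markov_equivalent(dag_1, dag_2):
    """Same skeleton and same v-structures."""
    return (skeleton(dag_1) == skeleton(dag_2)
            and v_structures(dag_1) == v_structures(dag_2))

chain = nx.DiGraph([("X", "Y"), ("Y", "Z")])           # X -> Y -> Z
reversed_chain = nx.DiGraph([("Z", "Y"), ("Y", "X")])  # X <- Y <- Z
collider = nx.DiGraph([("X", "Y"), ("Z", "Y")])        # X -> Y <- Z

print(markov_equivalent(chain, reversed_chain))  # True
print(markov_equivalent(chain, collider))        # False
```

The chain and its reversal imply the same single independence (X and Z given Y), so no independence-based test can prefer one over the other.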

### 3. Vulnerability to Unobserved Confounding
• If relevant confounders are missing from the data, the test may erroneously validate a misspecified graph
• Hidden variables can therefore cause an incorrect DAG to appear consistent with the observed independences

### 4. Dependence on Measurement Quality
• The test presumes that the observed variables accurately represent the true causal variables
• When variables are noisy, aggregated, or measured through proxies, the implied independence structure may be distorted, leading to spurious validation or falsification
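
The simulation below is one way to see the proxy problem. It assumes a linear chain X → M → Y with Gaussian noise (all coefficients are made up) and compares the partial correlation of X and Y given the true mediator with the one given a noisy measurement of it. The helper `partial_corr` is illustrative.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing each on z."""
    design = np.column_stack([np.ones_like(z), z])
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
m = x + rng.normal(size=n)          # true mediator
y = m + rng.normal(size=n)          # y depends on x only through m
m_proxy = m + rng.normal(size=n)    # noisy measurement of the mediator

r_true = partial_corr(x, y, m)
r_proxy = partial_corr(x, y, m_proxy)
print(f"given true mediator: {r_true:.3f}")
print(f"given noisy proxy:   {r_proxy:.3f}")
```

With the true mediator the partial correlation is close to zero, as the chain implies. Conditioning on the proxy leaves a clearly non-zero dependence, so a correct chain could be falsified purely because of measurement noise.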

### 5. Benchmarking Against Random Graphs May Be Misleading
• The test includes a comparison against randomly generated DAGs. Even when the resulting p-value falls below the threshold (for example, p < 0.05), indicating that the tested graph performs significantly better than random DAGs, the absolute number of conditional independence violations can still be unacceptably high
• In such cases, the tested graph cannot be trusted in practice despite passing the statistical benchmark

### 6. Limitations of Causal Minimality Checks
• The causal minimality component only suggests edge removals, not additions. For example, a DAG with zero edges would pass the causal minimality test yet it is obviously incorrect for any system with dependencies
• Because removing edges increases the number of implied conditional independences, successive iterations can generate different suggestions for removal
• Consequently, the procedure can be unstable, and reliance solely on minimality tests risks oversimplifying the graph
• In practice, decisions about edge removal should be guided by domain knowledge in conjunction with test results

═══════════════════════════════════════════════════════════════════════════════
# SECTION II: FALSIFY_GRAPH() FUNCTION COMPREHENSIVE DOCUMENTATION
═══════════════════════════════════════════════════════════════════════════════

## Function Overview

The falsify_graph() function is DoWhy's primary tool for testing a causal directed acyclic graph (DAG) against observational data. It checks whether the conditional independences implied by the proposed structure hold in the data and compares the result with randomly permuted baseline DAGs. As Section I explains, a pass means only that no contradiction was detected, not that the graph is correct.

## Function Signature

```python
from dowhy.gcm.falsify import falsify_graph

result = falsify_graph(
    graph,                          # NetworkX DiGraph (required)
    data,                          # pandas DataFrame (required)  
    plot_histogram=False,          # bool (optional)
    suggestions=True,              # bool (optional)
    independence_test=None,        # callable (optional)
    independence_test_kwargs=None, # dict (optional)
    n_permutations=100,           # int (optional)
    max_lag=0,                    # int (optional)
    significance_level=0.05,      # float (optional)
    n_jobs=1,                     # int (optional)
    random_state=None             # int (optional)
)
```

## Core Statistical Concepts

### Markov Conditions
Every causal DAG implies conditional independence relationships that can be empirically tested:

1. **Local Markov Condition**: Each variable is independent of its non-descendants given its parents
2. **Global Markov Condition**: Two sets of variables are conditionally independent given any third set that d-separates them in the DAG
3. **Factorization Condition**: The joint distribution factorizes according to the DAG structure
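
As an illustration of the local Markov condition, the sketch below enumerates the independence statements a small DAG implies. The function name `local_markov_conditions` and the reduced edge set are assumptions made for the example; this is not how DoWhy exposes the conditions.

```python
import networkx as nx

def local_markov_conditions(dag):
    """Return (node, non_descendant, parents) triples implied by the DAG."""
    conditions = []
    for node in dag.nodes():
        parents = set(dag.predecessors(node))
        excluded = nx.descendants(dag, node) | parents | {node}
        for other in sorted(set(dag.nodes()) - excluded):
            conditions.append((node, other, sorted(parents)))
    return conditions

dag = nx.DiGraph([
    ("education", "income"),
    ("income", "health"),
    ("age", "income"),
])
conditions = local_markov_conditions(dag)
for node, other, parents in conditions:
    print(f"{node} independent of {other} given {parents}")
```

Each printed statement is a testable claim; a falsification test counts how many of them the data contradict. Symmetric pairs (for example education/age and age/education) appear twice in this simple enumeration.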

### Statistical Tests
- **Continuous Variables**: Partial correlation tests using Fisher's z-transformation
- **Discrete Variables**: Chi-square tests of conditional independence
- **Mixed Variables**: Kernel-based tests or regression-based approaches
- **Robust Options**: Rank-based tests for non-parametric alternatives

## Key Parameters Explained

**graph** (NetworkX DiGraph, required)
- Your proposed causal DAG as a NetworkX directed graph
- Node names must exactly match DataFrame column names
- Must be acyclic (no directed cycles)
- Edge direction represents causal direction (X → Y means X causes Y)

**data** (pandas DataFrame, required)  
- Observational dataset with all graph variables as columns
- Minimum recommended: 100 observations
- Missing values should be handled beforehand
- Variables can be continuous, discrete, or mixed

**plot_histogram** (bool, default=False)
- Generates matplotlib visualization of permutation test results
- Shows how your DAG compares to random alternatives
- Useful for visual assessment of DAG quality

**suggestions** (bool, default=True)
- Enables actionable improvement recommendations
- Analyzes violations and suggests specific modifications
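- Covers edge removals from the causal minimality check plus hints about possible unobserved confounders or mediators; the minimality check itself never proposes adding edges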

**n_permutations** (int, default=100)
- Number of random DAG permutations for baseline comparison
- Higher values provide more reliable p-value estimates
- Computational cost scales linearly
- Recommended range: 50-500

**significance_level** (float, default=0.05)
- Statistical significance threshold for independence tests
- Standard values: 0.01 (strict), 0.05 (standard), 0.10 (lenient)
- Lower values require stronger evidence before an implied independence is counted as violated

## Output Structure

The function returns comprehensive results including:

### Text Summary Format
```
Graph Falsification Results:
============================

Markov Equivalence Class Analysis:
- The given DAG is [informative/not informative] because X / Y permutations lie in the same Markov equivalence class

Local Markov Condition (LMC) Violations:
- The given DAG violates X/Y LMCs (Z.Z%)
- Variables with violations: [variable_names]

Permutation Comparison:
- Your DAG is better than X% of permuted DAGs (p-value: Y.YY)
- Conclusion: [We reject/do not reject the DAG]

Suggestions for Improvement:
============================
Causal Minimality:
- Remove edge: X → Y (reason: no conditional dependence found)

Causal Sufficiency:  
- Consider unobserved confounder between X and Y
- Add mediating variable: Z

Edge Strength Analysis:
- Strong edges (p < 0.01): [list with p-values]
- Weak edges (p ≥ 0.05): [list with p-values]
```

## Interpretation Guidelines

### Markov Equivalence Class Results

**Informative vs Non-informative DAGs**
- **Highly Informative (< 5% in same class)**: Very specific causal structure with strong identifying restrictions
- **Moderately Informative (5-20%)**: Reasonable specificity with identifying power  
- **Low Informativeness (20-50%)**: Limited ability to distinguish from alternatives
- **Non-informative (> 50%)**: Statistically equivalent to many random alternatives

### LMC Violation Rates
- **Low Violations (< 10%)**: Excellent DAG-data consistency
- **Moderate Violations (10-25%)**: Acceptable performance, minor adjustments may help
- **High Violations (25-50%)**: Significant inconsistency, major revisions needed
- **Severe Violations (> 50%)**: DAG likely fundamentally incorrect
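
For agents that need to apply these bands programmatically, a small helper along the following lines may be convenient. The function name and the handling of exact boundary values (for example, treating exactly 50% as "high") are assumptions, since the bands above do not specify them.

```python
def classify_violation_rate(rate):
    """Map an LMC violation rate in [0, 1] to the bands listed above."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be a fraction between 0 and 1")
    if rate < 0.10:
        return "low"
    if rate < 0.25:
        return "moderate"
    if rate <= 0.50:
        return "high"
    return "severe"

# Example: 3 violated conditions out of 11 tested
print(classify_violation_rate(3 / 11))
```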

### Permutation Test P-values
- **p < 0.01**: Strong evidence DAG is better than random
- **0.01 ≤ p < 0.05**: Moderate evidence for DAG quality
- **0.05 ≤ p < 0.10**: Weak evidence, interpret cautiously
- **p ≥ 0.10**: No significant difference from random (concerning)

### Edge Strength Classification
- **Strong Edges (p < 0.01)**: High confidence in causal relationship
- **Moderate Edges (0.01 ≤ p < 0.05)**: Reasonable evidence
- **Weak Edges (0.05 ≤ p < 0.10)**: Limited statistical support
- **Non-significant (p ≥ 0.10)**: Little evidence for direct causality

## Example Usage Patterns

### Basic Example
```python
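import networkx as nx
import numpy as np
import pandas as pd
from dowhy.gcm.falsify import falsify_graph

# Create simple DAG
graph = nx.DiGraph()
graph.add_edges_from([
    ('education', 'income'),
    ('education', 'health'),
    ('income', 'health'),
    ('age', 'income'),
    ('age', 'health')
])

# Synthetic stand-in data (illustrative only); replace with your own DataFrame
# whose column names match the graph's node names
rng = np.random.default_rng(0)
n = 500
age = rng.normal(size=n)
education = rng.normal(size=n)
income = 0.8 * education + 0.5 * age + rng.normal(size=n)
health = 0.4 * education + 0.6 * income - 0.3 * age + rng.normal(size=n)
data = pd.DataFrame({'age': age, 'education': education,
                     'income': income, 'health': health})

# Run falsification (graph and data passed positionally)
result = falsify_graph(
    graph,
    data,
    suggestions=True,
    plot_histogram=True,
    n_permutations=200
)

print(result)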
```

### Advanced Configuration
```python
# Custom independence test
def custom_test(X, Y, Z, data):
    # Implementation here
    return test_statistic, p_value

# Advanced usage
result = falsify_graph(
    graph=complex_graph,
    data=large_dataset,
    independence_test=custom_test,
    n_permutations=500,
    significance_level=0.01,
    n_jobs=-1,  # Use all CPU cores
    random_state=42
)

# Extract specific results
violation_rate = result.numerical_results['lmc_violations']['violation_rate']
p_value = result.numerical_results['permutation_test']['p_value']
```

## Data Requirements

### Sample Size Guidelines
- **Minimum**: 100 observations for basic reliability
- **Recommended**: 500+ observations for stable results  
- **Rule of thumb**: At least 20 observations per variable
- **High-dimensional**: 1000+ observations for >10 variables

### Data Quality
- **Missing Data**: Handle via complete case analysis or imputation
- **Variable Types**: Continuous, discrete, or mixed supported
- **Distributions**: No strict requirements but heavy skewness may affect power
- **Outliers**: Consider robust methods or careful outlier treatment

### Graph Construction
- **Node Names**: Must exactly match DataFrame column names (case-sensitive)
- **Edge Direction**: Should reflect temporal ordering and causal mechanisms
- **Acyclicity**: Verify with nx.is_directed_acyclic_graph(graph)
- **Theoretical Grounding**: Base structure on domain knowledge
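
The checks above can be bundled into a small pre-flight helper, sketched below. The name `check_inputs` and the wording of the messages are illustrative; only `nx.is_directed_acyclic_graph` and standard pandas calls are used.

```python
import networkx as nx
import pandas as pd

def check_inputs(graph, data):
    """Return a list of problems found; an empty list means the basic checks passed."""
    problems = []
    if not nx.is_directed_acyclic_graph(graph):
        problems.append("graph contains a directed cycle")
    # Node names are compared case-sensitively against column names
    missing_columns = sorted(set(graph.nodes()) - set(data.columns))
    if missing_columns:
        problems.append(f"nodes without a matching column: {missing_columns}")
    shared = [col for col in data.columns if col in graph]
    n_null = int(data[shared].isna().sum().sum())
    if n_null:
        problems.append(f"{n_null} missing values in graph variables")
    return problems

graph = nx.DiGraph([("age", "income"), ("income", "health")])
data = pd.DataFrame({
    "age": [30, 41, 52],
    "Income": [1.0, 2.0, 3.0],     # wrong case on purpose
    "health": [0.7, None, 0.9],    # one missing value on purpose
})
problems = check_inputs(graph, data)
print(problems)
```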

## Troubleshooting Common Issues

### Low Sample Size Warnings
```python
# Check if sample size is adequate
n_vars = len(graph.nodes())
min_recommended = n_vars * 20
if len(data) < min_recommended:
    print(f"Consider increasing sample size to {min_recommended}+")
```

### High Violation Rates
- Check for missing confounders
- Verify correct edge directions
- Consider additional mediating variables
- Review temporal ordering
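
The temporal-ordering review can be partly mechanized. The sketch below drops candidate edges whose cause would occur after its effect. The tier assignments and the candidate edges are illustrative assumptions built from Section III variable names, not validated causal claims, and `temporally_valid_edges` is a hypothetical helper.

```python
import networkx as nx

# Hypothetical time tiers inferred from variable-name prefixes (smaller = earlier)
time_tier = {
    "w3_capacity_ask": 0,
    "w1_capacity_ask": 1,
    "w1_cap_target": 1,
    "d1_caps": 2,
    "capped_out_hours": 3,
}

def temporally_valid_edges(candidate_edges, tier):
    """Keep only edges whose cause is not later than its effect."""
    return [(u, v) for u, v in candidate_edges if tier[u] <= tier[v]]

candidates = [
    ("w3_capacity_ask", "w1_capacity_ask"),
    ("w1_capacity_ask", "w1_cap_target"),
    ("w1_cap_target", "d1_caps"),
    ("d1_caps", "capped_out_hours"),
    ("capped_out_hours", "w1_cap_target"),   # effect precedes cause: dropped
]
dag = nx.DiGraph(temporally_valid_edges(candidates, time_tier))
print(sorted(dag.edges()))
print(nx.is_directed_acyclic_graph(dag))
```

Edges within the same tier are allowed here, so the acyclicity check is still needed after filtering.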

### Non-significant Permutation Tests
- Increase n_permutations for more stable p-values
- Check if DAG is overly complex for sample size
- Consider simpler model structures
- Verify data quality and preprocessing

## Best Practices

### Iterative Model Development
1. Start with domain knowledge DAG
2. Run falsify_graph() with suggestions=True
3. Evaluate suggestions against domain expertise
4. Modify DAG based on statistical and theoretical evidence
5. Re-test until acceptable performance

### Cross-validation
```python
import numpy as np
from dowhy.gcm.falsify import falsify_graph
from sklearn.model_selection import train_test_split

def validate_dag_robustness(graph, data, n_splits=5):
    """Re-run falsification on repeated random 20% subsamples of the data."""
    results = []
    for i in range(n_splits):
        # Only the held-out 20% subsample is used; the remainder is discarded
        _, test_data = train_test_split(data, test_size=0.2, random_state=i)
        test_result = falsify_graph(graph, test_data, suggestions=False)
        results.append(test_result.numerical_results['lmc_violations']['violation_rate'])

    return np.mean(results), np.std(results)

mean_vr, std_vr = validate_dag_robustness(my_graph, my_data)
print(f"Mean violation rate: {mean_vr:.2%} ± {std_vr:.2%}")

### Performance Optimization
- Use n_jobs=-1 for parallel processing on large datasets
- Start with fewer permutations (n_permutations=50) for initial exploration
- Increase to 200-500 for final validation
- Monitor memory usage with high-dimensional data

## Advanced Topics

### Custom Independence Tests
Create domain-specific tests by implementing the required signature:

```python
def domain_specific_test(X, Y, Z, data):
    """
    Custom conditional independence test
    
    Parameters:
    X, Y: str - Variable names  
    Z: list - Conditioning variables
    data: DataFrame - Dataset
    
    Returns:
    tuple - (test_statistic, p_value)
    """
    # Implementation details
    return stat, p_val
```

### Time Series Applications
```python
# Time series DAG with lags
ts_graph = nx.DiGraph()
ts_graph.add_edges_from([
    ('gdp_t', 'unemployment_t'),
    ('gdp_t-1', 'gdp_t'),  # Lagged effect
    ('interest_rate_t', 'gdp_t')
])

result = falsify_graph(
    graph=ts_graph,
    data=time_series_data,
    max_lag=2  # Allow 2-period lags
)
```

### Handling Mixed Variable Types
The function automatically adapts to different variable types:
- **Continuous**: Uses partial correlation or regression-based tests
- **Categorical**: Employs chi-square or Fisher's exact tests  
- **Ordinal**: Treats as continuous if many levels, categorical if few
- **Binary**: Special handling for rare events

## Statistical Foundation Details

### Conditional Independence Testing
The core of DAG falsification relies on testing conditional independence statements of the form X ⊥ Y | Z, meaning X is independent of Y given Z.

**For Continuous Variables:**
- Partial correlation coefficient: ρ(X,Y|Z)
- Fisher's z-transformation for hypothesis testing
- Test statistic: z = 0.5 * ln((1+ρ)/(1-ρ)) * √(n-|Z|-3)
- Null hypothesis: ρ(X,Y|Z) = 0
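
A minimal sketch of this test, assuming linear relationships and roughly Gaussian noise. The partial correlation is obtained from regression residuals, and the two-sided p-value uses the standard normal tail. The function name `fisher_z_test` is illustrative and this is not DoWhy's implementation.

```python
import math
import numpy as np

def fisher_z_test(x, y, z=None):
    """Partial-correlation CI test; returns (z_statistic, two_sided_p_value)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    if z is None:
        cond = np.empty((n, 0))
    else:
        cond = np.asarray(z, float).reshape(n, -1)
    design = np.column_stack([np.ones(n), cond])
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    rho = float(np.corrcoef(res_x, res_y)[0, 1])
    rho = max(min(rho, 0.999999), -0.999999)   # keep the log finite
    stat = 0.5 * math.log((1 + rho) / (1 - rho)) * math.sqrt(n - cond.shape[1] - 3)
    p_value = math.erfc(abs(stat) / math.sqrt(2))  # two-sided normal tail
    return stat, p_value

# Simulated chain x -> m -> y (made-up coefficients)
rng = np.random.default_rng(1)
x = rng.normal(size=2000)
m = x + rng.normal(size=2000)
y = m + rng.normal(size=2000)
print(fisher_z_test(x, y))      # marginal: strongly dependent
print(fisher_z_test(x, y, m))   # given m: consistent with independence
```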

**For Discrete Variables:**
- Conditional contingency tables
- Chi-square test of independence within strata defined by Z
- Combines evidence across strata using meta-analysis methods
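
One simple way to combine strata, shown below as an assumption rather than as the library's method, is to sum the per-stratum chi-square statistics and degrees of freedom. The sketch uses `scipy.stats.chi2_contingency` and `pandas.crosstab`; the function name `stratified_chi2_test` and the simulated data are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2, chi2_contingency

def stratified_chi2_test(data, x, y, z):
    """Sum per-stratum chi-square statistics and degrees of freedom."""
    total_stat, total_dof = 0.0, 0
    for _, stratum in data.groupby(z):
        table = pd.crosstab(stratum[x], stratum[y])
        if table.shape[0] < 2 or table.shape[1] < 2:
            continue  # stratum carries no information about dependence
        stat, _, dof, _ = chi2_contingency(table, correction=False)
        total_stat += stat
        total_dof += dof
    p_value = float(chi2.sf(total_stat, total_dof)) if total_dof > 0 else 1.0
    return total_stat, total_dof, p_value

# x and y both depend on z but are independent given z
rng = np.random.default_rng(0)
n = 4000
z = rng.integers(0, 2, size=n)
x = (rng.random(n) < 0.3 + 0.4 * z).astype(int)
y = (rng.random(n) < 0.3 + 0.4 * z).astype(int)
df = pd.DataFrame({"x": x, "y": y, "z": z, "y_copy": x})

independent = stratified_chi2_test(df, "x", "y", ["z"])
dependent = stratified_chi2_test(df, "x", "y_copy", ["z"])
print(independent)
print(dependent)
```

Sparse strata make the chi-square approximation unreliable, so this approach degrades quickly as the conditioning set grows.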

**For Mixed Types:**
- Regression-based approaches
- Kernel methods for complex relationships
- Mutual information-based tests

### Permutation Testing Framework
1. **Null Hypothesis**: Your DAG is no better than a random DAG with same density
2. **Test Procedure**: 
   - Generate n_permutations random variable orderings
   - Construct DAGs with same number of edges but random structure
   - Compute LMC violation rate for each permuted DAG
   - Compare your DAG's score to this null distribution
3. **P-value Calculation**: Proportion of permuted DAGs that perform at least as well as yours (no more violations); a small value means few random alternatives match your DAG
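
The procedure can be sketched as follows. This version permutes node labels, which keeps the number of edges fixed, and counts ties against the given DAG; both choices are assumptions. The score function is a toy stand-in for a real violation count, and `permutation_p_value` and `backwards_edges` are hypothetical names.

```python
import random
import networkx as nx

def permutation_p_value(dag, score_fn, n_permutations=100, seed=0):
    """Compare score_fn(dag) with scores of node-permuted copies (lower is better)."""
    rng = random.Random(seed)
    nodes = list(dag.nodes())
    given_score = score_fn(dag)
    permuted_scores = []
    for _ in range(n_permutations):
        shuffled = nodes[:]
        rng.shuffle(shuffled)
        permuted = nx.relabel_nodes(dag, dict(zip(nodes, shuffled)))
        permuted_scores.append(score_fn(permuted))
    at_least_as_good = sum(score <= given_score for score in permuted_scores)
    return given_score, at_least_as_good / n_permutations

# Toy score: number of edges that contradict a reference ordering
order = {"a": 0, "b": 1, "c": 2, "d": 3}

def backwards_edges(graph):
    return sum(order[u] > order[v] for u, v in graph.edges())

dag = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "d")])
score, p_value = permutation_p_value(dag, backwards_edges, n_permutations=200)
print(score, p_value)
```

In a real run, `score_fn` would count violated local Markov conditions using a conditional independence test on the data.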

### Score Aggregation Methods
Multiple approaches for combining individual test results:
- **Simple Average**: Mean p-value across all tests
- **Weighted Average**: Weight by degrees of freedom or effect sizes
- **Fisher's Method**: Combine p-values using -2Σln(pi)
- **Stouffer's Method**: Combine z-scores with optional weights
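
Two of these combiners can be written with the standard library alone, as sketched below. Both assume independent tests and p-values strictly between 0 and 1; the function names are illustrative. Fisher's method uses the closed-form chi-square tail for even degrees of freedom, and Stouffer's method treats the inputs as one-sided p-values.

```python
import math
from statistics import NormalDist

def fisher_combined_p(p_values):
    """Fisher's method: -2 * sum(ln p_i) ~ chi-square with 2k degrees of freedom."""
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)   # equals statistic / 2
    # Closed-form chi-square survival function for even degrees of freedom (2k)
    return math.exp(-half_stat) * sum(half_stat ** i / math.factorial(i) for i in range(k))

def stouffer_combined_p(p_values, weights=None):
    """Stouffer's method on one-sided p-values, with optional weights."""
    weights = weights or [1.0] * len(p_values)
    normal = NormalDist()
    z_scores = [normal.inv_cdf(1.0 - p) for p in p_values]
    combined = sum(w * z for w, z in zip(weights, z_scores)) / math.sqrt(sum(w * w for w in weights))
    return 1.0 - normal.cdf(combined)

p_values = [0.04, 0.20, 0.01]
print(fisher_combined_p(p_values))
print(stouffer_combined_p(p_values))
```

Independence tests run on the same dataset are usually correlated, so combined p-values of this kind are best read as rough summaries.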

═══════════════════════════════════════════════════════════════════════════════
# SECTION III: DOMAIN-SPECIFIC COH KNOWLEDGE
═══════════════════════════════════════════════════════════════════════════════

## Capped Hours Analytics Summary

### Key Concepts:
- **Capped Out Hours (COH)**: Critical metric for last-mile delivery performance indicating when delivery capacity constraints are reached
- **Station Performance**: Analysis focuses on delivery station operational efficiency and capacity utilization
- **Capacity Management**: Relationship between planned capacity targets and actual operational utilization
- **Root Cause Analysis**: Statistical methods to identify performance drivers and bottlenecks

### Data Quality Notes:
- **Panel Structure**: (station_code × ofd_date) observations
- **Missing Data**: Some backlog variables may have null values
- **Sample Size**: Recommend 100+ observations per station for reliable analysis
- **Seasonality**: Account for peak delivery periods and weather patterns

### Statistical Framework:
- Panel data structure: (station_code × ofd_date) observations
- Non-linear relationships are present
- Potential confounding from unmeasured operational factors
- Temporal lag effects between upstream causes and COH outcomes
- Heterogeneous treatment effects across different station types
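
Because the data form a station-by-date panel, lagged features should be computed within each station. The sketch below shows one way to do this with pandas. All values are made up, `DXX1` is a hypothetical station code, and `total_backlog_lag1` is a hypothetical column name.

```python
import pandas as pd

panel = pd.DataFrame({
    "station_code": ["DYN9", "DYN9", "DYN9", "DXX1", "DXX1", "DXX1"],
    "ofd_date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"] * 2),
    "total_backlog": [120, 80, 200, 10, 0, 35],          # made-up values
    "capped_out_hours": [1.5, 0.0, 4.0, 0.0, 0.0, 0.5],  # made-up values
})

panel = panel.sort_values(["station_code", "ofd_date"]).reset_index(drop=True)
# Shift within each station so a lag never crosses station boundaries
panel["total_backlog_lag1"] = panel.groupby("station_code")["total_backlog"].shift(1)
print(panel)
```

The first observation of every station has no lagged value; those rows need to be dropped or imputed before running independence tests.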

## Last Mile Acronyms Summary

### Core Operations:
- **COH**: Capped Out Hours - Critical delivery capacity metric
- **OFD**: Out For Delivery - Package status indicator
- **SLA**: Service Level Agreement - Performance standards
- **KPI**: Key Performance Indicator - Measurement metrics
- **OTD**: On Time Delivery - Timeliness measure
- **DPMO**: Defects Per Million Opportunities - Quality metric

### Capacity & Workforce:
- **DSP**: Delivery Service Partner - External delivery contractors
- **DA**: Delivery Associate - Front-line delivery personnel
- **WFM**: Workforce Management - Staffing optimization
- **PPH**: Packages Per Hour - Productivity measure
- **GPH**: Gallons Per Hour - Fuel efficiency metric
- **TOM**: Target Operating Model - Operational framework

### Systems & Technology:
- **TMS**: Transportation Management System
- **WMS**: Warehouse Management System
- **ETA**: Estimated Time of Arrival
- **GPS**: Global Positioning System
- **API**: Application Programming Interface
- **EDI**: Electronic Data Interchange

### Quality & Performance:
- **CSAT**: Customer Satisfaction Score
- **NPS**: Net Promoter Score
- **FCR**: First Call Resolution
- **MTTR**: Mean Time To Resolution
- **SPC**: Statistical Process Control
- **RCA**: Root Cause Analysis

## Integration Notes for Causal Analysis:
- These documents provide essential domain context for COH causal modeling
- Variable names in datasets correspond to operational metrics described here
- Acronyms help interpret data columns and measurement concepts
- Analytics framework guides statistical modeling approach and variable selection
- Understanding operational context is crucial for proper causal graph specification
- Domain knowledge helps identify potential confounders and mediating variables

# COH Dataset Schema 
*North America Capped Out Hours Dataset - Comprehensive Variable Reference*

## Dataset Overview
- **Purpose**: Analyze Capped Out Hours (COH) in Amazon delivery stations
- **Granularity**: Station-day level observations
- **Geography**: North America (US & Canada)
- **Time Range**: Daily observations with weekly aggregations
- **Key Outcome**: `capped_out_hours` - when delivery capacity constraints are reached

# COH Variable Categories & Causal Relationships

### 1. TEMPORAL DIMENSIONS
**Primary Time Variables:**
- `ofd_date` - Out for Delivery Date (YYYY-MM-DD)
- `ofd_week` - Out for Delivery Week (YYYY-WW)
- `reporting_year_month` - Reporting Year-Month (YYYYMM)

**Derived Time Variables:**
- `month`, `quarter`, `quarter_name` - Standard calendar periods
- `cycle` - ATROPS operational cycle (e.g., CYCLE_1)

### 2. GEOGRAPHIC & ORGANIZATIONAL HIERARCHY
**Geographic Identifiers:**
- `country_code` - US or CA
- `station_code` - Unique delivery station identifier (e.g., DYN9)
- `region` - Local delivery region (e.g., "The Bronx")
- `msa` - Metropolitan Statistical Area
- `regionalized_region` - 10 standardized US regions (e.g., NorthEast)

**Leadership Hierarchy:**
- `location_super_regional_leader` - L10/L8 leader name
- `org_utr_regional_leader` - L8/L7 UTR operations leader  
- `location_sub_super_regional_leader` - Regional Director (RD)

### 3. CORE OUTCOME & CAPACITY METRICS

**PRIMARY OUTCOME (Base Variable):**
- `capped_out_hours` - Station's raw COH (continuous, hours when capacity constraints hit)

**Base Capacity Metrics:**
- `rolling_21_day_caps` - Sum of station's latest capacity (current + last 2 weeks)
- `d1_caps` - Latest capacity (minimum between mechanical, OTR and UTR capacity)
- `w1_caps` - Previous week capacity
- `daily_updated_cap_target` - DUCT signal (capacity target)

**Base Capacity Types:**
- `d1_utr` - Latest UTR (Under The Roof) capacity. The maximum demand that can be handled by the workforce at the station
- `d1_otr` - Latest OTR (Over The Road) capacity. The maximum demand that can be handled by the drivers who deliver the stowed packages (UTR output).
- `d1_mech` - Latest mechanical capacity. The maximum demand that can be handled by the UTR machineries (e.g. conveyor belts).
- `w1_utr`, `w1_otr` - Previous week UTR and OTR capacities
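
Since `d1_caps` is described as the minimum of the three capacity types, the relationship can be checked directly in the data, as in the sketch below. Station codes and capacity values are made up, and `d1_caps_check` and `binding_capacity` are hypothetical column names.

```python
import pandas as pd

stations = pd.DataFrame({
    "station_code": ["S1", "S2", "S3"],     # hypothetical station codes
    "d1_utr": [5000, 4200, 6100],           # made-up capacities
    "d1_otr": [4800, 4500, 6100],
    "d1_mech": [5200, 4000, 7000],
})

capacity_cols = {"d1_utr": "UTR", "d1_otr": "OTR", "d1_mech": "Mech"}
stations["d1_caps_check"] = stations[list(capacity_cols)].min(axis=1)
# idxmin returns the first column on ties (S3: UTR and OTR are equal)
stations["binding_capacity"] = stations[list(capacity_cols)].idxmin(axis=1).map(capacity_cols)
print(stations)
```

A variable that is a deterministic function of others deserves care in the causal graph, because conditional independence tests can behave poorly when one variable is fully determined by its parents.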

### 4. VOLUME & UTILIZATION METRICS

**Base Volume Indicators:**
- `latest_slammed_volume` - Latest volume processed 
- `latest_tva` - Latest Total Volume Available
- `w1_tva` - Previous week Total Volume Available

**Base Utilization:**
- `latest_utilization` - Latest Capacity Utilization (0-1 scale)

### 5. FORECASTING & PLANNING METRICS

**Planning Variables:**
- `w1_capacity_ask` - Previous week capacity request (based on demand forecast)
- `w3_capacity_ask` - Capacity requested 3 weeks ago (based on demand forecast)
- `w1_cap_target` - Previous week official capacity target. Equals `w1_capacity_ask` plus a buffer to cover possible higher demand


### 6. CONSTRAINT & BACKLOG INDICATORS

**Base Backlogs (KEY CAUSAL VARIABLES):**
- `upstream_backlog` - Packages in transit to station
- `instation_backlog` - Packages at station but not processed
- `total_backlog` - Combined upstream + instation backlog
- `vbl_eod` - Virtual backlog (promises pushed due to capacity constraints)


**Base Constraint Types:**
- `d1_constraint` - Primary constraint type (UTR/OTR/Mech)
- `backlog_flag` - Binary indicator of backlog presence

### 7. WEATHER & EXTERNAL FACTORS

**Base Weather Signals:**
- `weather_signal` - Weather event indicator
- `weather_tier` - Weather event severity classification from 0 (no event) to 5 (disruptive event)
- `ofd_weather_flag` - Binary weather flag for OFD
- `prior_3_ofd_weather_flag` - Weather in previous 3 days

**Base Weather History:**
- `previous_ofd_weather_signal` - Weather 1 day prior
- `previous_ofd_weather_signal_2` - Weather 2 days prior  
- `previous_ofd_weather_signal_3` - Weather 3 days prior

### 8. OPERATIONAL EXCLUSIONS & FLAGS

**Base Exclusion Indicators:**
- `cf_exclusion_flag` - Central Flow flex up exclusion
- `co_exclusion_flag` - Central Ops flex up exclusion
- `exclusion_reason` - Reason for exclusion (e.g., "Virtual Node")

**Base Capacity Comparisons:**
- `utr_below_pascal` - UTR below Central Ops recommendation
- `otr_below_pascal` - OTR below Central Ops recommendation
- `latest_caps_vs_duct` - Whether caps met 99% of DUCT signal

### 9. TACTICAL CAPACITY CHANGES

**Base Capacity Adjustments:**
- `caps_change` - Change in caps between W-1 and OFD
- `primary_main_reason` - Specific reason for tactical change
- `bucketed_primary_reason` - Aggregated change reason
- `manual_cap_down` - Whether caps manually reduced

### 10. ROOT CAUSE ATTRIBUTION (V3 LOGIC)

**Primary Root Cause:**
- `main_constraint` - COH root cause (e.g., "Mech Cap", "Weather")
- `main_constraint_bucket` - Aggregated cause category
- `main_constraint_derived` - System-derived root cause

**Weekly Aggregations:**
- `dominant_weekly_main_constraint` - Most frequent weekly constraint
- `dominant_weekly_main_constraint_bucket` - Weekly constraint category

**Manual Overrides:**
- `main_constraint_overridden` - User-overridden root cause
- `main_constraint_override_flag` - Whether manually overridden
- `main_constraint_override_user` - Who made override
- `main_constraint_override_timestamp` - When override occurred

### 11. AGGREGATED METRICS

**Station-Level Aggregations:**
- `coh_wk` - Average COH for the week
- `days_in_week` - Operating days in OFD week
- `coh_x_cap` - COH weighted by capacity
- `w_cap` - Weekly capacity sum
- `w_coh_x_cap` - Weekly COH × capacity sum

**Country-Level Context:**
- `country_rolling_21_day_caps` - National capacity total
- `country_coh` - Station's COH contribution to country
- `country_w_cap` - National weekly capacity
- `wt_coh_w` - Weighted COH contribution

---

## Key Causal Relationships for Modeling

### PRIMARY CAUSAL PATHWAY:
```
External Factors → Capacity Constraints → Capped Out Hours
```

### CONTEXTUAL VARIABLES:
- Geographic and temporal variables for proper identification strategy
- Organizational hierarchy for multilevel modeling
- Historical weather patterns for temporal confounding control

---

## MODELING CONSIDERATIONS:

- **Time Dependencies**: Use temporal ordering to constrain causal relationships: a cause must occur earlier than its effect
- **Causal Identification Strategy**: You will be provided with only the variable names. These are the nodes of the causal Directed Acyclic Graph (DAG) you are asked to create.

═══════════════════════════════════════════════════════════════════════════════
# KNOWLEDGE BASE USAGE GUIDE FOR LLM AGENTS
═══════════════════════════════════════════════════════════════════════════════

## Section Organization:
- **Section I**: Foundational causal inference concepts and falsification limitations
- **Section II**: Complete falsify_graph() function documentation and usage
- **Section III**: Domain-specific COH knowledge and operational context

## When to Reference Each Section:
- **Section I**: For understanding causal inference principles and interpretation caveats
- **Section II**: For falsify_graph() implementation, parameters, and results interpretation
- **Section III**: For COH domain context, variable meanings, and operational insights

## Key Integration Points:
- Use Section I principles when interpreting Section II results
- Apply Section III domain knowledge when specifying causal graphs
- Combine all sections for comprehensive causal analysis workflows
