Starting primary regression analysis...
Timestamp: 20250707_120331
Using input file: outputs/step3_heterogeneity_metrics_20250707_115126.csv
Loading data from: outputs/step3_heterogeneity_metrics_20250707_115126.csv
Dataset shape: (1083, 34)
All columns: ['user_id', 'age', 'gender', 'education_level', 'country', 'test_run_id', 'battery_id', 'time_of_day', 'grand_index', 'subtest_36_score', 'subtest_39_score', 'subtest_40_score', 'subtest_29_score', 'subtest_28_score', 'subtest_33_score', 'subtest_30_score', 'subtest_27_score', 'subtest_32_score', 'subtest_38_score', 'subtest_37_score', 'age_bin', 'percentile_36', 'percentile_39', 'percentile_40', 'percentile_29', 'percentile_28', 'percentile_33', 'percentile_30', 'percentile_27', 'percentile_32', 'percentile_38', 'percentile_37', 'percentile_range', 'percentile_iqr']
First 3 rows:
   user_id  age gender  ...  percentile_37 percentile_range  percentile_iqr
0    68983   50      m  ...      67.840376        44.366197       18.779343
1   106315   23      m  ...      60.410095        76.025237       32.255521
2   334338   60      m  ...      27.898551        65.217391       42.572464

[3 rows x 34 columns]
All required columns present
Unique education_level values: [1, 2, 3, 4, 5, 6, 7, 8]
Unique gender values: ['m' 'f']
Unique country values: ['US' 'NZ' 'AU' 'CA']
Unique time_of_day values: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
Unique age_bin values: ['50-59' '18-29' '60-69' '40-49' '30-39' '70-99']
Unique battery_id values: [26]
percentile_range data type: float64
percentile_iqr data type: float64
age data type: int64
grand_index data type: float64
Cleaning data...
Initial number of rows: 1083
Missing values in education_level: 0
Missing values in age: 0
Missing values in gender: 0
Missing values in country: 0
Missing values in time_of_day: 0
Missing values in percentile_range: 0
Missing values in percentile_iqr: 0
Final number of rows: 1083
Excluded rows due to missing values: 0
Preparing data for regression analysis...
Education level distribution:
education_level
1      9
2     95
3    240
4    396
5     63
6    190
7     22
8     68
Name: count, dtype: int64
Gender distribution:
gender
f    546
m    537
Name: count, dtype: int64
Country distribution:
country
US    896
CA    109
AU     65
NZ     13
Name: count, dtype: int64
Time of day distribution:
time_of_day
10    94
20    77
12    76
11    69
22    67
15    65
14    63
21    63
9     58
13    58
17    57
19    56
18    55
16    52
8     46
23    37
7     23
0     13
6     11
4     10
5     10
2      9
1      8
3      6
Name: count, dtype: int64
Country distribution (binarized):
country
US       896
Other    187
Name: count, dtype: int64
Time of day binned distribution:
time_of_day_binned
Afternoon    371
Evening      355
Morning      311
Night         46
Name: count, dtype: int64
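
The log does not print the cut points used for `time_of_day_binned` or the rule used to binarize `country`. The sketch below is an assumption inferred from the counts above: hours 0-4 as Night, 5-11 as Morning, 12-17 as Afternoon, 18-23 as Evening, and every non-US country collapsed into `Other`. The function names `bin_time_of_day` and `binarize_country` are hypothetical and are not taken from the script.

```python
from collections import Counter

# Hourly counts copied from the "Time of day distribution" printed above.
HOUR_COUNTS = {
    0: 13, 1: 8, 2: 9, 3: 6, 4: 10, 5: 10, 6: 11, 7: 23,
    8: 46, 9: 58, 10: 94, 11: 69, 12: 76, 13: 58, 14: 63, 15: 65,
    16: 52, 17: 57, 18: 55, 19: 56, 20: 77, 21: 63, 22: 67, 23: 37,
}


def bin_time_of_day(hour: int) -> str:
    """Map an hour (0-23) to a bin. Cut points are inferred, not confirmed."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour out of range: {hour}")
    if hour <= 4:
        return "Night"
    if hour <= 11:
        return "Morning"
    if hour <= 17:
        return "Afternoon"
    return "Evening"


def binarize_country(country: str) -> str:
    """Collapse every non-US country into a single 'Other' level."""
    return "US" if country == "US" else "Other"


# Aggregate the hourly counts into the four assumed bins.
binned = Counter()
for hour, n in HOUR_COUNTS.items():
    binned[bin_time_of_day(hour)] += n

print(dict(binned))
```

Under these assumed cut points the aggregated counts match the binned distribution in the log (Night 46, Morning 311, Afternoon 371, Evening 355), which suggests, but does not prove, that the script used the same boundaries.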

============================================================
MODEL 1: PERCENTILE RANGE
============================================================

Running regression model: Model_1_percentile_range
Dependent variable: percentile_range
Formula: percentile_range ~ C(education_level, Treatment(4)) + age + C(gender) + C(country, Treatment('US')) + C(time_of_day_binned, Treatment('Morning'))
Model summary:
                            OLS Regression Results                            
==============================================================================
Dep. Variable:       percentile_range   R-squared:                       0.005
Model:                            OLS   Adj. R-squared:                 -0.007
Method:                 Least Squares   F-statistic:                    0.4483
Date:                Mon, 07 Jul 2025   Prob (F-statistic):              0.952
Time:                        12:03:31   Log-Likelihood:                -4365.6
No. Observations:                1083   AIC:                             8759.
Df Residuals:                    1069   BIC:                             8829.
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
============================================================================================================================
                                                               coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------------------------------------
Intercept                                                   74.0267      1.728     42.840      0.000      70.636      77.417
C(education_level, Treatment(4))[T.1]                       -7.3954      4.655     -1.589      0.112     -16.529       1.738
C(education_level, Treatment(4))[T.2]                        1.7804      1.584      1.124      0.261      -1.327       4.888
C(education_level, Treatment(4))[T.3]                       -0.1016      1.127     -0.090      0.928      -2.313       2.109
C(education_level, Treatment(4))[T.5]                        0.5254      1.880      0.280      0.780      -3.163       4.214
C(education_level, Treatment(4))[T.6]                        0.2723      1.227      0.222      0.824      -2.136       2.680
C(education_level, Treatment(4))[T.7]                       -0.1198      3.021     -0.040      0.968      -6.047       5.807
C(education_level, Treatment(4))[T.8]                        0.4582      1.807      0.254      0.800      -3.087       4.003
C(gender)[T.m]                                              -0.6116      0.886     -0.690      0.490      -2.350       1.127
C(country, Treatment('US'))[T.Other]                         0.6140      1.138      0.539      0.590      -1.620       2.848
C(time_of_day_binned, Treatment('Morning'))[T.Night]        -0.7892      2.177     -0.362      0.717      -5.061       3.483
C(time_of_day_binned, Treatment('Morning'))[T.Afternoon]     0.0152      1.060      0.014      0.989      -2.064       2.094
C(time_of_day_binned, Treatment('Morning'))[T.Evening]      -0.0358      1.073     -0.033      0.973      -2.140       2.069
age                                                          0.0162      0.030      0.546      0.585      -0.042       0.074
==============================================================================
Omnibus:                      114.614   Durbin-Watson:                   1.930
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              151.236
Skew:                          -0.853   Prob(JB):                     1.44e-33
Kurtosis:                       3.663   Cond. No.                         499.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Standardized betas calculated successfully
Shapiro-Wilk test: statistic=0.9543, p-value=7.3544e-18
Breusch-Pagan test: statistic=9.4945, p-value=0.7347
VIF test: max VIF=1.4590
VIF values:
                                             Variable       VIF
0               C(education_level, Treatment(4))[T.1]  1.027847
1               C(education_level, Treatment(4))[T.2]  1.155301
2               C(education_level, Treatment(4))[T.3]  1.260730
3               C(education_level, Treatment(4))[T.5]  1.114152
4               C(education_level, Treatment(4))[T.6]  1.253945
5               C(education_level, Treatment(4))[T.7]  1.045288
6               C(education_level, Treatment(4))[T.8]  1.105418
7                                      C(gender)[T.m]  1.129426
8                C(country, Treatment('US'))[T.Other]  1.065775
9   C(time_of_day_binned, Treatment('Morning'))[T....  1.109802
10  C(time_of_day_binned, Treatment('Morning'))[T....  1.455659
11  C(time_of_day_binned, Treatment('Morning'))[T....  1.458951
12                                                age  1.167839
Durbin-Watson test: statistic=1.9299
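
For readers unfamiliar with the statistic, the Durbin-Watson value is the sum of squared successive residual differences divided by the residual sum of squares. Values near 2 indicate little first-order autocorrelation. The sketch below is a plain-Python illustration of that standard definition. It is not the script's own code, and the residuals are toy values made up for the example.

```python
from typing import Sequence


def durbin_watson(residuals: Sequence[float]) -> float:
    """Sum of squared successive differences divided by the sum of squares."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    if denominator == 0:
        raise ValueError("residuals are all zero")
    return numerator / denominator


# Toy residuals (illustrative only, not taken from this run).
alternating = [1.0, -1.0, 1.0, -1.0]  # sign flips: negative autocorrelation
constant = [1.0, 1.0, 1.0, 1.0]       # no change: positive autocorrelation

print(durbin_watson(alternating))  # 12 / 4 = 3.0, above 2
print(durbin_watson(constant))     # 0 / 4 = 0.0, below 2
```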

============================================================
MODEL 2: PERCENTILE IQR
============================================================

Running regression model: Model_2_percentile_iqr
Dependent variable: percentile_iqr
Formula: percentile_iqr ~ C(education_level, Treatment(4)) + age + C(gender) + C(country, Treatment('US')) + C(time_of_day_binned, Treatment('Morning'))
Model summary:
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         percentile_iqr   R-squared:                       0.012
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                    0.9817
Date:                Mon, 07 Jul 2025   Prob (F-statistic):              0.467
Time:                        12:03:31   Log-Likelihood:                -4185.0
No. Observations:                1083   AIC:                             8398.
Df Residuals:                    1069   BIC:                             8468.
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
============================================================================================================================
                                                               coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------------------------------------
Intercept                                                   34.9546      1.462     23.901      0.000      32.085      37.824
C(education_level, Treatment(4))[T.1]                       -2.6638      3.940     -0.676      0.499     -10.394       5.067
C(education_level, Treatment(4))[T.2]                        0.3739      1.340      0.279      0.780      -2.256       3.004
C(education_level, Treatment(4))[T.3]                       -0.5454      0.954     -0.572      0.568      -2.417       1.326
C(education_level, Treatment(4))[T.5]                        0.4984      1.591      0.313      0.754      -2.623       3.620
C(education_level, Treatment(4))[T.6]                       -1.0296      1.039     -0.991      0.322      -3.068       1.008
C(education_level, Treatment(4))[T.7]                        1.6634      2.557      0.651      0.515      -3.353       6.680
C(education_level, Treatment(4))[T.8]                       -0.9248      1.529     -0.605      0.545      -3.925       2.075
C(gender)[T.m]                                               1.3835      0.750      1.845      0.065      -0.088       2.855
C(country, Treatment('US'))[T.Other]                         0.0775      0.964      0.080      0.936      -1.813       1.968
C(time_of_day_binned, Treatment('Morning'))[T.Night]        -3.8589      1.843     -2.094      0.036      -7.475      -0.243
C(time_of_day_binned, Treatment('Morning'))[T.Afternoon]    -1.0950      0.897     -1.221      0.222      -2.855       0.665
C(time_of_day_binned, Treatment('Morning'))[T.Evening]      -1.3837      0.908     -1.524      0.128      -3.165       0.397
age                                                         -0.0076      0.025     -0.302      0.763      -0.057       0.042
==============================================================================
Omnibus:                        9.315   Durbin-Watson:                   2.010
Prob(Omnibus):                  0.009   Jarque-Bera (JB):                9.368
Skew:                           0.212   Prob(JB):                      0.00924
Kurtosis:                       2.832   Cond. No.                         499.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Standardized betas calculated successfully
Shapiro-Wilk test: statistic=0.9950, p-value=0.0012
Breusch-Pagan test: statistic=9.3037, p-value=0.7496
VIF test: max VIF=1.4590
VIF values:
                                             Variable       VIF
0               C(education_level, Treatment(4))[T.1]  1.027847
1               C(education_level, Treatment(4))[T.2]  1.155301
2               C(education_level, Treatment(4))[T.3]  1.260730
3               C(education_level, Treatment(4))[T.5]  1.114152
4               C(education_level, Treatment(4))[T.6]  1.253945
5               C(education_level, Treatment(4))[T.7]  1.045288
6               C(education_level, Treatment(4))[T.8]  1.105418
7                                      C(gender)[T.m]  1.129426
8                C(country, Treatment('US'))[T.Other]  1.065775
9   C(time_of_day_binned, Treatment('Morning'))[T....  1.109802
10  C(time_of_day_binned, Treatment('Morning'))[T....  1.455659
11  C(time_of_day_binned, Treatment('Morning'))[T....  1.458951
12                                                age  1.167839
Durbin-Watson test: statistic=2.0103
Saving results to CSV files...
Regression results saved to: outputs/step4_primary_regression_results_20250707_120331.csv
Regression results columns: ['model_name', 'dependent_variable', 'predictor_variable', 'coefficient', 'std_error', 'standardized_beta', 't_statistic', 'p_value', 'conf_int_lower', 'conf_int_upper', 'significant_bonferroni']
First 2 rows of regression results:
                 model_name  ... significant_bonferroni
0  Model_1_percentile_range  ...                   True
1  Model_1_percentile_range  ...                  False

[2 rows x 11 columns]
Assumption test results saved to: outputs/step4_assumption_tests_20250707_120331.csv
Assumption test results columns: ['model_name', 'test_name', 'test_statistic', 'p_value', 'assumption_met']
First 2 rows of assumption test results:
                 model_name      test_name  ...       p_value  assumption_met
0  Model_1_percentile_range   Shapiro-Wilk  ...  7.354439e-18           False
1  Model_1_percentile_range  Breusch-Pagan  ...  7.346561e-01            True

[2 rows x 5 columns]
Model fit results saved to: outputs/step4_model_fit_20250707_120331.csv
Model fit results columns: ['model_name', 'dependent_variable', 'r_squared', 'adjusted_r_squared', 'f_statistic', 'f_p_value', 'aic', 'bic', 'n_observations']
First 2 rows of model fit results:
                 model_name dependent_variable  ...          bic  n_observations
0  Model_1_percentile_range   percentile_range  ...  8829.097765            1083
1    Model_2_percentile_iqr     percentile_iqr  ...  8467.771031            1083

[2 rows x 9 columns]
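
As a cross-check, the adjusted R-squared values in the two model summaries can be approximately reproduced from the printed R-squared, the 1083 observations, and Df Model = 13. This is a minimal sketch using the usual adjustment formula. The inputs are the rounded values from the log, so agreement is only expected to three decimals.

```python
def adjusted_r_squared(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


N_OBS = 1083        # "No. Observations" in both summaries
N_PREDICTORS = 13   # "Df Model" in both summaries

# R-squared values as printed (rounded to three decimals in the log).
adj_model_1 = adjusted_r_squared(0.005, N_OBS, N_PREDICTORS)
adj_model_2 = adjusted_r_squared(0.012, N_OBS, N_PREDICTORS)

print(f"Model 1: {adj_model_1:.3f}")
print(f"Model 2: {adj_model_2:.3f}")
```

Both results round to the reported values (-0.007 and -0.000). Neither model explains more variance than would be expected from the number of predictors alone.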

============================================================
ANALYSIS SUMMARY
============================================================
Total regression coefficients analyzed: 28
Total assumption tests performed: 8
Total models fitted: 2

Significant results (Bonferroni corrected, α = 0.025): 2
  Model_1_percentile_range: Intercept (coef = 74.0267, p < 0.0001)
  Model_2_percentile_iqr: Intercept (coef = 34.9546, p < 0.0001)
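
The threshold of 0.025 is consistent with a family-wise α of 0.05 divided across the two models. That derivation is an assumption, since the log states only the corrected value. The sketch below applies the threshold to the three smallest non-intercept p-values printed in the summaries, to show why no predictor is flagged. The dictionary keys are shortened labels chosen for this example.

```python
FAMILY_ALPHA = 0.05  # assumed family-wise level; the log prints only 0.025
N_MODELS = 2         # "Total models fitted: 2"
alpha = FAMILY_ALPHA / N_MODELS

# Smallest non-intercept p-values, copied from the two summary tables.
p_values = {
    ("Model_2_percentile_iqr", "time_of_day_binned[T.Night]"): 0.036,
    ("Model_2_percentile_iqr", "gender[T.m]"): 0.065,
    ("Model_1_percentile_range", "education_level[T.1]"): 0.112,
}

# A term is flagged only if its p-value falls below the corrected threshold.
significant = {term: p < alpha for term, p in p_values.items()}

for (model, predictor), flag in significant.items():
    print(f"{model} {predictor}: significant={flag}")
```

The Night term in Model 2 (p = 0.036) would pass an uncorrected 0.05 threshold but not the corrected one, which is why only the two intercepts are listed.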

Assumption test summary:
  Model_1_percentile_range - Shapiro-Wilk: FAIL (p = 7.35e-18)
  Model_1_percentile_range - Breusch-Pagan: PASS (p = 0.7347)
  Model_1_percentile_range - VIF: PASS (statistic = 1.4590)
  Model_1_percentile_range - Durbin-Watson: PASS (statistic = 1.9299)
  Model_2_percentile_iqr - Shapiro-Wilk: FAIL (p = 0.0012)
  Model_2_percentile_iqr - Breusch-Pagan: PASS (p = 0.7496)
  Model_2_percentile_iqr - VIF: PASS (statistic = 1.4590)
  Model_2_percentile_iqr - Durbin-Watson: PASS (statistic = 2.0103)
Finished execution

