
============================================================
Starting Battery 26 data processing pipeline...
============================================================
Loading data from: battery26_df.csv
Data loaded successfully. Shape: (14811, 11)
Columns: ['user_id', 'age', 'gender', 'education_level', 'country', 'test_run_id', 'battery_id', 'specific_subtest_id', 'raw_score', 'time_of_day', 'grand_index']
First 3 rows:
   user_id   age gender  ...  raw_score time_of_day  grand_index
0    68983  50.0      m  ...       10.0          13   106.315883
1    68983  50.0      m  ...       19.0          13   106.315883
2    68983  50.0      m  ...       28.0          13   106.315883

[3 rows x 11 columns]
All required columns present.
Unique battery_id values: [26]
Unique specific_subtest_id values: [27, 28, 29, 30, 32, 33, 36, 37, 38, 39, 40]
Unique time_of_day values: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
Unique gender values: ['m' nan 'f']
Unique education_level values: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 99.0, nan]
Age data type: float64
Raw_score data type: float64
Grand_index data type: float64
Expected subtests: [36, 39, 40, 29, 28, 33, 30, 27, 32, 38, 37]
Actual subtests: [27, 28, 29, 30, 32, 33, 36, 37, 38, 39, 40]
All expected subtests are present.
Reshaping data from long to wide format...
Removed 0 rows with null values in critical columns
Wide format data shape: (1504, 20)
Wide format columns: ['user_id', 'age', 'gender', 'education_level', 'country', 'test_run_id', 'battery_id', 'time_of_day', 'grand_index', 'subtest_27_score', 'subtest_28_score', 'subtest_29_score', 'subtest_30_score', 'subtest_32_score', 'subtest_33_score', 'subtest_36_score', 'subtest_37_score', 'subtest_38_score', 'subtest_39_score', 'subtest_40_score']
First 5 rows of wide format data:
   user_id   age gender  education_level country  test_run_id  battery_id  \
0    68983  50.0      m              6.0      US       251259          26   
1    96745  35.0    NaN              2.0      US       595741          26   
2   106315  23.0      m              8.0      US       614129          26   
3   314623  25.0    NaN              2.0      AU       169122          26   
4   334338  60.0      m              4.0      US       167761          26   

   time_of_day  grand_index  subtest_27_score  subtest_28_score  \
0           13   106.315883               8.0               6.0   
1            8   104.653922               8.0               6.0   
2           19   107.930416              10.0               7.0   
3           10   109.741965               9.0               6.0   
4            5    93.832528               7.0               6.0   

   subtest_29_score  subtest_30_score  subtest_32_score  subtest_33_score  \
0              17.0               9.0             392.0               5.0   
1              21.0              10.0             403.0               4.0   
2              13.0               7.0             397.0               6.0   
3              17.0               3.0             397.0               7.0   
4              16.0               8.0             466.0               4.0   

   subtest_36_score  subtest_37_score  subtest_38_score  subtest_39_score  \
0              10.0              10.0              40.0              19.0   
1               8.0               8.0              45.0              17.0   
2              10.0              10.0              48.0              21.0   
3               7.0               5.0              67.0              13.0   
4               8.0               7.0              40.0              20.0   

   subtest_40_score  
0              28.0  
1              27.0  
2              43.0  
3              25.0  
4              28.0  
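
The long-to-wide reshape logged above (14811 long rows to 1504 wide rows, one per test run) can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the helper name `to_wide`, the choice of `test_run_id` as the pivot key, and the toy rows are all assumptions. Pivoting on the key alone and merging the demographics back keeps rows whose `gender` is NaN, as seen in the wide preview.

```python
import pandas as pd

# Hypothetical long-format sample; column names follow the log above.
long_df = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "test_run_id": [10, 10, 20, 20],
    "age": [50.0, 50.0, 35.0, 35.0],
    "gender": ["m", "m", None, None],
    "specific_subtest_id": [27, 28, 27, 28],
    "raw_score": [8.0, 6.0, 9.0, 7.0],
})

def to_wide(df, key="test_run_id", meta_cols=("user_id", "age", "gender")):
    """Pivot one row per (run, subtest) into one row per run (assumed helper)."""
    # One column per subtest; 'first' keeps the first score if a subtest repeats.
    scores = df.pivot_table(index=key, columns="specific_subtest_id",
                            values="raw_score", aggfunc="first")
    scores.columns = [f"subtest_{int(c)}_score" for c in scores.columns]
    # Demographics are constant within a run, so keep one copy per key.
    meta = df[[key, *meta_cols]].drop_duplicates(subset=key)
    return meta.merge(scores.reset_index(), on=key, how="left")

wide_df = to_wide(long_df)
print(wide_df)
```
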

============================================================
Applying exclusion criteria...
============================================================
Initial participant count: 1504

=== TIMING-BASED EXCLUSIONS ===
Excluded 0 participants taking >15 minutes on Trail Making A or B

=== COMPLETION-BASED EXCLUSIONS ===
Excluded 318 participants missing grand_index scores
Excluded 75 participants missing essential demographic data
Excluded 19 participants with education_level = 99
Excluded 0 participants missing >2 subtests
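
A sequential exclusion pass like the one logged above might look like the sketch below. The function name `apply_exclusions`, the rule list, and the toy data are hypothetical, and which demographic columns count as "essential" is an assumption, since the log does not list them.

```python
import pandas as pd

def apply_exclusions(df, rules):
    """Apply (label, mask_fn) rules in order and record how many rows each drops."""
    log = []
    for label, mask_fn in rules:
        mask = mask_fn(df)  # True marks a row to exclude
        log.append({"exclusion_criteria": label, "total_excluded": int(mask.sum())})
        df = df[~mask]
    return df, pd.DataFrame(log)

# Assumed rule definitions mirroring the messages in the log.
rules = [
    ("Missing grand_index", lambda d: d["grand_index"].isna()),
    ("Missing demographics",
     lambda d: d[["age", "gender", "education_level", "country"]].isna().any(axis=1)),
    ("Education level 99", lambda d: d["education_level"] == 99),
]

# Toy data: one row trips each rule, two rows survive.
toy = pd.DataFrame({
    "grand_index": [100.0, None, 95.0, 105.0, 110.0],
    "age": [30, 40, None, 50, 60],
    "gender": ["m", "f", "f", "m", "f"],
    "education_level": [4, 5, 6, 99, 3],
    "country": ["US"] * 5,
})

cleaned, exclusion_log = apply_exclusions(toy, rules)
print(exclusion_log)
```

Because each rule only sees rows that survived the earlier ones, every participant is counted at most once, which is consistent with the counts in this log: 1504 - 318 - 75 - 19 - 9 = 1083.
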

=== PERFORMANCE VALIDITY EXCLUSIONS ===
Excluded 0 participants with identical scores across ≥8 subtests

=== REVERSE SCORING TIME-BASED SUBTESTS ===
Age bin distribution:
age_bin
18-29    320
30-39    216
40-49    170
50-59    215
60-69    139
70-99     32
Name: count, dtype: int64
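
The age bins above could be produced with `pandas.cut`. The bin edges below are an assumption inferred from the labels and from the sample rows in this log (for example, age 50 falls in `50-59` and age 39 in `30-39`); the ages are toy values.

```python
import pandas as pd

# Left-closed intervals: [18, 30), [30, 40), ..., [70, 100).
bins = [18, 30, 40, 50, 60, 70, 100]
labels = ["18-29", "30-39", "40-49", "50-59", "60-69", "70-99"]

ages = pd.Series([23, 29, 30, 49, 50, 60, 90])
age_bin = pd.cut(ages, bins=bins, labels=labels, right=False)
print(age_bin.value_counts().sort_index())
```
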

Reverse scoring subtest_32_score...
  Age bin 18-29: max score = 673.0
  Age bin 30-39: max score = 665.0
  Age bin 40-49: max score = 804.0
  Age bin 50-59: max score = 778.0
  Age bin 60-69: max score = 653.0
  Age bin 70-99: max score = 736.0

Reverse scoring subtest_39_score...
  Age bin 18-29: max score = 64.0
  Age bin 30-39: max score = 47.0
  Age bin 40-49: max score = 177.0
  Age bin 50-59: max score = 74.0
  Age bin 60-69: max score = 182.0
  Age bin 70-99: max score = 94.0

Reverse scoring subtest_40_score...
  Age bin 18-29: max score = 98.0
  Age bin 30-39: max score = 105.0
  Age bin 40-49: max score = 78.0
  Age bin 50-59: max score = 130.0
  Age bin 60-69: max score = 181.0
  Age bin 70-99: max score = 203.0
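
The reverse scoring step can be sketched as below. The formula `bin_max - score + 1` is inferred from the sample rows in this log rather than taken from the pipeline source: for instance, a `subtest_32_score` of 392 in the 50-59 bin (max 778) appears as 387 after reversal, and 397 in the 18-29 bin (max 673) appears as 277. The helper name `reverse_score` and the toy frame are hypothetical.

```python
import pandas as pd

def reverse_score(df, col, group_col="age_bin"):
    """Reverse a time-based score within each age bin so higher means better."""
    # Broadcast each bin's maximum back onto its rows.
    bin_max = df.groupby(group_col, observed=True)[col].transform("max")
    # The +1 offset keeps the slowest participant in a bin at 1 instead of 0.
    return bin_max - df[col] + 1

toy = pd.DataFrame({
    "age_bin": ["50-59", "50-59", "18-29", "18-29"],
    "subtest_32_score": [392.0, 778.0, 397.0, 673.0],
})
toy["subtest_32_reversed"] = reverse_score(toy, "subtest_32_score")
print(toy)
```
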

First 5 rows after reverse scoring:
   user_id  age gender  education_level country  test_run_id  battery_id  \
0    68983   50      m                6      US       251259          26   
2   106315   23      m                8      US       614129          26   
4   334338   60      m                4      US       167761          26   
5   477486   49      f                1      NZ       284941          26   
6   486561   39      f                2      NZ       720979          26   

   time_of_day  grand_index  subtest_27_score  subtest_28_score  \
0           13   106.315883               8.0               6.0   
2           19   107.930416              10.0               7.0   
4            5    93.832528               7.0               6.0   
5           20   111.918125               9.0               6.0   
6            6   102.732932               8.0               7.0   

   subtest_29_score  subtest_30_score  subtest_32_score  subtest_33_score  \
0              17.0               9.0             387.0               5.0   
2              13.0               7.0             277.0               6.0   
4              16.0               8.0             188.0               4.0   
5              19.0              13.0             360.0               7.0   
6              10.0               5.0              88.0               6.0   

   subtest_36_score  subtest_37_score  subtest_38_score  subtest_39_score  \
0              10.0              10.0              40.0              56.0   
2              10.0              10.0              48.0              44.0   
4               8.0               7.0              40.0             163.0   
5              12.0              11.0              42.0             150.0   
6              11.0              12.0              51.0              28.0   

   subtest_40_score age_bin  
0             103.0   50-59  
2              56.0   18-29  
4             154.0   60-69  
5              45.0   40-49  
6              71.0   30-39  

=== OUTLIER EXCLUSIONS ===

Processing age bin: 18-29
  subtest_36_score: 4 outliers detected
  subtest_39_score: 5 outliers detected
  subtest_40_score: 5 outliers detected
  subtest_29_score: 1 outliers detected
  subtest_28_score: 0 outliers detected
  subtest_33_score: 2 outliers detected
  subtest_30_score: 1 outliers detected
  subtest_27_score: 1 outliers detected
  subtest_32_score: 3 outliers detected
  subtest_38_score: 0 outliers detected
  subtest_37_score: 2 outliers detected

Processing age bin: 30-39
  subtest_36_score: 4 outliers detected
  subtest_39_score: 5 outliers detected
  subtest_40_score: 4 outliers detected
  subtest_29_score: 1 outliers detected
  subtest_28_score: 0 outliers detected
  subtest_33_score: 1 outliers detected
  subtest_30_score: 1 outliers detected
  subtest_27_score: 0 outliers detected
  subtest_32_score: 2 outliers detected
  subtest_38_score: 0 outliers detected
  subtest_37_score: 1 outliers detected

Processing age bin: 40-49
  subtest_36_score: 2 outliers detected
  subtest_39_score: 2 outliers detected
  subtest_40_score: 3 outliers detected
  subtest_29_score: 1 outliers detected
  subtest_28_score: 0 outliers detected
  subtest_33_score: 0 outliers detected
  subtest_30_score: 0 outliers detected
  subtest_27_score: 0 outliers detected
  subtest_32_score: 2 outliers detected
  subtest_38_score: 0 outliers detected
  subtest_37_score: 0 outliers detected

Processing age bin: 50-59
  subtest_36_score: 5 outliers detected
  subtest_39_score: 6 outliers detected
  subtest_40_score: 6 outliers detected
  subtest_29_score: 1 outliers detected
  subtest_28_score: 0 outliers detected
  subtest_33_score: 0 outliers detected
  subtest_30_score: 0 outliers detected
  subtest_27_score: 1 outliers detected
  subtest_32_score: 6 outliers detected
  subtest_38_score: 4 outliers detected
  subtest_37_score: 1 outliers detected

Processing age bin: 60-69
  subtest_36_score: 3 outliers detected
  subtest_39_score: 3 outliers detected
  subtest_40_score: 2 outliers detected
  subtest_29_score: 0 outliers detected
  subtest_28_score: 0 outliers detected
  subtest_33_score: 0 outliers detected
  subtest_30_score: 0 outliers detected
  subtest_27_score: 1 outliers detected
  subtest_32_score: 0 outliers detected
  subtest_38_score: 0 outliers detected
  subtest_37_score: 0 outliers detected

Processing age bin: 70-99
  subtest_36_score: 1 outliers detected
  subtest_39_score: 0 outliers detected
  subtest_40_score: 1 outliers detected
  subtest_29_score: 0 outliers detected
  subtest_28_score: 0 outliers detected
  subtest_33_score: 0 outliers detected
  subtest_30_score: 0 outliers detected
  subtest_27_score: 0 outliers detected
  subtest_32_score: 0 outliers detected
  subtest_38_score: 0 outliers detected
  subtest_37_score: 0 outliers detected

Excluded 9 participants flagged as outliers on ≥2 subtests
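
The log does not say how an outlier is defined, only that participants flagged on two or more subtests are excluded. The sketch below assumes a 1.5 × IQR fence computed within each age bin, which is a stand-in rather than the pipeline's confirmed rule; the function names and toy scores are hypothetical.

```python
import pandas as pd

def flag_outliers(df, score_cols, group_col="age_bin", k=1.5):
    """Return a boolean frame marking scores outside the per-bin IQR fences."""
    flags = pd.DataFrame(False, index=df.index, columns=score_cols)
    for _, grp in df.groupby(group_col, observed=True):
        for col in score_cols:
            q1, q3 = grp[col].quantile([0.25, 0.75])
            iqr = q3 - q1
            out = (grp[col] < q1 - k * iqr) | (grp[col] > q3 + k * iqr)
            flags.loc[grp.index, col] = out
    return flags

def drop_multi_outliers(df, score_cols, min_flags=2, **kwargs):
    """Drop participants flagged on at least `min_flags` subtests."""
    flags = flag_outliers(df, score_cols, **kwargs)
    return df[flags.sum(axis=1) < min_flags]

# Toy data: row 5 is extreme on both subtests, row 6 on only one.
toy = pd.DataFrame({
    "age_bin": ["18-29"] * 7,
    "subtest_36_score": [10.0, 11.0, 12.0, 11.0, 10.0, 100.0, -50.0],
    "subtest_39_score": [5.0, 6.0, 5.0, 6.0, 5.0, 50.0, 6.0],
})
cols = ["subtest_36_score", "subtest_39_score"]
flag_counts = flag_outliers(toy, cols).sum(axis=1)
kept = drop_multi_outliers(toy, cols)
print(kept)
```

Counting flags per participant before dropping anyone means a single extreme score does not remove a participant, which matches the "≥2 subtests" wording of the log line above.
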

First 5 rows after outlier exclusion:
   user_id  age gender  education_level country  test_run_id  battery_id  \
0    68983   50      m                6      US       251259          26   
2   106315   23      m                8      US       614129          26   
4   334338   60      m                4      US       167761          26   
5   477486   49      f                1      NZ       284941          26   
6   486561   39      f                2      NZ       720979          26   

   time_of_day  grand_index  subtest_27_score  subtest_28_score  \
0           13   106.315883               8.0               6.0   
2           19   107.930416              10.0               7.0   
4            5    93.832528               7.0               6.0   
5           20   111.918125               9.0               6.0   
6            6   102.732932               8.0               7.0   

   subtest_29_score  subtest_30_score  subtest_32_score  subtest_33_score  \
0              17.0               9.0             387.0               5.0   
2              13.0               7.0             277.0               6.0   
4              16.0               8.0             188.0               4.0   
5              19.0              13.0             360.0               7.0   
6              10.0               5.0              88.0               6.0   

   subtest_36_score  subtest_37_score  subtest_38_score  subtest_39_score  \
0              10.0              10.0              40.0              56.0   
2              10.0              10.0              48.0              44.0   
4               8.0               7.0              40.0             163.0   
5              12.0              11.0              42.0             150.0   
6              11.0              12.0              51.0              28.0   

   subtest_40_score age_bin  
0             103.0   50-59  
2              56.0   18-29  
4             154.0   60-69  
5              45.0   40-49  
6              71.0   30-39  

=== CREATING FINAL DATASET ===
Final dataset shape: (1083, 21)
Final dataset columns: ['user_id', 'test_run_id', 'age', 'gender', 'education_level', 'country', 'battery_id', 'time_of_day', 'grand_index', 'subtest_36_score', 'subtest_39_score', 'subtest_40_score', 'subtest_29_score', 'subtest_28_score', 'subtest_33_score', 'subtest_30_score', 'subtest_27_score', 'subtest_32_score', 'subtest_38_score', 'subtest_37_score', 'age_bin']
First 5 rows of final dataset:
   user_id  test_run_id  age gender  education_level country  battery_id  \
0    68983       251259   50      m                6      US          26   
2   106315       614129   23      m                8      US          26   
4   334338       167761   60      m                4      US          26   
5   477486       284941   49      f                1      NZ          26   
6   486561       720979   39      f                2      NZ          26   

   time_of_day  grand_index  subtest_36_score  subtest_39_score  \
0           13   106.315883              10.0              56.0   
2           19   107.930416              10.0              44.0   
4            5    93.832528               8.0             163.0   
5           20   111.918125              12.0             150.0   
6            6   102.732932              11.0              28.0   

   subtest_40_score  subtest_29_score  subtest_28_score  subtest_33_score  \
0             103.0              17.0               6.0               5.0   
2              56.0              13.0               7.0               6.0   
4             154.0              16.0               6.0               4.0   
5              45.0              19.0               6.0               7.0   
6              71.0              10.0               7.0               6.0   

   subtest_30_score  subtest_27_score  subtest_32_score  subtest_38_score  \
0               9.0               8.0             387.0              40.0   
2               7.0              10.0             277.0              48.0   
4               8.0               7.0             188.0              40.0   
5              13.0               9.0             360.0              42.0   
6               5.0               8.0              88.0              51.0   

   subtest_37_score age_bin  
0              10.0   50-59  
2              10.0   18-29  
4               7.0   60-69  
5              11.0   40-49  
6              12.0   30-39  

============================================================
SAVING OUTPUTS
============================================================
Saved final dataset to: outputs/step1_cleaned_battery26_data_20250707_113751.csv
Saved exclusion log to: outputs/step1_exclusion_log_20250707_113751.csv

Exclusion Summary:
                 exclusion_criteria  total_excluded
0          Trail Making >15 minutes               0
1               Missing grand_index             318
2              Missing demographics              75
3                Education level 99              19
4               Missing >2 subtests               0
5      Identical scores ≥8 subtests               0
6  Statistical outliers ≥2 subtests               9

Final participant count: 1083

============================================================
FINAL SUMMARY STATISTICS
============================================================
Initial participant count: 1504

Final dataset info:
<class 'pandas.core.frame.DataFrame'>
Index: 1083 entries, 0 to 1503
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   user_id           1083 non-null   int64   
 1   test_run_id       1083 non-null   int64   
 2   age               1083 non-null   int64   
 3   gender            1083 non-null   object  
 4   education_level   1083 non-null   int64   
 5   country           1083 non-null   object  
 6   battery_id        1083 non-null   int64   
 7   time_of_day       1083 non-null   int64   
 8   grand_index       1083 non-null   float64 
 9   subtest_36_score  1083 non-null   float64 
 10  subtest_39_score  1083 non-null   float64 
 11  subtest_40_score  1083 non-null   float64 
 12  subtest_29_score  1083 non-null   float64 
 13  subtest_28_score  1083 non-null   float64 
 14  subtest_33_score  1083 non-null   float64 
 15  subtest_30_score  1083 non-null   float64 
 16  subtest_27_score  1083 non-null   float64 
 17  subtest_32_score  1083 non-null   float64 
 18  subtest_38_score  1083 non-null   float64 
 19  subtest_37_score  1083 non-null   float64 
 20  age_bin           1083 non-null   category
dtypes: category(1), float64(12), int64(6), object(2)
memory usage: 179.0+ KB

Descriptive statistics (age, education_level):
               age  education_level
count  1083.000000      1083.000000
mean     41.863343         4.299169
std      15.241972         1.583495
min      18.000000         1.000000
25%      28.000000         3.000000
50%      41.000000         4.000000
75%      54.000000         6.000000
max      90.000000         8.000000

Gender distribution:
gender
f    546
m    537
Name: count, dtype: int64

Country distribution (top 10):
country
US    896
CA    109
AU     65
NZ     13
Name: count, dtype: int64

Age bin distribution:
age_bin
18-29    317
30-39    213
40-49    170
50-59    213
60-69    138
70-99     32
Name: count, dtype: int64
============================================================

Finished execution

