[2026-01-14 10:18:50] [INFO] ============================================================
[2026-01-14 10:18:50] [INFO] NHANES DATA PREPROCESSING PIPELINE
[2026-01-14 10:18:50] [INFO] Cycles: ['E', 'F', 'G', 'H', 'I', 'J']
[2026-01-14 10:18:50] [INFO] ============================================================
[2026-01-14 10:18:50] [INFO] 
--- STEP 1: Loading Data ---
[2026-01-14 10:18:50] [INFO] Loading demographic data...
[2026-01-14 10:18:50] [INFO]   Loaded DEMO_E.xpt: 10149 rows, 44 columns
[2026-01-14 10:18:50] [INFO]   Loaded DEMO_F.xpt: 10537 rows, 44 columns
[2026-01-14 10:18:50] [INFO]   Loaded DEMO_G.xpt: 9756 rows, 49 columns
[2026-01-14 10:18:50] [INFO]   Loaded DEMO_H.xpt: 10175 rows, 48 columns
[2026-01-14 10:18:50] [INFO]   Loaded DEMO_I.xpt: 9971 rows, 48 columns
[2026-01-14 10:18:50] [INFO]   Loaded DEMO_J.xpt: 9254 rows, 47 columns
[2026-01-14 10:18:50] [INFO] Demographics: 59842 unique participants
[2026-01-14 10:18:50] [INFO] Loading biochemistry data...
[2026-01-14 10:18:50] [INFO]   Loaded BIOPRO_E.xpt: 6917 rows, 38 columns
[2026-01-14 10:18:50] [INFO]   Loaded BIOPRO_F.xpt: 7369 rows, 38 columns
[2026-01-14 10:18:50] [INFO]   Loaded BIOPRO_G.xpt: 6549 rows, 39 columns
[2026-01-14 10:18:50] [INFO]   Loaded BIOPRO_H.xpt: 6979 rows, 39 columns
[2026-01-14 10:18:50] [INFO]   Loaded BIOPRO_I.xpt: 6744 rows, 39 columns
[2026-01-14 10:18:50] [INFO]   Loaded BIOPRO_J.xpt: 6401 rows, 42 columns
[2026-01-14 10:18:50] [INFO] Biochemistry: 40959 unique participants
[2026-01-14 10:18:50] [INFO] Loading CBC data...
[2026-01-14 10:18:50] [INFO]   Loaded CBC_E.xpt: 9307 rows, 22 columns
[2026-01-14 10:18:50] [INFO]   Loaded CBC_F.xpt: 9835 rows, 22 columns
[2026-01-14 10:18:50] [INFO]   Loaded CBC_G.xpt: 8956 rows, 22 columns
[2026-01-14 10:18:50] [INFO]   Loaded CBC_H.xpt: 9422 rows, 22 columns
[2026-01-14 10:18:50] [INFO]   Loaded CBC_I.xpt: 9165 rows, 22 columns
[2026-01-14 10:18:50] [INFO]   Loaded CBC_J.xpt: 8366 rows, 23 columns
[2026-01-14 10:18:50] [INFO] CBC: 55051 unique participants
[2026-01-14 10:18:50] [INFO] Loading body measurement data...
[2026-01-14 10:18:50] [INFO]   Loaded BMX_E.xpt: 9762 rows, 24 columns
[2026-01-14 10:18:50] [INFO]   Loaded BMX_F.xpt: 10253 rows, 24 columns
[2026-01-14 10:18:50] [INFO]   Loaded BMX_G.xpt: 9338 rows, 27 columns
[2026-01-14 10:18:50] [INFO]   Loaded BMX_H.xpt: 9813 rows, 27 columns
[2026-01-14 10:18:50] [INFO]   Loaded BMX_I.xpt: 9544 rows, 27 columns
[2026-01-14 10:18:50] [INFO]   Loaded BMX_J.xpt: 8704 rows, 22 columns
[2026-01-14 10:18:50] [INFO] Body measures: 57414 unique participants
[2026-01-14 10:18:50] [INFO] Loading blood pressure data...
[2026-01-14 10:18:50] [INFO]   Loaded BPX_E.xpt: 9762 rows, 28 columns
[2026-01-14 10:18:50] [INFO]   Loaded BPX_F.xpt: 10253 rows, 28 columns
[2026-01-14 10:18:51] [INFO]   Loaded BPX_G.xpt: 9338 rows, 28 columns
[2026-01-14 10:18:51] [INFO]   Loaded BPX_H.xpt: 9813 rows, 24 columns
[2026-01-14 10:18:51] [INFO]   Loaded BPX_I.xpt: 9544 rows, 22 columns
[2026-01-14 10:18:51] [INFO]   Loaded BPX_J.xpt: 8704 rows, 22 columns
[2026-01-14 10:18:51] [INFO] Blood pressure: 57414 unique participants
[2026-01-14 10:18:51] [INFO] Loading CRP data...
[2026-01-14 10:18:51] [INFO]   Loaded CRP_E.xpt: 8712 rows
[2026-01-14 10:18:51] [INFO]   Loaded CRP_F.xpt: 9211 rows
[2026-01-14 10:18:51] [INFO]   Loaded HSCRP_I.xpt: 9165 rows
[2026-01-14 10:18:51] [INFO]   Loaded HSCRP_J.xpt: 8366 rows
[2026-01-14 10:18:51] [INFO] CRP: 35454 unique participants
[2026-01-14 10:18:51] [INFO] Loading mortality data...
[2026-01-14 10:18:51] [INFO]   Loaded NHANES_2007_2008_MORT_2019_PUBLIC.dat: 10149 rows
[2026-01-14 10:18:51] [INFO]   Loaded NHANES_2009_2010_MORT_2019_PUBLIC.dat: 10537 rows
[2026-01-14 10:18:51] [INFO]   Loaded NHANES_2011_2012_MORT_2019_PUBLIC.dat: 9756 rows
[2026-01-14 10:18:51] [INFO]   Loaded NHANES_2013_2014_MORT_2019_PUBLIC.dat: 10175 rows
[2026-01-14 10:18:51] [INFO]   Loaded NHANES_2015_2016_MORT_2019_PUBLIC.dat: 9971 rows
[2026-01-14 10:18:51] [INFO]   Loaded NHANES_2017_2018_MORT_2019_PUBLIC.dat: 9254 rows
[2026-01-14 10:18:51] [INFO] 
--- STEP 2: Merging Datasets ---
[2026-01-14 10:18:51] [INFO] Merged Biochemistry: 59842 rows
[2026-01-14 10:18:51] [INFO] Merged CBC: 59842 rows
[2026-01-14 10:18:51] [INFO] Merged Body measures: 59842 rows
[2026-01-14 10:18:51] [INFO] Merged Blood pressure: 59842 rows
[2026-01-14 10:18:51] [INFO] Merged CRP: 59842 rows
[2026-01-14 10:18:51] [INFO] Combined dataset: 59842 rows, 124 columns
[2026-01-14 10:18:51] [INFO] 
--- STEP 3: Constructing Survival Outcomes ---
[2026-01-14 10:18:51] [INFO] Constructing survival outcomes...
[2026-01-14 10:18:51] [INFO] After mortality merge: 59842 rows
[2026-01-14 10:18:51] [INFO] After ELIGSTAT=1 filter: 36461 rows
[2026-01-14 10:18:52] [INFO] Survival data: 3262 deaths, 31799 censored
[2026-01-14 10:18:52] [INFO] Cause of death distribution:
[2026-01-14 10:18:52] [INFO]   1 - Heart disease: 788 (24.2%)
[2026-01-14 10:18:52] [INFO]   2 - Cancer: 778 (23.9%)
[2026-01-14 10:18:52] [INFO]   3 - Chronic lower respiratory disease: 160 (4.9%)
[2026-01-14 10:18:52] [INFO]   4 - Accidents: 93 (2.9%)
[2026-01-14 10:18:52] [INFO]   5 - Cerebrovascular disease: 170 (5.2%)
[2026-01-14 10:18:52] [INFO]   6 - Alzheimer's disease: 86 (2.6%)
[2026-01-14 10:18:52] [INFO]   7 - Diabetes: 107 (3.3%)
[2026-01-14 10:18:52] [INFO]   8 - Influenza/Pneumonia: 64 (2.0%)
[2026-01-14 10:18:52] [INFO]   9 - Nephritis/Kidney disease: 58 (1.8%)
[2026-01-14 10:18:52] [INFO]   10 - Other causes: 958 (29.4%)
[2026-01-14 10:18:52] [INFO] 
--- STEP 4: Feature Engineering ---
[2026-01-14 10:18:52] [INFO] Computed NLR (Neutrophil-Lymphocyte Ratio)
[2026-01-14 10:18:52] [INFO] Computed WHR (Waist-Hip Ratio)
[2026-01-14 10:18:52] [INFO] Computed SBP_mean from 4 readings
[2026-01-14 10:18:52] [INFO] Computed DBP_mean from 4 readings
[2026-01-14 10:18:52] [INFO] Computed MAP (Mean Arterial Pressure)
[2026-01-14 10:18:52] [INFO] Computed PP (Pulse Pressure)
[2026-01-14 10:18:52] [INFO] Computed de_ritis_ratio (AST/ALT)
[2026-01-14 10:18:52] [INFO] 
--- STEP 5: Data Cleaning ---
[2026-01-14 10:18:52] [INFO] Total feature columns: 129
[2026-01-14 10:18:52] [INFO] Winsorized 91 features
[2026-01-14 10:18:52] [INFO] Log-transformed 5 features
[2026-01-14 10:18:52] [WARNING]   Skipping LBDSATLC: 85.4% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping LBDSGTLC: 85.4% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping LBDSTBLC: 85.4% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping LBXNRBC: 85.1% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BMIWT: 96.0% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BMXRECUM: 100.0% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BMIRECUM: 100.0% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BMXHEAD: 100.0% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BMIHEAD: 100.0% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BMIHT: 96.8% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BMILEG: 95.0% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BMIARML: 96.2% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BMIARMC: 96.1% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BMIWAIST: 95.1% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BMXTRI: 68.2% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BMITRI: 96.9% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BMXSUB: 71.6% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BMISUB: 93.6% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BMDBMIC: 97.6% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BMXSAD1: 55.2% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BMXSAD2: 55.2% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BMXSAD3: 97.5% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BMXSAD4: 97.5% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BMDAVSAD: 55.2% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BMDSADCM: 96.7% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BMXHIP: 85.3% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BMIHIP: 99.3% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping PEASCCT1: 95.8% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BPXCHR: 100.0% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BPQ150A: 51.2% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BPQ150B: 51.2% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BPQ150C: 51.2% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BPQ150D: 51.2% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BPXSY4: 94.5% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BPXDI4: 94.5% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping BPAEN4: 93.0% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping LBDHRPLC: 70.1% missing (>50.0%)
[2026-01-14 10:18:52] [WARNING]   Skipping WHR: 85.3% missing (>50.0%)
[2026-01-14 10:18:52] [INFO] Imputed missing values for 96 features
[2026-01-14 10:18:52] [INFO] After dropping critical NaN: 35061 rows
[2026-01-14 10:18:52] [INFO] 
--- STEP 6: Creating Splits ---
[2026-01-14 10:18:52] [INFO] Creating stratified train/val/test splits...
[2026-01-14 10:18:52] [INFO] Split sizes - Train: 24542, Val: 5259, Test: 5260
[2026-01-14 10:18:52] [INFO] Events - Train: 2283, Val: 490, Test: 489
[2026-01-14 10:18:52] [INFO] 
--- STEP 7: Standardizing Features ---
[2026-01-14 10:18:53] [INFO] Fitted StandardScaler on 134 features
[2026-01-14 10:18:53] [INFO] Applied StandardScaler to 134 features
[2026-01-14 10:18:53] [INFO] Applied StandardScaler to 134 features
[2026-01-14 10:18:53] [INFO] 
--- STEP 8: Saving Datasets ---
[2026-01-14 10:18:53] [INFO] Saved datasets in parquet format
[2026-01-14 10:18:53] [INFO] Saved SEQN arrays
[2026-01-14 10:18:53] [INFO] 
--- STEP 9: Generating Metadata ---
[2026-01-14 10:18:53] [INFO] Saved preprocessing metadata
[2026-01-14 10:18:53] [INFO] Saved feature documentation
[2026-01-14 10:18:53] [INFO] 
============================================================
[2026-01-14 10:18:53] [INFO] PREPROCESSING COMPLETE
[2026-01-14 10:18:53] [INFO] ============================================================
[2026-01-14 10:18:53] [INFO] Train: 24542 samples, 2283 events (9.3%)
[2026-01-14 10:18:53] [INFO] Val: 5259 samples, 490 events (9.3%)
[2026-01-14 10:18:53] [INFO] Test: 5260 samples, 489 events (9.3%)
[2026-01-14 10:18:53] [INFO] Features: 134
[2026-01-14 10:18:53] [INFO] Output directory: /mnt/data3/jinxxie/ai-scientist-bio-age-lab/NHANES_processed