========================================
📋 Task: aqua | Split: test
🔑 Keys Mapped -> Q: 'instruction', A: 'output', GT: 'answer'
========================================
🧪 Ground Truth Type Analysis (from gt_key)
========================================
Type [FLOAT]: 0 (0.00%)
Type [ABC]: 254 (100.00%)
Type [TEXT]: 0 (0.00%)
========================================
📊 Raw Text Length Analysis (chars)
========================================
Question (instruction):
  Mean : 256.45
  Std  : 84.46
  Min  : 69
  Max  : 553
  Count: 254

Output (output):
  Mean : 234.19
  Std  : 103.86
  Min  : 51
  Max  : 652
  Count: 254

========================================
🔢 Tokenized Length Analysis (tokens)
========================================
Question Tokens:
  Mean : 102.01
  Std  : 20.08
  Min  : 51
  Max  : 178
  Count: 254

Answer Tokens:
  Mean : 78.29
  Std  : 33.26
  Min  : 15
  Max  : 211
  Count: 254

Total Tokens:
  Mean : 180.30
  Std  : 43.47
  Min  : 81
  Max  : 313
  Count: 254

--------------------
--- Question Tokens (instruction) ---
<= 128: 232 (91.34%)
<= 256: 254 (100.00%)
<= 512: 254 (100.00%)
<= 1024: 254 (100.00%)

--- Answer Tokens (output) ---
<= 128: 235 (92.52%)
<= 256: 254 (100.00%)
<= 512: 254 (100.00%)
<= 1024: 254 (100.00%)

--- Total Tokens ---
<= 128: 22 (8.66%)
<= 256: 240 (94.49%)
<= 512: 254 (100.00%)
<= 1024: 254 (100.00%)
========================================