========================================
📋 Task: singleeq | Split: test
🔑 Keys Mapped -> Q: 'instruction', A: 'output', GT: 'answer'
========================================
🧪 Ground Truth Type Analysis (from gt_key)
========================================
Type [FLOAT]: 508 (100.00%)
Type [ABC]: 0 (0.00%)
Type [TEXT]: 0 (0.00%)
========================================
📊 Raw Text Length Analysis (chars)
========================================
Question (instruction):
  Mean : 142.54
  Std  : 51.04
  Min  : 62
  Max  : 364
  Count: 508

Output (output):
  Mean : 164.51
  Std  : 59.40
  Min  : 71
  Max  : 489
  Count: 508

========================================
🔢 Tokenized Length Analysis (tokens)
========================================
Question Tokens:
  Mean : 51.96
  Std  : 16.64
  Min  : 29
  Max  : 142
  Count: 508

Answer Tokens:
  Mean : 65.61
  Std  : 32.38
  Min  : 29
  Max  : 278
  Count: 508

Total Tokens:
  Mean : 117.57
  Std  : 47.07
  Min  : 59
  Max  : 409
  Count: 508

--------------------
--- Question Tokens (instruction) ---
<= 128: 505 (99.41%)
<= 256: 508 (100.00%)
<= 512: 508 (100.00%)
<= 1024: 508 (100.00%)

--- Answer Tokens (output) ---
<= 128: 480 (94.49%)
<= 256: 507 (99.80%)
<= 512: 508 (100.00%)
<= 1024: 508 (100.00%)

--- Total Tokens ---
<= 128: 382 (75.20%)
<= 256: 493 (97.05%)
<= 512: 508 (100.00%)
<= 1024: 508 (100.00%)
========================================