========================================
📋 Task: addsub | Split: test
🔑 Keys Mapped -> Q: 'instruction', A: 'output', GT: 'answer'
========================================
🧪 Ground Truth Type Analysis (from gt_key)
========================================
Type [FLOAT]: 395 (100.00%)
Type [ABC]: 0 (0.00%)
Type [TEXT]: 0 (0.00%)
========================================
📊 Raw Text Length Analysis (chars)
========================================
Question (instruction):
  Mean : 152.75
  Std  : 46.57
  Min  : 77
  Max  : 356
  Count: 395

Output (output):
  Mean : 169.63
  Std  : 51.55
  Min  : 92
  Max  : 365
  Count: 395

========================================
🔢 Tokenized Length Analysis (tokens)
========================================
Question Tokens:
  Mean : 55.09
  Std  : 15.94
  Min  : 34
  Max  : 136
  Count: 395

Answer Tokens:
  Mean : 67.81
  Std  : 31.22
  Min  : 36
  Max  : 206
  Count: 395

Total Tokens:
  Mean : 122.90
  Std  : 45.97
  Min  : 70
  Max  : 342
  Count: 395

--------------------
--- Question Tokens (instruction) ---
<= 128: 394 (99.75%)
<= 256: 395 (100.00%)
<= 512: 395 (100.00%)
<= 1024: 395 (100.00%)

--- Answer Tokens (output) ---
<= 128: 369 (93.42%)
<= 256: 395 (100.00%)
<= 512: 395 (100.00%)
<= 1024: 395 (100.00%)

--- Total Tokens ---
<= 128: 283 (71.65%)
<= 256: 383 (96.96%)
<= 512: 395 (100.00%)
<= 1024: 395 (100.00%)
========================================