========================================
📋 Task: multiarith | Split: test
🔑 Keys Mapped -> Q: 'instruction', A: 'output', GT: 'answer'
========================================
🧪 Ground Truth Type Analysis (from gt_key)
========================================
Type [FLOAT]: 600 (100.00%)
Type [ABC]: 0 (0.00%)
Type [TEXT]: 0 (0.00%)
========================================
📊 Raw Text Length Analysis (chars)
========================================
Question (instruction):
  Mean : 159.91
  Std  : 25.39
  Min  : 108
  Max  : 217
  Count: 600

Output (output):
  Mean : 190.56
  Std  : 35.41
  Min  : 93
  Max  : 330
  Count: 600

========================================
🔢 Tokenized Length Analysis (tokens)
========================================
Question Tokens:
  Mean : 53.31
  Std  : 5.04
  Min  : 42
  Max  : 67
  Count: 600

Answer Tokens:
  Mean : 67.88
  Std  : 9.72
  Min  : 42
  Max  : 100
  Count: 600

Total Tokens:
  Mean : 121.19
  Std  : 12.04
  Min  : 90
  Max  : 160
  Count: 600

--------------------
--- Question Tokens (instruction) ---
<= 128: 600 (100.00%)
<= 256: 600 (100.00%)
<= 512: 600 (100.00%)
<= 1024: 600 (100.00%)

--- Answer Tokens (output) ---
<= 128: 600 (100.00%)
<= 256: 600 (100.00%)
<= 512: 600 (100.00%)
<= 1024: 600 (100.00%)

--- Total Tokens ---
<= 128: 449 (74.83%)
<= 256: 600 (100.00%)
<= 512: 600 (100.00%)
<= 1024: 600 (100.00%)
========================================