[2025-09-12 15:31:46] Task: math | Benchmark: AIME24 | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
[2025-09-12 15:31:56] Task: math | Benchmark: AIME24 | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
[2025-09-12 15:32:08] Task: math | Benchmark: AIME24 | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
[2025-09-12 15:32:19] Task: math | Benchmark: AIME24 | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
[2025-09-12 15:32:29] Task: math | Benchmark: AIME24 | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
[2025-09-12 15:32:41] Task: math | Benchmark: AIME24 | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
[2025-09-12 15:32:51] Task: math | Benchmark: AIME25 | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
[2025-09-12 15:33:01] Task: math | Benchmark: AIME25 | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
[2025-09-12 15:33:10] Task: math | Benchmark: AIME25 | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
[2025-09-12 15:33:23] Task: math | Benchmark: AIME25 | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
[2025-09-12 15:33:35] Task: math | Benchmark: AIME25 | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
[2025-09-12 15:33:45] Task: math | Benchmark: AIME25 | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
[2025-09-12 16:55:58] Task: math | Benchmark: AIME24 | Reasoning: true | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 1
[2025-09-12 16:56:11] Task: math | Benchmark: AIME24 | Reasoning: true | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 3
[2025-09-12 16:56:23] Task: math | Benchmark: AIME24 | Reasoning: true | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 5
[2025-09-12 16:56:32] Task: math | Benchmark: AIME24 | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 1
[2025-09-12 16:56:44] Task: math | Benchmark: AIME24 | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 3
[2025-09-12 16:56:57] Task: math | Benchmark: AIME24 | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 5
[2025-09-12 16:57:09] Task: math | Benchmark: AIME25 | Reasoning: true | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 1
[2025-09-12 16:57:18] Task: math | Benchmark: AIME25 | Reasoning: true | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 3
[2025-09-12 16:57:28] Task: math | Benchmark: AIME25 | Reasoning: true | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 5
[2025-09-12 16:57:40] Task: math | Benchmark: AIME25 | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 1
[2025-09-12 17:04:48] Task: math | Benchmark: AIME24 | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
reasoning_agent: 0.03333333333333333
tool_agent: 0.06666666666666667
[2025-09-12 17:06:09] Task: math | Benchmark: AIME24 | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
[2025-09-12 17:07:59] Task: math | Benchmark: AIME24 | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
[2025-09-12 17:08:27] Task: math | Benchmark: AIME24 | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
[2025-09-12 17:08:45] Task: math | Benchmark: AIME24 | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
reasoning_agent: 0.06666666666666667
tool_agent: 0.1
[2025-09-12 17:10:02] Task: math | Benchmark: AIME24 | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
reasoning_agent: 0.2
tool_agent: 0.23333333333333334
[2025-09-12 17:13:27] Task: math | Benchmark: AIME24 | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
reasoning_agent: 0.23333333333333334
tool_agent: 0.3
[2025-09-12 17:18:40] Task: math | Benchmark: AIME24 | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
reasoning_agent: 0.1
tool_agent: 0.1
[2025-09-12 17:19:49] Task: math | Benchmark: AIME24 | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
reasoning_agent: 0.2
tool_agent: 0.13333333333333333
[2025-09-12 17:21:39] Task: math | Benchmark: AIME24 | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
reasoning_agent: 0.16666666666666666
tool_agent: 0.13333333333333333
[2025-09-12 17:23:24] Task: math | Benchmark: AIME25 | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
reasoning_agent: 0.1
tool_agent: 0.1
[2025-09-12 17:24:41] Task: math | Benchmark: AIME25 | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
reasoning_agent: 0.3333333333333333
tool_agent: 0.26666666666666666
[2025-09-12 17:28:14] Task: math | Benchmark: AIME25 | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
reasoning_agent: 0.3333333333333333
tool_agent: 0.3
[2025-09-12 17:33:05] Task: math | Benchmark: AIME25 | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
reasoning_agent: 0.13333333333333333
tool_agent: 0.03333333333333333
[2025-09-12 17:34:02] Task: math | Benchmark: AIME25 | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
reasoning_agent: 0.1
tool_agent: 0.13333333333333333
[2025-09-12 17:36:27] Task: math | Benchmark: AIME25 | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
reasoning_agent: 0.13333333333333333
tool_agent: 0.2
[2025-09-12 17:38:49] Task: math | Benchmark: OlympiadBench_test | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
[2025-09-12 17:45:09] Task: math | Benchmark: OlympiadBench_test | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
[2025-09-12 17:45:29] Task: math | Benchmark: OlympiadBench_test | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
[2025-09-12 17:45:42] Task: math | Benchmark: OlympiadBench_test | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
[2025-09-12 17:45:55] Task: math | Benchmark: OlympiadBench_test | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
[2025-09-12 17:46:09] Task: math | Benchmark: OlympiadBench_test | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
[2025-09-12 17:46:18] Task: math | Benchmark: gsm8k_test | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
[2025-09-12 17:46:29] Task: math | Benchmark: gsm8k_test | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
[2025-09-12 17:46:39] Task: math | Benchmark: gsm8k_test | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
[2025-09-12 17:46:42] Task: math | Benchmark: AIME24 | Reasoning: true | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 1
[2025-09-12 17:46:49] Task: math | Benchmark: gsm8k_test | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
[2025-09-12 17:46:55] Task: math | Benchmark: AIME24 | Reasoning: true | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 3
[2025-09-12 17:46:59] Task: math | Benchmark: gsm8k_test | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
[2025-09-12 17:47:05] Task: math | Benchmark: AIME24 | Reasoning: true | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 5
[2025-09-12 17:47:09] Task: math | Benchmark: gsm8k_test | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
[2025-09-12 17:47:16] Task: math | Benchmark: AIME24 | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 1
[2025-09-12 17:47:19] Task: code | Benchmark: apps | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
[2025-09-12 17:47:25] Task: math | Benchmark: AIME24 | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 3
[2025-09-12 17:47:29] Task: code | Benchmark: apps | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
[2025-09-12 17:47:38] Task: math | Benchmark: AIME24 | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 5
[2025-09-12 17:47:38] Task: code | Benchmark: apps | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
[2025-09-12 17:47:48] Task: code | Benchmark: apps | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
[2025-09-12 17:47:49] Task: math | Benchmark: AIME25 | Reasoning: true | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 1
[2025-09-12 17:47:59] Task: math | Benchmark: AIME25 | Reasoning: true | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 3
[2025-09-12 17:48:00] Task: code | Benchmark: apps | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
[2025-09-12 17:48:12] Task: math | Benchmark: AIME25 | Reasoning: true | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 5
[2025-09-12 17:48:12] Task: code | Benchmark: apps | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
[2025-09-12 17:48:21] Task: math | Benchmark: AIME25 | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 1
[2025-09-12 17:48:26] Task: code | Benchmark: livecodebench | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
[2025-09-12 17:48:31] Task: math | Benchmark: AIME25 | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 3
[2025-09-12 17:48:37] Task: code | Benchmark: livecodebench | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
[2025-09-12 17:48:43] Task: math | Benchmark: AIME25 | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 5
[2025-09-12 17:48:48] Task: code | Benchmark: livecodebench | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
[2025-09-12 17:48:53] Task: math | Benchmark: OlympiadBench_test | Reasoning: true | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 1
[2025-09-12 17:48:57] Task: code | Benchmark: livecodebench | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
[2025-09-12 17:49:03] Task: math | Benchmark: OlympiadBench_test | Reasoning: true | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 3
[2025-09-12 17:49:07] Task: code | Benchmark: livecodebench | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
[2025-09-12 17:49:13] Task: math | Benchmark: OlympiadBench_test | Reasoning: true | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 5
[2025-09-12 17:49:17] Task: code | Benchmark: livecodebench | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
[2025-09-12 17:49:23] Task: math | Benchmark: OlympiadBench_test | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 1
[2025-09-12 17:49:28] Task: code | Benchmark: code_contests | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
[2025-09-12 17:49:32] Task: math | Benchmark: OlympiadBench_test | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 3
[2025-09-12 17:49:40] Task: code | Benchmark: code_contests | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
[2025-09-12 17:49:42] Task: math | Benchmark: OlympiadBench_test | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 5
[2025-09-12 17:49:50] Task: code | Benchmark: code_contests | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
[2025-09-12 17:49:54] Task: math | Benchmark: gsm8k_test | Reasoning: true | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 1
[2025-09-12 17:50:00] Task: code | Benchmark: code_contests | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
[2025-09-12 17:50:05] Task: math | Benchmark: gsm8k_test | Reasoning: true | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 3
[2025-09-12 17:50:10] Task: code | Benchmark: code_contests | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
[2025-09-12 17:50:15] Task: math | Benchmark: gsm8k_test | Reasoning: true | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 5
[2025-09-12 17:50:24] Task: math | Benchmark: gsm8k_test | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 1
[2025-09-12 17:50:27] Task: code | Benchmark: code_contests | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
[2025-09-12 17:50:36] Task: math | Benchmark: gsm8k_test | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 3
[2025-09-12 17:50:46] Task: math | Benchmark: gsm8k_test | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 5
[2025-09-12 17:50:57] Task: code | Benchmark: apps | Reasoning: true | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 1
[2025-09-12 17:51:07] Task: code | Benchmark: apps | Reasoning: true | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 3
[2025-09-12 17:51:16] Task: code | Benchmark: apps | Reasoning: true | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 5
[2025-09-12 17:51:26] Task: code | Benchmark: apps | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 1
[2025-09-12 17:51:35] Task: code | Benchmark: apps | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 3
[2025-09-12 17:51:45] Task: code | Benchmark: apps | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 5
[2025-09-12 17:51:55] Task: code | Benchmark: livecodebench | Reasoning: true | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 1
[2025-09-12 17:52:05] Task: code | Benchmark: livecodebench | Reasoning: true | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 3
[2025-09-12 17:52:15] Task: code | Benchmark: livecodebench | Reasoning: true | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 5
[2025-09-12 17:52:25] Task: code | Benchmark: livecodebench | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 1
[2025-09-12 17:52:35] Task: code | Benchmark: livecodebench | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 3
[2025-09-12 17:52:45] Task: code | Benchmark: livecodebench | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 5
[2025-09-12 17:52:54] Task: code | Benchmark: code_contests | Reasoning: true | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 1
[2025-09-12 17:53:06] Task: code | Benchmark: code_contests | Reasoning: true | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 3
[2025-09-12 17:53:15] Task: code | Benchmark: code_contests | Reasoning: true | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 5
[2025-09-12 17:53:28] Task: code | Benchmark: code_contests | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 1
[2025-09-12 17:53:35] Task: code | Benchmark: code_contests | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 3
[2025-09-12 17:53:47] Task: code | Benchmark: code_contests | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 5
[2025-09-12 17:55:39] Task: code | Benchmark: apps | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
[2025-09-12 17:57:02] Task: code | Benchmark: apps | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
[2025-09-12 17:57:13] Task: code | Benchmark: apps | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
[2025-09-12 17:57:15] Task: code | Benchmark: apps | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
code_generator: 0.122
test_generator: 0.036
[2025-09-12 18:06:42] Task: code | Benchmark: apps | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
code_generator: 0.178
test_generator: 0.048
[2025-09-12 18:42:02] Task: code | Benchmark: apps | Reasoning: true | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
[2025-09-12 18:48:39] Task: code | Benchmark: apps | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
[2025-09-12 18:48:44] Task: code | Benchmark: apps | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
[2025-09-12 18:51:12] Task: code | Benchmark: apps | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
[2025-09-12 18:51:17] Task: code | Benchmark: apps | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
[2025-09-12 18:51:20] Task: code | Benchmark: apps | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
code_generator: 0.138
test_generator: 0.054
[2025-09-12 18:53:05] Task: code | Benchmark: apps | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
code_generator: 0.13
test_generator: 0.05
[2025-09-12 18:56:44] Task: code | Benchmark: apps | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
code_generator: 0.176
test_generator: 0.082
[2025-09-12 19:02:24] Task: code | Benchmark: livecodebench | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
code_generator: 0.13714285714285715
test_generator: 0.05142857142857143
[2025-09-12 19:03:39] Task: code | Benchmark: livecodebench | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
code_generator: 0.17142857142857143
test_generator: 0.10285714285714286
[2025-09-12 19:05:55] Task: code | Benchmark: livecodebench | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
code_generator: 0.14857142857142858
test_generator: 0.08
[2025-09-12 19:09:30] Task: code | Benchmark: code_contests | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
code_generator: 0.030303030303030304
test_generator: 0.01818181818181818
[2025-09-12 19:10:51] Task: code | Benchmark: code_contests | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
code_generator: 0.03636363636363636
test_generator: 0.01818181818181818
[2025-09-12 19:13:33] Task: code | Benchmark: code_contests | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
code_generator: 0.024242424242424242
test_generator: 0.006060606060606061
[2025-09-12 19:17:34] Task: math | Benchmark: OlympiadBench_test | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
reasoning_agent: 0.45103857566765576
tool_agent: 0.22255192878338279
[2025-09-12 19:20:22] Task: math | Benchmark: OlympiadBench_test | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
reasoning_agent: 0.47032640949554894
tool_agent: 0.36350148367952523
[2025-09-12 19:26:11] Task: math | Benchmark: OlympiadBench_test | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
reasoning_agent: 0.4762611275964392
tool_agent: 0.3649851632047478
[2025-09-12 19:34:32] Task: math | Benchmark: gsm8k_test | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
reasoning_agent: 0.816
tool_agent: 0.682
[2025-09-12 19:35:55] Task: math | Benchmark: gsm8k_test | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
reasoning_agent: 0.82
tool_agent: 0.828
[2025-09-12 19:37:34] Task: math | Benchmark: gsm8k_test | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
reasoning_agent: 0.824
tool_agent: 0.82
[2025-09-13 01:46:00] Task: math | Benchmark: AIME24 | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 1
reasoning_agent: 0.13333333333333333
tool_agent: 0.2
[2025-09-13 01:46:59] Task: math | Benchmark: AIME24 | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 3
reasoning_agent: 0.16666666666666666
tool_agent: 0.23333333333333334
[2025-09-13 01:48:37] Task: math | Benchmark: AIME24 | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 5
reasoning_agent: 0.16666666666666666
tool_agent: 0.3333333333333333
[2025-09-13 01:50:46] Task: math | Benchmark: AIME25 | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 1
reasoning_agent: 0.16666666666666666
tool_agent: 0.23333333333333334
[2025-09-13 01:51:37] Task: math | Benchmark: AIME25 | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 3
reasoning_agent: 0.2
tool_agent: 0.26666666666666666
[2025-09-13 01:53:18] Task: math | Benchmark: AIME25 | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 5
reasoning_agent: 0.2
tool_agent: 0.3333333333333333
[2025-09-13 01:55:25] Task: math | Benchmark: OlympiadBench_test | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 1
reasoning_agent: 0.49258160237388726
tool_agent: 0.3916913946587537
[2025-09-13 01:58:17] Task: math | Benchmark: OlympiadBench_test | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 3
reasoning_agent: 0.5222551928783383
tool_agent: 0.47774480712166173
[2025-09-13 02:03:20] Task: math | Benchmark: OlympiadBench_test | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 5
reasoning_agent: 0.5252225519287834
tool_agent: 0.4792284866468843
[2025-09-13 02:10:55] Task: math | Benchmark: gsm8k_test | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 1
reasoning_agent: 0.926
tool_agent: 0.866
[2025-09-13 02:13:02] Task: math | Benchmark: gsm8k_test | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 3
reasoning_agent: 0.93
tool_agent: 0.952
[2025-09-13 02:14:55] Task: math | Benchmark: gsm8k_test | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 5
reasoning_agent: 0.932
tool_agent: 0.95
[2025-09-13 02:16:53] Task: code | Benchmark: apps | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 1
code_generator: 0.296
test_generator: 0.126
[2025-09-13 02:20:17] Task: code | Benchmark: apps | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 3
code_generator: 0.354
test_generator: 0.248
[2025-09-13 02:26:47] Task: code | Benchmark: apps | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 5
code_generator: 0.354
test_generator: 0.278
[2025-09-13 02:36:20] Task: code | Benchmark: livecodebench | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 1
code_generator: 0.13714285714285715
test_generator: 0.06285714285714286
[2025-09-13 02:38:39] Task: code | Benchmark: livecodebench | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 3
code_generator: 0.26285714285714284
test_generator: 0.18285714285714286
[2025-09-13 02:42:40] Task: code | Benchmark: livecodebench | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 5
code_generator: 0.2342857142857143
test_generator: 0.17714285714285713
[2025-09-13 02:49:07] Task: code | Benchmark: code_contests | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 1
code_generator: 0.09696969696969697
test_generator: 0.01818181818181818
[2025-09-13 02:51:29] Task: code | Benchmark: code_contests | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 3
code_generator: 0.13333333333333333
test_generator: 0.04242424242424243
[2025-09-13 02:56:19] Task: code | Benchmark: code_contests | Reasoning: false | Model: /home/lah003/models/Qwen3-4B | MaxTurns: 5
code_generator: 0.1696969696969697
test_generator: 0.07272727272727272

reasoning_agent: 0.2
tool_agent: 0.23333333333333334
[2025-09-13 21:05:09] Task: math | Benchmark: AIME24 | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 3
reasoning_agent: 0.23333333333333334
tool_agent: 0.3
[2025-09-13 21:07:49] Task: math | Benchmark: AIME24 | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 5
reasoning_agent: 0.2
tool_agent: 0.36666666666666664
[2025-09-13 21:11:10] Task: math | Benchmark: AIME25 | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 1
reasoning_agent: 0.13333333333333333
tool_agent: 0.16666666666666666
[2025-09-13 21:12:39] Task: math | Benchmark: AIME25 | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 3
reasoning_agent: 0.13333333333333333
tool_agent: 0.23333333333333334
[2025-09-13 21:15:18] Task: math | Benchmark: AIME25 | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 5
reasoning_agent: 0.13333333333333333
tool_agent: 0.23333333333333334
[2025-09-13 21:17:59] Task: math | Benchmark: OlympiadBench_test | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 1
reasoning_agent: 0.5192878338278932
tool_agent: 0.3249258160237389
[2025-09-13 21:22:06] Task: math | Benchmark: OlympiadBench_test | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 3
[2025-09-13 21:26:46] Task: math | Benchmark: OlympiadBench_test | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 5
[2025-09-13 21:27:08] Task: math | Benchmark: AIME24 | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 1
reasoning_agent: 0.2
tool_agent: 0.2
[2025-09-13 21:30:24] Task: math | Benchmark: AIME24 | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 3
reasoning_agent: 0.23333333333333334
tool_agent: 0.3
[2025-09-13 21:35:53] Task: math | Benchmark: AIME24 | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 5
reasoning_agent: 0.23333333333333334
tool_agent: 0.36666666666666664
[2025-09-13 21:41:19] Task: math | Benchmark: AIME25 | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 1
reasoning_agent: 0.16666666666666666
tool_agent: 0.16666666666666666
[2025-09-13 21:44:35] Task: math | Benchmark: AIME25 | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 3
reasoning_agent: 0.16666666666666666
tool_agent: 0.23333333333333334
[2025-09-13 21:47:47] Task: math | Benchmark: AIME25 | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 5
reasoning_agent: 0.23333333333333334
tool_agent: 0.36666666666666664
[2025-09-13 21:51:27] Task: math | Benchmark: OlympiadBench_test | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 1
reasoning_agent: 0.5519287833827893
tool_agent: 0.32344213649851633
[2025-09-13 21:59:07] Task: math | Benchmark: OlympiadBench_test | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 3
reasoning_agent: 0.5667655786350149
tool_agent: 0.4732937685459941
[2025-09-13 22:11:29] Task: math | Benchmark: OlympiadBench_test | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 5
reasoning_agent: 0.56973293768546
tool_agent: 0.4821958456973294
[2025-09-13 22:26:49] Task: math | Benchmark: gsm8k_test | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 1
reasoning_agent: 0.936
tool_agent: 0.924
[2025-09-13 22:31:09] Task: math | Benchmark: gsm8k_test | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 3
reasoning_agent: 0.936
tool_agent: 0.958
[2025-09-13 22:34:16] Task: math | Benchmark: gsm8k_test | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 5
reasoning_agent: 0.936
tool_agent: 0.958
[2025-09-13 22:36:40] Task: code | Benchmark: apps | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 1
code_generator: 0.302
test_generator: 0.212
[2025-09-13 22:41:30] Task: code | Benchmark: apps | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 3
code_generator: 0.444
test_generator: 0.386
[2025-09-13 22:52:43] Task: code | Benchmark: apps | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 5
code_generator: 0.448
test_generator: 0.404
[2025-09-13 23:10:30] Task: code | Benchmark: livecodebench | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 1
code_generator: 0.21714285714285714
test_generator: 0.10857142857142857
[2025-09-13 23:13:53] Task: code | Benchmark: livecodebench | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 3
code_generator: 0.2914285714285714
test_generator: 0.21142857142857144
[2025-09-13 23:22:10] Task: code | Benchmark: livecodebench | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 5
code_generator: 0.29714285714285715
test_generator: 0.25142857142857145
[2025-09-13 23:35:13] Task: code | Benchmark: code_contests | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 1
code_generator: 0.15757575757575756
test_generator: 0.05454545454545454
[2025-09-13 23:38:33] Task: code | Benchmark: code_contests | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 3
code_generator: 0.17575757575757575
test_generator: 0.10909090909090909
[2025-09-13 23:48:09] Task: code | Benchmark: code_contests | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 5
code_generator: 0.17575757575757575
test_generator: 0.09696969696969697
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.0
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.0
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.0
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.0
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.0
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.0
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.0
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.0
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.0
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.0
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.0
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.0
-----
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.0
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.0
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.0
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.0
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.0
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.0
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.0
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.0
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.0
[2025-09-14 03:05:52] Task: math | Benchmark: AIME24 | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 1
[2025-09-14 03:06:03] Task: math | Benchmark: AIME24 | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 3
[2025-09-14 03:06:14] Task: math | Benchmark: AIME24 | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 5
[2025-09-14 03:06:24] Task: math | Benchmark: AIME25 | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 1
[2025-09-14 03:06:34] Task: math | Benchmark: AIME25 | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 3
[2025-09-14 03:06:44] Task: math | Benchmark: AIME25 | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 5
[2025-09-14 03:06:53] Task: math | Benchmark: OlympiadBench_test | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 1
[2025-09-14 03:07:05] Task: math | Benchmark: OlympiadBench_test | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 3
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.0
[2025-09-14 03:07:30] Task: math | Benchmark: OlympiadBench_test | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 5
[2025-09-14 03:07:41] Task: math | Benchmark: gsm8k_test | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 1
[2025-09-14 03:07:51] Task: math | Benchmark: gsm8k_test | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 3
[2025-09-14 03:08:01] Task: math | Benchmark: gsm8k_test | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 5
[2025-09-14 03:08:12] Task: code | Benchmark: apps | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 1
[2025-09-14 03:08:22] Task: code | Benchmark: apps | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 3
[2025-09-14 03:08:32] Task: code | Benchmark: apps | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 5
[2025-09-14 03:08:42] Task: code | Benchmark: livecodebench | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 1
[2025-09-14 03:08:52] Task: code | Benchmark: livecodebench | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 3
[2025-09-14 03:09:02] Task: code | Benchmark: livecodebench | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 5
[2025-09-14 03:09:13] Task: code | Benchmark: code_contests | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 1
[2025-09-14 03:09:25] Task: code | Benchmark: code_contests | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 3
[2025-09-14 03:09:38] Task: code | Benchmark: code_contests | Reasoning: false | Model: /home/lah003/models/Qwen3-8B | MaxTurns: 5
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.0
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.0
[2025-09-14 11:51:17] Task: code | Benchmark: apps | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
[2025-09-14 11:51:17] Task: code | Benchmark: apps | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
[2025-09-14 11:51:17] Task: code | Benchmark: apps | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
[2025-09-14 11:51:17] Task: code | Benchmark: livecodebench | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
[2025-09-14 11:51:17] Task: code | Benchmark: livecodebench | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
[2025-09-14 11:51:17] Task: code | Benchmark: livecodebench | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
[2025-09-14 11:51:17] Task: code | Benchmark: code_contests | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
[2025-09-14 11:51:17] Task: code | Benchmark: code_contests | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
[2025-09-14 11:51:17] Task: code | Benchmark: code_contests | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
[2025-09-14 11:51:18] Task: math | Benchmark: OlympiadBench_test | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
[2025-09-14 11:51:18] Task: math | Benchmark: OlympiadBench_test | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
[2025-09-14 11:51:18] Task: math | Benchmark: OlympiadBench_test | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
[2025-09-14 11:51:18] Task: math | Benchmark: gsm8k_test | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
[2025-09-14 11:51:18] Task: math | Benchmark: gsm8k_test | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 3
[2025-09-14 11:51:18] Task: math | Benchmark: gsm8k_test | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 5
reasoning_agent: 0.0
tool_agent: 0.06666666666666667
reasoning_agent: 0.16666666666666666
tool_agent: 0.06666666666666667
reasoning_agent: 0.16666666666666666
tool_agent: 0.1
reasoning_agent: 0.13333333333333333
tool_agent: 0.1
reasoning_agent: 0.13333333333333333
tool_agent: 0.1
reasoning_agent: 0.23333333333333334
tool_agent: 0.36666666666666664
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.26666666666666666
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.0
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.4666666666666667
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.3333333333333333
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.3333333333333333
sample_reasoning_agent: 0.0
sample_tool_agent: 0.0
aggreted_agent: 0.4666666666666667
sample_reasoning_agent: 0.3
sample_tool_agent: 0.36666666666666664
aggreted_agent: 0.36666666666666664
sample_reasoning_agent: 0.8666666666666667
sample_tool_agent: 0.9666666666666667
aggreted_agent: 0.4
sample_reasoning_agent: 0.4
sample_tool_agent: 0.1
aggreted_agent: 0.23333333333333334
sample_reasoning_agent: 0.8
sample_tool_agent: 0.9666666666666667
aggreted_agent: 0.4
sample_reasoning_agent: 1.8333333333333333
sample_tool_agent: 1.3333333333333333
aggreted_agent: 0.3
sample_reasoning_agent: 0.9
sample_tool_agent: 0.8333333333333334
aggreted_agent: 0.3333333333333333
sample_reasoning_agent: 0.6
sample_tool_agent: 0.8666666666666667
aggreted_agent: 0.26666666666666666
sample_reasoning_agent: 0.6666666666666666
sample_tool_agent: 0.9333333333333333
aggreted_agent: 0.3333333333333333
code_generator: 0.2057142857142857
test_generator: 0.17142857142857143
code_generator: 0.17142857142857143
test_generator: 0.13142857142857142
code_generator: 0.17142857142857143
test_generator: 0.12
code_generator: 0.2
test_generator: 0.17714285714285713
code_generator: 0.17142857142857143
test_generator: 0.10285714285714286
code_generator: 0.2057142857142857
test_generator: 0.17714285714285713
sample_reasoning_agent: 0.3333333333333333
sample_tool_agent: 0.23333333333333334
aggreted_agent: 0.36666666666666664
sample_reasoning_agent: 0.43333333333333335
sample_tool_agent: 0.3333333333333333
aggreted_agent: 0.43333333333333335
sample_reasoning_agent: 0.3333333333333333
sample_tool_agent: 0.3333333333333333
aggreted_agent: 0.3
sample_reasoning_agent: 0.39166666666666666
sample_tool_agent: 0.38333333333333336
aggreted_agent: 0.375
sample_reasoning_agent: 0.09166666666666666
sample_tool_agent: 0.075
aggreted_agent: 0.09166666666666666
sample_reasoning_agent: 0.125
sample_tool_agent: 0.06666666666666667
aggreted_agent: 0.11666666666666667
sample_reasoning_agent: 0.1
sample_tool_agent: 0.05
aggreted_agent: 0.11666666666666667
sample_reasoning_agent: 0.5
sample_tool_agent: 0.5333333333333333
aggreted_agent: 0.5666666666666667
sample_reasoning_agent: 0.65
sample_tool_agent: 0.6666666666666666
aggreted_agent: 0.6
sample_reasoning_agent: 0.2
sample_tool_agent: 0.2
aggreted_agent: 0.16666666666666666
sample_reasoning_agent: 0.1
sample_tool_agent: 0.2
aggreted_agent: 0.13333333333333333
sample_reasoning_agent: 0.18333333333333332
sample_tool_agent: 0.21666666666666667
aggreted_agent: 0.15
sample_reasoning_agent: 0.18333333333333332
sample_tool_agent: 0.21666666666666667
aggreted_agent: 0.13333333333333333
sample_reasoning_agent: 1.3333333333333333
sample_tool_agent: 1.1166666666666667
aggreted_agent: 0.2833333333333333
sample_reasoning_agent: 0.6333333333333333
sample_tool_agent: 0.48333333333333334
aggreted_agent: 0.6833333333333333
sample_reasoning_agent: 1.3166666666666667
sample_tool_agent: 1.4
aggreted_agent: 0.7166666666666667
sample_reasoning_agent: 0.4166666666666667
sample_tool_agent: 0.4666666666666667
aggreted_agent: 0.26666666666666666
sample_reasoning_agent: 0.2
sample_tool_agent: 0.26666666666666666
aggreted_agent: 0.26666666666666666
sample_reasoning_agent: 0.15
sample_tool_agent: 0.23333333333333334
aggreted_agent: 0.3
sample_reasoning_agent: 0.11666666666666667
sample_tool_agent: 0.06666666666666667
aggreted_agent: 0.11666666666666667
code_generator: 0.0
test_generator: 0.0
code_generator: 0.0
test_generator: 0.0
code_generator: 0.0
test_generator: 0.0
code_generator: 0.18857142857142858
test_generator: 0.12571428571428572
code_generator: 0.17
test_generator: 0.112
code_generator: 0.162
test_generator: 0.09
code_generator: 0.030303030303030304
test_generator: 0.030303030303030304
code_generator: 0.166
test_generator: 0.11
code_generator: 0.048484848484848485
test_generator: 0.03636363636363636
code_generator: 0.03636363636363636
test_generator: 0.012121212121212121
code_generator: 0.03636363636363636
test_generator: 0.01818181818181818
code_generator: 0.176
test_generator: 0.122
code_generator: 0.07878787878787878
test_generator: 0.030303030303030304
code_generator: 0.186
test_generator: 0.142
sample_reasoning_agent: 0.13333333333333333
sample_tool_agent: 0.06666666666666667
aggreted_agent: 0.1
sample_reasoning_agent: 0.13333333333333333
sample_tool_agent: 0.06666666666666667
aggreted_agent: 0.13333333333333333
reasoning_agent: 0.13333333333333333
tool_agent: 0.06666666666666667
reasoning_agent: 0.16666666666666666
tool_agent: 0.13333333333333333
reasoning_agent: 0.16666666666666666
tool_agent: 0.06666666666666667
sample_reasoning_agent: 0.16666666666666666
sample_tool_agent: 0.08333333333333333
aggreted_agent: 0.16666666666666666
reasoning_agent: 0.15
tool_agent: 0.06666666666666667
reasoning_agent: 0.13333333333333333
tool_agent: 0.06666666666666667
reasoning_agent: 0.16666666666666666
tool_agent: 0.0
plan_agent: 0.0
tool_call_agent: 0.0
reasoning_agent: 0.16666666666666666
tool_agent: 0.1
[2025-09-15 19:58:27] Task: code | Benchmark: apps | Reasoning: false | Model: /home/lah003/models/Qwen3-1.7B | MaxTurns: 1
plan_agent: 0.0
tool_call_agent: 0.0
plan_agent: 0.69
tool_call_agent: 0.69
plan_agent: 0.7
tool_call_agent: 0.7
plan_agent: 0.56
tool_call_agent: 0.82
plan_agent: 0.23
tool_call_agent: 0.52
plan_agent: 0.04
tool_call_agent: 0.04
plan_agent: 0.18
tool_call_agent: 0.18
tool_call_agent: 0.09
plan_agent: 0.19
plan_agent: 0.18
tool_call_agent: 0.0
plan_agent: 0.33
plan_agent: 0.28
plan_agent: 0.29
tool_call_agent: 0.0
plan_agent: 0.14
tool_call_agent: 0.0
plan_agent: 0.15
tool_call_agent: 0.0
plan_agent: 0.11
tool_call_agent: 0.0
plan_agent: 0.12
plan_agent: 0.12
tool_call_agent: 0.0
plan_agent: 0.73
plan_agent: 0.46
plan_agent: 0.46
plan_agent: 0.17
tool_call_agent: 0.0
plan_agent: 0.72
plan_agent: 0.05
tool_call_agent: 0.0
plan_agent: 0.06
tool_call_agent: 0.0
plan_agent: 0.04
tool_call_agent: 0.0
plan_agent: 0.05
tool_call_agent: 0.0
plan_agent: 0.06
tool_call_agent: 0.0
plan_agent: 0.05
tool_call_agent: 0.0
plan_agent: 0.1
plan_agent: 0.1
plan_agent: 0.0
plan_agent: 0.0
plan_agent: 0.36
tool_call_agent: 0.0
plan_agent: 0.79
plan_agent: 0.0
plan_agent: 0.72
plan_agent: 0.04
tool_call_agent: 0.0
plan_agent: 0.01
tool_call_agent: 0.0
plan_agent: 0.02
tool_call_agent: 0.0
plan_agent: 0.04
tool_call_agent: 0.0
plan_agent: 0.0
tool_call_agent: 0.0
plan_agent: 0.01
tool_call_agent: 0.0
plan_agent: 0.13
tool_call_agent: 0.0
plan_agent: 0.62
plan_agent: 0.02
tool_call_agent: 0.0
plan_agent: 0.34
plan_agent: 0.0
tool_call_agent: 0.0
plan_agent: 0.01
tool_call_agent: 0.0
plan_agent: 0.59
plan_agent: 0.25
plan_agent: 0.0
tool_call_agent: 0.0
plan_agent: 0.4
