2025-08-17 01:38:42 - eval_util - INFO - Using naive RTN quantization.
2025-08-17 01:38:42 - eval_util - INFO - Quantizing activation with bitwidth 4
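For context, "naive RTN" is round-to-nearest quantization. The run's actual quantizer is not shown in this log; the sketch below is only an illustration of the general technique under the usual assumptions (symmetric per-tensor scale, signed 4-bit range), not eval_util's implementation:

```python
def rtn_quantize(xs, bits=4):
    # Naive RTN: pick a symmetric per-tensor scale from the max magnitude,
    # round each value to the nearest integer level, then dequantize.
    qmax = 2 ** (bits - 1) - 1                  # 7 for signed 4-bit
    scale = max(abs(v) for v in xs) / qmax
    return [
        max(-qmax - 1, min(qmax, round(v / scale))) * scale
        for v in xs
    ]
```

Round-to-nearest introduces at most half a quantization step of error per value, which is why 4-bit activation quantization degrades perplexity only modestly here (6.59 on wikitext).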
2025-08-17 01:38:43 - accelerate.utils.modeling - INFO - We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` to a higher value to use more memory (at your own risk).
2025-08-17 01:41:44 - accelerate.utils.modeling - INFO - We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` to a higher value to use more memory (at your own risk).
2025-08-17 01:41:44 - eval_util - INFO - Evaluating wikitext...
2025-08-17 01:42:53 - eval_util - INFO - Wikitext completed: 6.5926737785339355
2025-08-17 01:42:53 - eval_util - INFO - Starting evaluation of task: boolq
2025-08-17 01:42:53 - eval_util - INFO - GPU Memory before boolq: 3.81GB
2025-08-17 01:42:53 - lm_eval.models.huggingface - WARNING - `pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
2025-08-17 01:42:53 - lm_eval.models.huggingface - WARNING - Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
2025-08-17 01:42:53 - lm_eval.evaluator - INFO - Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-08-17 01:42:53 - lm_eval.evaluator - INFO - Using pre-initialized model
2025-08-17 01:43:08 - lm_eval.api.task - WARNING - [Task: boolq] metric acc is defined, but aggregation is not. using default aggregation=mean
2025-08-17 01:43:08 - lm_eval.api.task - WARNING - [Task: boolq] metric acc is defined, but higher_is_better is not. using default higher_is_better=True
2025-08-17 01:43:12 - lm_eval.api.task - INFO - Building contexts for boolq on rank 0...
2025-08-17 01:43:13 - lm_eval.evaluator - INFO - Running loglikelihood requests
2025-08-17 01:46:46 - eval_util - INFO - ✅ Task boolq completed successfully
2025-08-17 01:46:46 - eval_util - INFO - GPU Memory after boolq: 3.81GB
2025-08-17 01:46:46 - eval_util - INFO - Starting evaluation of task: hellaswag
2025-08-17 01:46:46 - eval_util - INFO - GPU Memory before hellaswag: 3.81GB
2025-08-17 01:46:46 - lm_eval.models.huggingface - WARNING - `pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
2025-08-17 01:46:46 - lm_eval.models.huggingface - WARNING - Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
2025-08-17 01:46:46 - lm_eval.evaluator - INFO - Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-08-17 01:46:46 - lm_eval.evaluator - INFO - Using pre-initialized model
2025-08-17 01:47:04 - lm_eval.api.task - INFO - Building contexts for hellaswag on rank 0...
2025-08-17 01:47:08 - lm_eval.evaluator - INFO - Running loglikelihood requests
2025-08-17 02:20:40 - eval_util - INFO - ✅ Task hellaswag completed successfully
2025-08-17 02:20:40 - eval_util - INFO - GPU Memory after hellaswag: 3.81GB
2025-08-17 02:20:40 - eval_util - INFO - Starting evaluation of task: winogrande
2025-08-17 02:20:40 - eval_util - INFO - GPU Memory before winogrande: 3.81GB
2025-08-17 02:20:40 - lm_eval.models.huggingface - WARNING - `pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
2025-08-17 02:20:40 - lm_eval.models.huggingface - WARNING - Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
2025-08-17 02:20:40 - lm_eval.evaluator - INFO - Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-08-17 02:20:40 - lm_eval.evaluator - INFO - Using pre-initialized model
2025-08-17 02:20:56 - lm_eval.api.task - INFO - Building contexts for winogrande on rank 0...
2025-08-17 02:20:57 - lm_eval.evaluator - INFO - Running loglikelihood requests
2025-08-17 02:22:45 - eval_util - INFO - ✅ Task winogrande completed successfully
2025-08-17 02:22:45 - eval_util - INFO - GPU Memory after winogrande: 3.81GB
2025-08-17 02:22:45 - eval_util - INFO - Starting evaluation of task: arc_easy
2025-08-17 02:22:45 - eval_util - INFO - GPU Memory before arc_easy: 3.81GB
2025-08-17 02:22:45 - lm_eval.models.huggingface - WARNING - `pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
2025-08-17 02:22:45 - lm_eval.models.huggingface - WARNING - Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
2025-08-17 02:22:45 - lm_eval.evaluator - INFO - Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-08-17 02:22:45 - lm_eval.evaluator - INFO - Using pre-initialized model
2025-08-17 02:22:59 - lm_eval.api.task - INFO - Building contexts for arc_easy on rank 0...
2025-08-17 02:23:01 - lm_eval.evaluator - INFO - Running loglikelihood requests
2025-08-17 02:29:31 - eval_util - INFO - ✅ Task arc_easy completed successfully
2025-08-17 02:29:31 - eval_util - INFO - GPU Memory after arc_easy: 3.81GB
2025-08-17 02:29:31 - eval_util - INFO - Starting evaluation of task: arc_challenge
2025-08-17 02:29:31 - eval_util - INFO - GPU Memory before arc_challenge: 3.81GB
2025-08-17 02:29:31 - lm_eval.models.huggingface - WARNING - `pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
2025-08-17 02:29:31 - lm_eval.models.huggingface - WARNING - Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
2025-08-17 02:29:31 - lm_eval.evaluator - INFO - Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-08-17 02:29:31 - lm_eval.evaluator - INFO - Using pre-initialized model
2025-08-17 02:29:44 - lm_eval.api.task - INFO - Building contexts for arc_challenge on rank 0...
2025-08-17 02:29:45 - lm_eval.evaluator - INFO - Running loglikelihood requests
2025-08-17 02:33:07 - eval_util - INFO - ✅ Task arc_challenge completed successfully
2025-08-17 02:33:07 - eval_util - INFO - GPU Memory after arc_challenge: 3.81GB
2025-08-17 02:33:07 - eval_util - INFO - Evaluation Summary - Successful: 6, Failed: 0
2025-08-17 02:33:07 - eval_util - INFO - result:
2025-08-17 02:33:07 - eval_util - INFO - 	{'arc_challenge': {'acc,none': 0.4197952218430034,
2025-08-17 02:33:07 - eval_util - INFO - 	                   'acc_norm,none': 0.4334470989761092,
2025-08-17 02:33:07 - eval_util - INFO - 	                   'acc_norm_stderr,none': 0.0144813762245589,
2025-08-17 02:33:07 - eval_util - INFO - 	                   'acc_stderr,none': 0.014422181226303026,
2025-08-17 02:33:07 - eval_util - INFO - 	                   'alias': 'arc_challenge'},
2025-08-17 02:33:07 - eval_util - INFO - 	 'arc_easy': {'acc,none': 0.7495791245791246,
2025-08-17 02:33:07 - eval_util - INFO - 	              'acc_norm,none': 0.7095959595959596,
2025-08-17 02:33:07 - eval_util - INFO - 	              'acc_norm_stderr,none': 0.009314833302936285,
2025-08-17 02:33:07 - eval_util - INFO - 	              'acc_stderr,none': 0.008890213675113957,
2025-08-17 02:33:07 - eval_util - INFO - 	              'alias': 'arc_easy'},
2025-08-17 02:33:07 - eval_util - INFO - 	 'boolq': {'acc,none': 0.7299694189602447,
2025-08-17 02:33:07 - eval_util - INFO - 	           'acc_stderr,none': 0.007765176800187589,
2025-08-17 02:33:07 - eval_util - INFO - 	           'alias': 'boolq'},
2025-08-17 02:33:07 - eval_util - INFO - 	 'hellaswag': {'acc,none': 0.5633339972117108,
2025-08-17 02:33:07 - eval_util - INFO - 	               'acc_norm,none': 0.7550288787094205,
2025-08-17 02:33:07 - eval_util - INFO - 	               'acc_norm_stderr,none': 0.004291911350430638,
2025-08-17 02:33:07 - eval_util - INFO - 	               'acc_stderr,none': 0.004949589567678914,
2025-08-17 02:33:07 - eval_util - INFO - 	               'alias': 'hellaswag'},
2025-08-17 02:33:07 - eval_util - INFO - 	 'wikitext': 6.5926737785339355,
2025-08-17 02:33:07 - eval_util - INFO - 	 'winogrande': {'acc,none': 0.6858721389108129,
2025-08-17 02:33:07 - eval_util - INFO - 	                'acc_stderr,none': 0.01304541671607257,
2025-08-17 02:33:07 - eval_util - INFO - 	                'alias': 'winogrande'}}
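One convenient way to summarize the dict above is to average the plain `acc,none` scores of the five accuracy-style tasks (wikitext is a perplexity-style score and is excluded; note `acc_norm` is often preferred for ARC and HellaSwag). The snippet below is not part of eval_util's output, just a quick post-hoc summary of the numbers printed above:

```python
# `acc,none` values copied from the result dict in the log above.
results = {
    "arc_challenge": 0.4197952218430034,
    "arc_easy": 0.7495791245791246,
    "boolq": 0.7299694189602447,
    "hellaswag": 0.5633339972117108,
    "winogrande": 0.6858721389108129,
}

mean_acc = sum(results.values()) / len(results)
print(f"mean acc,none over {len(results)} tasks: {mean_acc:.4f}")
```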
