INFO 04-05 14:02:04 [__init__.py:239] Automatically detected platform cuda.
INFO 04-05 14:02:06 [api_server.py:981] vLLM API server version 0.8.2
INFO 04-05 14:02:06 [api_server.py:982] args: Namespace(subparser='serve', model_tag='/mnt/cache/sharemath/models/Qwen/Qwen2.5-VL-32B-Instruct', config='', host='0.0.0.0', port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/mnt/cache/sharemath/models/Qwen/Qwen2.5-VL-32B-Instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='bfloat16', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=4, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.8, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt={'image': 5, 'video': 5}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7ff0f6f1ea70>)
INFO 04-05 14:02:20 [config.py:585] This model supports multiple tasks: {'score', 'classify', 'embed', 'reward', 'generate'}. Defaulting to 'generate'.
INFO 04-05 14:02:20 [config.py:1519] Defaulting to use mp for distributed inference
INFO 04-05 14:02:20 [config.py:1697] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-05 14:02:31 [__init__.py:239] Automatically detected platform cuda.
INFO 04-05 14:02:35 [core.py:54] Initializing a V1 LLM engine (v0.8.2) with config: model='/mnt/cache/sharemath/models/Qwen/Qwen2.5-VL-32B-Instruct', speculative_config=None, tokenizer='/mnt/cache/sharemath/models/Qwen/Qwen2.5-VL-32B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=128000, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/mnt/cache/sharemath/models/Qwen/Qwen2.5-VL-32B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 04-05 14:02:35 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 56 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 04-05 14:02:35 [shm_broadcast.py:259] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 10485760, 10, 'psm_28ed2b57'), local_subscribe_addr='ipc:///tmp/46130e9e-968d-4e72-a2a8-ae8c1c13257c', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-05 14:02:44 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-05 14:02:49 [utils.py:2321] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f42e1483520>
(VllmWorker rank=0 pid=15421) INFO 04-05 14:02:49 [shm_broadcast.py:259] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_fe4d5b8e'), local_subscribe_addr='ipc:///tmp/3e1c5128-0931-44e0-a3bf-adc5c27b47bd', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-05 14:02:58 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-05 14:03:03 [utils.py:2321] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f64814d3550>
(VllmWorker rank=1 pid=15456) INFO 04-05 14:03:03 [shm_broadcast.py:259] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_b9df2f5b'), local_subscribe_addr='ipc:///tmp/91cedcc6-4407-471b-bf32-4280cd142f8c', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-05 14:03:12 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-05 14:03:17 [utils.py:2321] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f74bd6ff550>
(VllmWorker rank=2 pid=15506) INFO 04-05 14:03:17 [shm_broadcast.py:259] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_4bdba6e6'), local_subscribe_addr='ipc:///tmp/10080b62-aad6-49cb-9cd6-56f766fff757', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-05 14:03:26 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-05 14:03:31 [utils.py:2321] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fcbcd077490>
(VllmWorker rank=3 pid=15568) INFO 04-05 14:03:31 [shm_broadcast.py:259] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_498b8a30'), local_subscribe_addr='ipc:///tmp/de76a50a-0205-4e57-97b0-12ee01ea56bf', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=3 pid=15568) INFO 04-05 14:03:31 [utils.py:931] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=15421) INFO 04-05 14:03:31 [utils.py:931] Found nccl from library libnccl.so.2
(VllmWorker rank=2 pid=15506) INFO 04-05 14:03:31 [utils.py:931] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=15456) INFO 04-05 14:03:31 [utils.py:931] Found nccl from library libnccl.so.2
(VllmWorker rank=3 pid=15568) INFO 04-05 14:03:31 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=0 pid=15421) INFO 04-05 14:03:31 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=2 pid=15506) INFO 04-05 14:03:31 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=1 pid=15456) INFO 04-05 14:03:31 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=2 pid=15506) INFO 04-05 14:03:32 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
(VllmWorker rank=3 pid=15568) INFO 04-05 14:03:32 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
(VllmWorker rank=0 pid=15421) INFO 04-05 14:03:32 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
(VllmWorker rank=1 pid=15456) INFO 04-05 14:03:32 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
(VllmWorker rank=0 pid=15421) INFO 04-05 14:03:32 [shm_broadcast.py:259] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_1c421357'), local_subscribe_addr='ipc:///tmp/cf1b547a-c273-43ea-8595-ac4ec8a833d7', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=15456) INFO 04-05 14:03:32 [parallel_state.py:954] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1
(VllmWorker rank=2 pid=15506) INFO 04-05 14:03:32 [parallel_state.py:954] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2
(VllmWorker rank=0 pid=15421) INFO 04-05 14:03:32 [parallel_state.py:954] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorker rank=3 pid=15568) INFO 04-05 14:03:32 [parallel_state.py:954] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3
(VllmWorker rank=2 pid=15506) INFO 04-05 14:03:32 [cuda.py:220] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=15421) INFO 04-05 14:03:32 [cuda.py:220] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=15456) INFO 04-05 14:03:32 [cuda.py:220] Using Flash Attention backend on V1 engine.
(VllmWorker rank=3 pid=15568) INFO 04-05 14:03:32 [cuda.py:220] Using Flash Attention backend on V1 engine.
(VllmWorker rank=3 pid=15568) INFO 04-05 14:03:33 [gpu_model_runner.py:1174] Starting to load model /mnt/cache/sharemath/models/Qwen/Qwen2.5-VL-32B-Instruct...
(VllmWorker rank=2 pid=15506) INFO 04-05 14:03:33 [gpu_model_runner.py:1174] Starting to load model /mnt/cache/sharemath/models/Qwen/Qwen2.5-VL-32B-Instruct...
(VllmWorker rank=1 pid=15456) INFO 04-05 14:03:33 [gpu_model_runner.py:1174] Starting to load model /mnt/cache/sharemath/models/Qwen/Qwen2.5-VL-32B-Instruct...
(VllmWorker rank=0 pid=15421) INFO 04-05 14:03:33 [gpu_model_runner.py:1174] Starting to load model /mnt/cache/sharemath/models/Qwen/Qwen2.5-VL-32B-Instruct...
(VllmWorker rank=1 pid=15456) WARNING 04-05 14:03:33 [vision.py:97] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
(VllmWorker rank=3 pid=15568) WARNING 04-05 14:03:33 [vision.py:97] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
(VllmWorker rank=0 pid=15421) WARNING 04-05 14:03:33 [vision.py:97] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
(VllmWorker rank=2 pid=15506) WARNING 04-05 14:03:33 [vision.py:97] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
(VllmWorker rank=1 pid=15456) INFO 04-05 14:03:33 [config.py:3243] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=2 pid=15506) INFO 04-05 14:03:33 [config.py:3243] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=0 pid=15421) INFO 04-05 14:03:33 [config.py:3243] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=3 pid=15568) INFO 04-05 14:03:33 [config.py:3243] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=3 pid=15568) WARNING 04-05 14:03:34 [topk_topp_sampler.py:63] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=2 pid=15506) WARNING 04-05 14:03:34 [topk_topp_sampler.py:63] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=0 pid=15421) WARNING 04-05 14:03:34 [topk_topp_sampler.py:63] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=15456) WARNING 04-05 14:03:34 [topk_topp_sampler.py:63] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=3 pid=15568) INFO 04-05 14:05:13 [loader.py:447] Loading weights took 98.71 seconds
(VllmWorker rank=1 pid=15456) INFO 04-05 14:05:13 [loader.py:447] Loading weights took 98.55 seconds
(VllmWorker rank=0 pid=15421) INFO 04-05 14:05:13 [loader.py:447] Loading weights took 98.65 seconds
(VllmWorker rank=2 pid=15506) INFO 04-05 14:05:13 [loader.py:447] Loading weights took 98.65 seconds
CRITICAL 04-05 14:05:13 [multiproc_executor.py:49] MulitprocExecutor got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
CRITICAL 04-05 14:05:13 [core_client.py:269] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
