Instructions to use brandonmusic/GLM-5.2-NVFP4-REAP-Recall-N172 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use brandonmusic/GLM-5.2-NVFP4-REAP-Recall-N172 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="brandonmusic/GLM-5.2-NVFP4-REAP-Recall-N172")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("brandonmusic/GLM-5.2-NVFP4-REAP-Recall-N172")
model = AutoModelForCausalLM.from_pretrained("brandonmusic/GLM-5.2-NVFP4-REAP-Recall-N172")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use brandonmusic/GLM-5.2-NVFP4-REAP-Recall-N172 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "brandonmusic/GLM-5.2-NVFP4-REAP-Recall-N172"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "brandonmusic/GLM-5.2-NVFP4-REAP-Recall-N172",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
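
For scripted use, the curl call above can be mirrored in Python. The following is a minimal sketch that uses only the standard library; the helper names `build_chat_request` and `ask` are illustrative and not part of vLLM, and the base URL and model name are simply copied from the curl example.

```python
import json
import urllib.request

# Assumed defaults, copied from the curl example; adjust to your server.
BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "brandonmusic/GLM-5.2-NVFP4-REAP-Recall-N172"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) a POST request for the chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=60.0, **kwargs):
    """Send the request and return the text of the first returned choice."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `ask("What is the capital of France?")` would send the same payload as the curl command. Passing a different `base_url` (for example one on port 30000) should let the same helper talk to any other OpenAI-compatible endpoint.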

Use Docker

docker model run hf.co/brandonmusic/GLM-5.2-NVFP4-REAP-Recall-N172

SGLang

How to use brandonmusic/GLM-5.2-NVFP4-REAP-Recall-N172 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "brandonmusic/GLM-5.2-NVFP4-REAP-Recall-N172" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "brandonmusic/GLM-5.2-NVFP4-REAP-Recall-N172",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "brandonmusic/GLM-5.2-NVFP4-REAP-Recall-N172" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "brandonmusic/GLM-5.2-NVFP4-REAP-Recall-N172",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner

How to use brandonmusic/GLM-5.2-NVFP4-REAP-Recall-N172 with Docker Model Runner:

```
docker model run hf.co/brandonmusic/GLM-5.2-NVFP4-REAP-Recall-N172
```

vLLM hangs forever at startup unless --disable-custom-all-reduce is passed

by TempleOfApshai - opened 10 days ago

Discussion

TempleOfApshai

10 days ago

Running on an AMD EPYC 9b45 with 4x RTX 6000 PRO Workstation GPUs @ 425W, using the Docker image and model shared in the model card.
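
A minimal sketch of the workaround named in the thread title: appending `--disable-custom-all-reduce` to the server command. The contents of `serve_glm52_reap_recall.sh` are not shown in this post, so the command below is a hypothetical reconstruction that uses only standard `vllm serve` flags, with values taken from the "non-default args" line of the startup log. The helper `with_workaround` is illustrative. Whether the flag resolves the hang on other hardware is not verified here.

```python
import shlex

# Flag named in the thread title; the log in this post reports
# disable_custom_all_reduce=False for the hanging run.
WORKAROUND_FLAG = "--disable-custom-all-reduce"


def with_workaround(argv):
    """Return a copy of argv with the workaround flag appended at most once."""
    argv = list(argv)
    if WORKAROUND_FLAG not in argv:
        argv.append(WORKAROUND_FLAG)
    return argv


# Hypothetical launch command (assumption): the real script may pass more
# options, such as the speculative decoding and backend settings in the log.
base_argv = [
    "vllm", "serve", "/models/GLM-5.2-NVFP4-REAP-Recall-N172",
    "--served-model-name", "glm-5.2-nvfp4",
    "--host", "0.0.0.0",
    "--port", "8080",
    "--tensor-parallel-size", "4",
    "--max-model-len", "204800",
    "--gpu-memory-utilization", "0.97",
    "--kv-cache-dtype", "fp8",
]

# Print a shell-quoted command line that could be pasted into a launch script.
print(shlex.join(with_workaround(base_argv)))
```

The helper is idempotent, so it is safe to apply to a command that already carries the flag.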

Launch:

$ ./serve_glm52_reap_recall.sh && docker logs glm52-recall -f
== preflight ==
image=verdictai/gloriousluminousmonotheism:latest model=GLM-5.2-NVFP4-REAP-Recall-N172 util=0.97 maxlen=204800 dcp=4 mtp=1 num_spec=3 max_batched=4096 moe_a16=1 gtopk=1 kv=fp8
WARNING: Published ports are discarded when using host network mode
8fcf0ca507376deaa8761d170b612fb924db64ea007646d2e6351d09d2e2657e
Launched glm52-recall (tp=4 util=0.97 maxlen=204800 dcp=4 mtp=1 num_spec=3 batched=4096)
watch boot:  docker logs -f glm52-recall
smoke test:  curl -s http://127.0.0.1:8080/v1/models
completion:  curl -s http://127.0.0.1:8080/v1/chat/completions -d '{"model":"glm-5.2-nvfp4","messages":[{"role":"user","content":"what is the capital of kentucky?"}]}'

==========
== CUDA ==
==========

CUDA Version 13.2.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

/opt/venv/lib/python3.12/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
DEBUG 06-23 12:39:21 [plugins/__init__.py:36] No plugins for group vllm.platform_plugins found.
DEBUG 06-23 12:39:21 [platforms/__init__.py:37] Checking if TPU platform is available.
DEBUG 06-23 12:39:21 [platforms/__init__.py:56] TPU platform is not available because: No module named 'libtpu'
DEBUG 06-23 12:39:21 [platforms/__init__.py:62] Checking if CUDA platform is available.
DEBUG 06-23 12:39:21 [platforms/__init__.py:85] Confirmed CUDA platform is available.
DEBUG 06-23 12:39:21 [platforms/__init__.py:113] Checking if ROCm platform is available.
DEBUG 06-23 12:39:21 [platforms/__init__.py:127] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 06-23 12:39:21 [platforms/__init__.py:134] Checking if XPU platform is available.
DEBUG 06-23 12:39:21 [platforms/__init__.py:165] Checking if CPU platform is available.
DEBUG 06-23 12:39:21 [platforms/__init__.py:62] Checking if CUDA platform is available.
DEBUG 06-23 12:39:21 [platforms/__init__.py:85] Confirmed CUDA platform is available.
DEBUG 06-23 12:39:21 [platforms/__init__.py:246] Automatically detected platform cuda.
DEBUG 06-23 12:39:25 [entrypoints/.../utils/api_utils.py:166] Setting VLLM_WORKER_MULTIPROC_METHOD to 'spawn'
DEBUG 06-23 12:39:25 [plugins/__init__.py:44] Available plugins for group vllm.general_plugins:
DEBUG 06-23 12:39:25 [plugins/__init__.py:46] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 06-23 12:39:25 [plugins/__init__.py:46] - lora_hf_hub_resolver -> vllm.plugins.lora_resolvers.hf_hub_resolver:register_hf_hub_resolver
DEBUG 06-23 12:39:25 [plugins/__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
(APIServer pid=1) INFO 06-23 12:39:25 [entrypoints/.../utils/api_utils.py:339]
(APIServer pid=1) INFO 06-23 12:39:25 [entrypoints/.../utils/api_utils.py:339]        █     █     █▄   ▄█
(APIServer pid=1) INFO 06-23 12:39:25 [entrypoints/.../utils/api_utils.py:339]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.11.2.dev279+glm52.v11.darkdevotion.a86f74e.b12x5b2e018.cu132.20260618
(APIServer pid=1) INFO 06-23 12:39:25 [entrypoints/.../utils/api_utils.py:339]   █▄█▀ █     █     █     █  model   /models/GLM-5.2-NVFP4-REAP-Recall-N172
(APIServer pid=1) INFO 06-23 12:39:25 [entrypoints/.../utils/api_utils.py:339]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 06-23 12:39:25 [entrypoints/.../utils/api_utils.py:339]
(APIServer pid=1) INFO 06-23 12:39:25 [entrypoints/.../utils/api_utils.py:273] non-default args: {'model_tag': '/models/GLM-5.2-NVFP4-REAP-Recall-N172', 'enable_auto_tool_choice': True, 'tool_call_parser': 'glm47', 'host': '0.0.0.0', 'port': 8080, 'model': '/models/GLM-5.2-NVFP4-REAP-Recall-N172', 'max_model_len': 204800, 'served_model_name': ['glm-5.2-nvfp4'], 'hf_overrides': {'use_index_cache': True, 'index_topk_pattern': 'FFFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSS'}, 'load_format': 'safetensors', 'attention_backend': 'B12X_MLA_SPARSE', 'reasoning_parser': 'glm45', 'tensor_parallel_size': 4, 'decode_context_parallel_size': 4, 'block_size': 256, 'gpu_memory_utilization': 0.97, 'kv_cache_dtype': 'fp8', 'max_num_batched_tokens': 4096, 'max_num_seqs': 4, 'enable_chunked_prefill': True, 'max_cudagraph_capture_size': 64, 'enable_flashinfer_autotune': True, 'moe_backend': 'b12x', 'speculative_config': {'method': 'mtp', 'num_speculative_tokens': 3, 'draft_sample_method': 'probabilistic', 'moe_backend': 'b12x', 'use_local_argmax_reduction': True}, 'compilation_config': {'mode': None, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': [], 'ir_enable_torch_wrap': None, 'splitting_ops': None, 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': None, 'compile_ranges_endpoints': None, 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.PIECEWISE: 1>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': None, 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': None, 'pass_config': {}, 'max_cudagraph_capture_size': None, 
'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': None, 'static_all_moe_layers': []}}
(APIServer pid=1) WARNING 06-23 12:39:25 [envs.py:2070] Unknown vLLM environment variable detected: VLLM_DCP_SHARD_DRAFT
(APIServer pid=1) WARNING 06-23 12:39:25 [envs.py:2070] Unknown vLLM environment variable detected: VLLM_RTX6K_FUSED_ALLREDUCE_ADD
(APIServer pid=1) WARNING 06-23 12:39:25 [envs.py:2070] Unknown vLLM environment variable detected: VLLM_CACHE_DIR
(APIServer pid=1) WARNING 06-23 12:39:25 [envs.py:2070] Unknown vLLM environment variable detected: VLLM_DCP_GLOBAL_TOPK
(APIServer pid=1) WARNING 06-23 12:39:25 [envs.py:2070] Unknown vLLM environment variable detected: VLLM_CPP_AR_IGNORE_CUTOFF_MAX_ROWS
(APIServer pid=1) WARNING 06-23 12:39:25 [envs.py:2070] Unknown vLLM environment variable detected: VLLM_RTX6K_FUSED_ALLREDUCE_ADD_END_BARRIER
(APIServer pid=1) WARNING 06-23 12:39:25 [envs.py:2070] Unknown vLLM environment variable detected: VLLM_MEMORY_PROFILE_INCLUDE_ATTN
(APIServer pid=1) WARNING 06-23 12:39:25 [envs.py:2070] Unknown vLLM environment variable detected: VLLM_CPP_AR_1STAGE_NCCL_CUTOFF
(APIServer pid=1) DEBUG 06-23 12:39:25 [transformers_utils/config.py:760] Overriding HF config with {'use_index_cache': True, 'index_topk_pattern': 'FFFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSS'}
(APIServer pid=1) DEBUG 06-23 12:39:25 [model_executor/models/registry.py:920] Loaded model info for class vllm.model_executor.models.deepseek_v2.GlmMoeDsaForCausalLM from cache
(APIServer pid=1) DEBUG 06-23 12:39:25 [logging_utils/log_time.py:29] Registry inspect model class: Elapsed time 0.0003187 secs
(APIServer pid=1) INFO 06-23 12:39:25 [config/model.py:598] Resolved architecture: GlmMoeDsaForCausalLM
(APIServer pid=1) INFO 06-23 12:39:25 [config/model.py:1723] Using max model len 204800
(APIServer pid=1) DEBUG 06-23 12:39:25 [utils/import_utils.py:67] Loading module triton_kernels from /opt/venv/lib/python3.12/site-packages/triton_kernels/__init__.py.
(APIServer pid=1) DEBUG 06-23 12:39:27 [compilation/decorators.py:221] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.deepseek_v2.DeepseekV2Model'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(APIServer pid=1) DEBUG 06-23 12:39:27 [compilation/decorators.py:221] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.deepseek_mtp.DeepSeekMTP'>: ['input_ids', 'positions', 'hidden_states', 'intermediate_tensors', 'inputs_embeds']
(APIServer pid=1) DEBUG 06-23 12:39:27 [config/model.py:1788] Generative models support chunked prefill.
(APIServer pid=1) DEBUG 06-23 12:39:27 [config/model.py:1846] Generative models support prefix caching.
(APIServer pid=1) DEBUG 06-23 12:39:27 [engine/arg_utils.py:2434] Enabling prefix caching by default
(APIServer pid=1) INFO 06-23 12:39:27 [config/cache.py:280] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(APIServer pid=1) DEBUG 06-23 12:39:27 [config/parallel.py:902] Defaulting to use mp for distributed inference
(APIServer pid=1) DEBUG 06-23 12:39:27 [transformers_utils/config.py:763] Overriding HF config with <function SpeculativeConfig.hf_config_override at 0x76cf316d0cc0>
(APIServer pid=1) DEBUG 06-23 12:39:27 [model_executor/models/registry.py:920] Loaded model info for class vllm.model_executor.models.deepseek_mtp.DeepSeekMTP from cache
(APIServer pid=1) DEBUG 06-23 12:39:27 [logging_utils/log_time.py:29] Registry inspect model class: Elapsed time 0.0004504 secs
(APIServer pid=1) INFO 06-23 12:39:27 [config/model.py:598] Resolved architecture: DeepSeekMTPModel
(APIServer pid=1) INFO 06-23 12:39:27 [config/model.py:1723] Using max model len 1048576
(APIServer pid=1) WARNING 06-23 12:39:27 [config/speculative.py:909] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=1) INFO 06-23 12:39:27 [config/speculative.py:1072] Overriding draft model max model len from 1048576 to 204800
(APIServer pid=1) INFO 06-23 12:39:27 [config/scheduler.py:252] Chunked prefill is enabled with max_num_batched_tokens=4096.
(APIServer pid=1) WARNING 06-23 12:39:27 [model_executor/.../quantization/modelopt.py:1028] Detected ModelOpt NVFP4 checkpoint (quant_algo=NVFP4). Please note that the format is experimental and could change in future.
(APIServer pid=1) INFO 06-23 12:39:27 [config/vllm.py:1061] Asynchronous scheduling is enabled.
(APIServer pid=1) DEBUG 06-23 12:39:27 [config/kernel.py:260] Setting platform-specific IR op priority defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native']), user-defined: IrOpPriorityConfig(rms_norm=[], fused_add_rms_norm=[])
(APIServer pid=1) INFO 06-23 12:39:27 [config/kernel.py:278] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(APIServer pid=1) WARNING 06-23 12:39:27 [config/vllm.py:1650] max_num_scheduled_tokens is set to 4096 based on the speculative decoding settings. This may lead to suboptimal performance. Consider increasing max_num_batched_tokens to accommodate the additional draft token slots, or decrease num_speculative_tokens or max_num_seqs.
(APIServer pid=1) INFO 06-23 12:39:27 [config/compilation.py:310] Enabled custom fusions: act_quant
(APIServer pid=1) DEBUG 06-23 12:39:27 [plugins/__init__.py:36] No plugins for group vllm.stat_logger_plugins found.
(APIServer pid=1) DEBUG 06-23 12:39:27 [tokenizers/registry.py:70] Loading CachedHfTokenizer for tokenizer_mode='hf'
(APIServer pid=1) DEBUG 06-23 12:39:28 [renderers/registry.py:57] Loading HfRenderer for renderer_mode='hf'
(APIServer pid=1) [transformers] The following generation flags are not valid and may be ignored: ['top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
/opt/venv/lib/python3.12/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
DEBUG 06-23 12:39:37 [plugins/__init__.py:36] No plugins for group vllm.platform_plugins found.
DEBUG 06-23 12:39:37 [platforms/__init__.py:37] Checking if TPU platform is available.
DEBUG 06-23 12:39:37 [platforms/__init__.py:56] TPU platform is not available because: No module named 'libtpu'
DEBUG 06-23 12:39:37 [platforms/__init__.py:62] Checking if CUDA platform is available.
DEBUG 06-23 12:39:37 [platforms/__init__.py:85] Confirmed CUDA platform is available.
DEBUG 06-23 12:39:37 [platforms/__init__.py:113] Checking if ROCm platform is available.
DEBUG 06-23 12:39:37 [platforms/__init__.py:127] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 06-23 12:39:37 [platforms/__init__.py:134] Checking if XPU platform is available.
DEBUG 06-23 12:39:37 [platforms/__init__.py:165] Checking if CPU platform is available.
DEBUG 06-23 12:39:37 [platforms/__init__.py:62] Checking if CUDA platform is available.
DEBUG 06-23 12:39:37 [platforms/__init__.py:85] Confirmed CUDA platform is available.
DEBUG 06-23 12:39:37 [platforms/__init__.py:246] Automatically detected platform cuda.
DEBUG 06-23 12:39:38 [utils/import_utils.py:67] Loading module triton_kernels from /opt/venv/lib/python3.12/site-packages/triton_kernels/__init__.py.
(EngineCore pid=653) DEBUG 06-23 12:39:38 [v1/engine/core.py:1132] Waiting for init message from front-end.
(APIServer pid=1) DEBUG 06-23 12:39:38 [v1/engine/utils.py:1337] HELLO from local core engine process 0.
(EngineCore pid=653) DEBUG 06-23 12:39:38 [v1/engine/core.py:1143] Received init message: EngineHandshakeMetadata(addresses=EngineZmqAddresses(inputs=['ipc:///tmp/f5082110-05d5-4329-a586-61f1a7838ccd'], outputs=['ipc:///tmp/40f6f34a-5689-4038-8c68-90c7d39506d1'], coordinator_input=None, coordinator_output=None, frontend_stats_publish_address=None), parallel_config={})
(EngineCore pid=653) DEBUG 06-23 12:39:38 [v1/engine/core.py:942] Has DP Coordinator: False, stats publish address: None
(EngineCore pid=653) DEBUG 06-23 12:39:38 [plugins/__init__.py:44] Available plugins for group vllm.general_plugins:
(EngineCore pid=653) DEBUG 06-23 12:39:38 [plugins/__init__.py:46] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
(EngineCore pid=653) DEBUG 06-23 12:39:38 [plugins/__init__.py:46] - lora_hf_hub_resolver -> vllm.plugins.lora_resolvers.hf_hub_resolver:register_hf_hub_resolver
(EngineCore pid=653) DEBUG 06-23 12:39:38 [plugins/__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
(EngineCore pid=653) INFO 06-23 12:39:38 [v1/engine/core.py:114] Initializing a V1 LLM engine (v0.11.2.dev279+glm52.v11.darkdevotion.a86f74e.b12x5b2e018.cu132.20260618) with config: model='/models/GLM-5.2-NVFP4-REAP-Recall-N172', speculative_config=SpeculativeConfig(method='mtp', model='/models/GLM-5.2-NVFP4-REAP-Recall-N172', num_spec_tokens=3), tokenizer='/models/GLM-5.2-NVFP4-REAP-Recall-N172', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=204800, download_dir=None, load_format=safetensors, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=4, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=modelopt_fp4, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='glm45', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False, jit_monitor_verbose=False), seed=0, served_model_name=glm-5.2-nvfp4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 
'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::qwen_gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::minimax_m3_sparse_attention_with_output', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update', 'vllm::minimax_m3_sparse_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [4096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.PIECEWISE: 1>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_rope_kvcache_cat_mla': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 64, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='b12x', linear_backend='auto')
(EngineCore pid=653) WARNING 06-23 12:39:38 [v1/executor/multiproc_executor.py:1053] Reducing Torch parallelism from 128 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore pid=653) INFO 06-23 12:39:38 [v1/executor/multiproc_executor.py:140] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=192.168.4.2 (local), world_size=4, local_world_size=4
(EngineCore pid=653) DEBUG 06-23 12:39:38 [distributed/device_communicators/shm_broadcast.py:402] Binding to ipc:///tmp/789c6c79-7f07-43c7-9253-523fffe51531
(EngineCore pid=653) DEBUG 06-23 12:39:38 [distributed/device_communicators/shm_broadcast.py:455] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, 'psm_b1dccc18'), local_subscribe_addr='ipc:///tmp/789c6c79-7f07-43c7-9253-523fffe51531', local_notify_addr='ipc:///tmp/2082cc14-6380-4606-b0b1-90862566cae3', remote_subscribe_addr=None, remote_addr_ipv6=False)
/opt/venv/lib/python3.12/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
/opt/venv/lib/python3.12/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
/opt/venv/lib/python3.12/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
/opt/venv/lib/python3.12/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
DEBUG 06-23 12:39:44 [plugins/__init__.py:36] No plugins for group vllm.platform_plugins found.
DEBUG 06-23 12:39:44 [platforms/__init__.py:37] Checking if TPU platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:56] TPU platform is not available because: No module named 'libtpu'
DEBUG 06-23 12:39:44 [platforms/__init__.py:62] Checking if CUDA platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:85] Confirmed CUDA platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:113] Checking if ROCm platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:127] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 06-23 12:39:44 [platforms/__init__.py:134] Checking if XPU platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:165] Checking if CPU platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:62] Checking if CUDA platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:85] Confirmed CUDA platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:246] Automatically detected platform cuda.
DEBUG 06-23 12:39:44 [plugins/__init__.py:36] No plugins for group vllm.platform_plugins found.
DEBUG 06-23 12:39:44 [platforms/__init__.py:37] Checking if TPU platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:56] TPU platform is not available because: No module named 'libtpu'
DEBUG 06-23 12:39:44 [platforms/__init__.py:62] Checking if CUDA platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:85] Confirmed CUDA platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:113] Checking if ROCm platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:127] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 06-23 12:39:44 [platforms/__init__.py:134] Checking if XPU platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:165] Checking if CPU platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:62] Checking if CUDA platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:85] Confirmed CUDA platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:246] Automatically detected platform cuda.
DEBUG 06-23 12:39:44 [plugins/__init__.py:36] No plugins for group vllm.platform_plugins found.
DEBUG 06-23 12:39:44 [platforms/__init__.py:37] Checking if TPU platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:56] TPU platform is not available because: No module named 'libtpu'
DEBUG 06-23 12:39:44 [platforms/__init__.py:62] Checking if CUDA platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:85] Confirmed CUDA platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:113] Checking if ROCm platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:127] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 06-23 12:39:44 [platforms/__init__.py:134] Checking if XPU platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:165] Checking if CPU platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:62] Checking if CUDA platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:85] Confirmed CUDA platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:246] Automatically detected platform cuda.
DEBUG 06-23 12:39:44 [plugins/__init__.py:36] No plugins for group vllm.platform_plugins found.
DEBUG 06-23 12:39:44 [platforms/__init__.py:37] Checking if TPU platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:56] TPU platform is not available because: No module named 'libtpu'
DEBUG 06-23 12:39:44 [platforms/__init__.py:62] Checking if CUDA platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:85] Confirmed CUDA platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:113] Checking if ROCm platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:127] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 06-23 12:39:44 [platforms/__init__.py:134] Checking if XPU platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:165] Checking if CPU platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:62] Checking if CUDA platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:85] Confirmed CUDA platform is available.
DEBUG 06-23 12:39:44 [platforms/__init__.py:246] Automatically detected platform cuda.
DEBUG 06-23 12:39:47 [utils/import_utils.py:67] Loading module triton_kernels from /opt/venv/lib/python3.12/site-packages/triton_kernels/__init__.py.
DEBUG 06-23 12:39:47 [utils/import_utils.py:67] Loading module triton_kernels from /opt/venv/lib/python3.12/site-packages/triton_kernels/__init__.py.
DEBUG 06-23 12:39:47 [utils/import_utils.py:67] Loading module triton_kernels from /opt/venv/lib/python3.12/site-packages/triton_kernels/__init__.py.
DEBUG 06-23 12:39:47 [utils/import_utils.py:67] Loading module triton_kernels from /opt/venv/lib/python3.12/site-packages/triton_kernels/__init__.py.
DEBUG 06-23 12:39:47 [plugins/__init__.py:44] Available plugins for group vllm.general_plugins:
DEBUG 06-23 12:39:47 [plugins/__init__.py:46] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 06-23 12:39:47 [plugins/__init__.py:46] - lora_hf_hub_resolver -> vllm.plugins.lora_resolvers.hf_hub_resolver:register_hf_hub_resolver
DEBUG 06-23 12:39:47 [plugins/__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
DEBUG 06-23 12:39:47 [plugins/__init__.py:44] Available plugins for group vllm.general_plugins:
DEBUG 06-23 12:39:47 [plugins/__init__.py:46] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 06-23 12:39:47 [plugins/__init__.py:46] - lora_hf_hub_resolver -> vllm.plugins.lora_resolvers.hf_hub_resolver:register_hf_hub_resolver
DEBUG 06-23 12:39:47 [plugins/__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
DEBUG 06-23 12:39:47 [plugins/__init__.py:44] Available plugins for group vllm.general_plugins:
DEBUG 06-23 12:39:47 [plugins/__init__.py:46] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 06-23 12:39:47 [plugins/__init__.py:46] - lora_hf_hub_resolver -> vllm.plugins.lora_resolvers.hf_hub_resolver:register_hf_hub_resolver
DEBUG 06-23 12:39:47 [plugins/__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
DEBUG 06-23 12:39:47 [plugins/__init__.py:44] Available plugins for group vllm.general_plugins:
DEBUG 06-23 12:39:47 [plugins/__init__.py:46] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 06-23 12:39:47 [plugins/__init__.py:46] - lora_hf_hub_resolver -> vllm.plugins.lora_resolvers.hf_hub_resolver:register_hf_hub_resolver
DEBUG 06-23 12:39:47 [plugins/__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
(APIServer pid=1) DEBUG 06-23 12:39:48 [v1/engine/utils.py:1239] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 06-23 12:39:49 [compilation/decorators.py:221] Inferred dynamic dimensions for forward method of <class 'vllm.models.minimax_m3.nvidia.model.MiniMaxM3Model'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
DEBUG 06-23 12:39:49 [compilation/decorators.py:221] Inferred dynamic dimensions for forward method of <class 'vllm.models.minimax_m3.nvidia.model.MiniMaxM3Model'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
DEBUG 06-23 12:39:49 [compilation/decorators.py:221] Inferred dynamic dimensions for forward method of <class 'vllm.models.minimax_m3.nvidia.model.MiniMaxM3Model'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
DEBUG 06-23 12:39:49 [compilation/decorators.py:221] Inferred dynamic dimensions for forward method of <class 'vllm.models.minimax_m3.nvidia.model.MiniMaxM3Model'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
DEBUG 06-23 12:39:49 [config/kernel.py:85] Setting IR op priority for rms_norm to ['native']
DEBUG 06-23 12:39:49 [ir/op.py:422] Priority for vllm.ir.rms_norm set to ['native']
DEBUG 06-23 12:39:49 [config/kernel.py:85] Setting IR op priority for fused_add_rms_norm to ['native']
DEBUG 06-23 12:39:49 [ir/op.py:422] Priority for vllm.ir.fused_add_rms_norm set to ['native']
(Worker pid=853) DEBUG 06-23 12:39:49 [distributed/parallel_state.py:1524] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:57305 backend=nccl
(Worker pid=853) INFO 06-23 12:39:49 [distributed/parallel_state.py:1568] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:57305 backend=nccl
[W623 12:39:49.568810363 socket.cpp:764] [c10d] The client socket cannot be initialized to connect to [localhost]:57305 (errno: 97 - Address family not supported by protocol).
DEBUG 06-23 12:39:49 [config/kernel.py:85] Setting IR op priority for rms_norm to ['native']
DEBUG 06-23 12:39:49 [ir/op.py:422] Priority for vllm.ir.rms_norm set to ['native']
DEBUG 06-23 12:39:49 [config/kernel.py:85] Setting IR op priority for fused_add_rms_norm to ['native']
DEBUG 06-23 12:39:49 [ir/op.py:422] Priority for vllm.ir.fused_add_rms_norm set to ['native']
DEBUG 06-23 12:39:49 [config/kernel.py:85] Setting IR op priority for rms_norm to ['native']
DEBUG 06-23 12:39:49 [ir/op.py:422] Priority for vllm.ir.rms_norm set to ['native']
DEBUG 06-23 12:39:49 [config/kernel.py:85] Setting IR op priority for fused_add_rms_norm to ['native']
DEBUG 06-23 12:39:49 [ir/op.py:422] Priority for vllm.ir.fused_add_rms_norm set to ['native']
DEBUG 06-23 12:39:49 [config/kernel.py:85] Setting IR op priority for rms_norm to ['native']
DEBUG 06-23 12:39:49 [ir/op.py:422] Priority for vllm.ir.rms_norm set to ['native']
DEBUG 06-23 12:39:49 [config/kernel.py:85] Setting IR op priority for fused_add_rms_norm to ['native']
DEBUG 06-23 12:39:49 [ir/op.py:422] Priority for vllm.ir.fused_add_rms_norm set to ['native']
(Worker pid=856) DEBUG 06-23 12:39:50 [distributed/parallel_state.py:1524] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:57305 backend=nccl
(Worker pid=856) INFO 06-23 12:39:50 [distributed/parallel_state.py:1568] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:57305 backend=nccl
[W623 12:39:50.041932544 socket.cpp:764] [c10d] The client socket cannot be initialized to connect to [localhost]:57305 (errno: 97 - Address family not supported by protocol).
(Worker pid=855) DEBUG 06-23 12:39:50 [distributed/parallel_state.py:1524] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:57305 backend=nccl
(Worker pid=855) INFO 06-23 12:39:50 [distributed/parallel_state.py:1568] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:57305 backend=nccl
(Worker pid=854) DEBUG 06-23 12:39:50 [distributed/parallel_state.py:1524] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:57305 backend=nccl
(Worker pid=854) INFO 06-23 12:39:50 [distributed/parallel_state.py:1568] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:57305 backend=nccl
[W623 12:39:50.089249403 socket.cpp:764] [c10d] The client socket cannot be initialized to connect to [localhost]:57305 (errno: 97 - Address family not supported by protocol).
[W623 12:39:50.090987365 socket.cpp:764] [c10d] The client socket cannot be initialized to connect to [localhost]:57305 (errno: 97 - Address family not supported by protocol).
(Worker pid=853) DEBUG 06-23 12:39:50 [distributed/parallel_state.py:1650] Detected 1 nodes in the distributed environment
(Worker pid=854) DEBUG 06-23 12:39:50 [distributed/parallel_state.py:1650] Detected 1 nodes in the distributed environment
(Worker pid=856) DEBUG 06-23 12:39:50 [distributed/parallel_state.py:1650] Detected 1 nodes in the distributed environment
(Worker pid=855) DEBUG 06-23 12:39:50 [distributed/parallel_state.py:1650] Detected 1 nodes in the distributed environment
(Worker pid=853) INFO 06-23 12:39:50 [utils/nccl.py:24] Found nccl from environment variable VLLM_NCCL_SO_PATH=/opt/libnccl-local-inference.so.2.30.4
(Worker pid=854) INFO 06-23 12:39:50 [utils/nccl.py:24] Found nccl from environment variable VLLM_NCCL_SO_PATH=/opt/libnccl-local-inference.so.2.30.4
(Worker pid=855) (Worker pid=856) INFO 06-23 12:39:50 [utils/nccl.py:24] Found nccl from environment variable VLLM_NCCL_SO_PATH=/opt/libnccl-local-inference.so.2.30.4
INFO 06-23 12:39:50 [utils/nccl.py:24] Found nccl from environment variable VLLM_NCCL_SO_PATH=/opt/libnccl-local-inference.so.2.30.4
(Worker pid=853) INFO 06-23 12:39:50 [distributed/device_communicators/pynccl.py:113] vLLM is using nccl==2.30.4
(Worker pid=856) (Worker pid=854) (Worker pid=853) (Worker pid=855) WARNING 06-23 12:39:50 [distributed/device_communicators/symm_mem.py:66] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
WARNING 06-23 12:39:50 [distributed/device_communicators/symm_mem.py:66] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
WARNING 06-23 12:39:50 [distributed/device_communicators/symm_mem.py:66] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
WARNING 06-23 12:39:50 [distributed/device_communicators/symm_mem.py:66] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
(Worker pid=853) INFO 06-23 12:39:50 [distributed/device_communicators/custom_all_reduce.py:328] b12x PCIe oneshot allreduce requested (world_size=4, physical_device_ids=[0, 1, 2, 4], fully_connected=False).
(Worker pid=856) INFO 06-23 12:39:50 [distributed/device_communicators/custom_all_reduce.py:328] b12x PCIe oneshot allreduce requested (world_size=4, physical_device_ids=[0, 1, 2, 4], fully_connected=False).
(Worker pid=854) INFO 06-23 12:39:50 [distributed/device_communicators/custom_all_reduce.py:328] b12x PCIe oneshot allreduce requested (world_size=4, physical_device_ids=[0, 1, 2, 4], fully_connected=False).
(Worker pid=855) INFO 06-23 12:39:50 [distributed/device_communicators/custom_all_reduce.py:328] b12x PCIe oneshot allreduce requested (world_size=4, physical_device_ids=[0, 1, 2, 4], fully_connected=False).
(Worker pid=853) DEBUG 06-23 12:39:50 [distributed/device_communicators/custom_all_reduce.py:193] Skipping P2P check and trusting the driver's P2P report.
(Worker pid=856) DEBUG 06-23 12:39:50 [distributed/device_communicators/custom_all_reduce.py:193] Skipping P2P check and trusting the driver's P2P report.
(Worker pid=854) DEBUG 06-23 12:39:50 [distributed/device_communicators/custom_all_reduce.py:193] Skipping P2P check and trusting the driver's P2P report.
(Worker pid=855) DEBUG 06-23 12:39:50 [distributed/device_communicators/custom_all_reduce.py:193] Skipping P2P check and trusting the driver's P2P report.
(APIServer pid=1) DEBUG 06-23 12:39:58 [v1/engine/utils.py:1239] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=1) DEBUG 06-23 12:40:08 [v1/engine/utils.py:1239] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=1) DEBUG 06-23 12:40:18 [v1/engine/utils.py:1239] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=1) DEBUG 06-23 12:40:28 [v1/engine/utils.py:1239] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=1) DEBUG 06-23 12:40:38 [v1/engine/utils.py:1239] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=1) DEBUG 06-23 12:40:48 [v1/engine/utils.py:1239] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=1) DEBUG 06-23 12:40:58 [v1/engine/utils.py:1239] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=1) DEBUG 06-23 12:41:08 [v1/engine/utils.py:1239] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=1) DEBUG 06-23 12:41:18 [v1/engine/utils.py:1239] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=1) DEBUG 06-23 12:41:28 [v1/engine/utils.py:1239] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=1) DEBUG 06-23 12:41:38 [v1/engine/utils.py:1239] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=1) DEBUG 06-23 12:41:48 [v1/engine/utils.py:1239] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=1) DEBUG 06-23 12:41:58 [v1/engine/utils.py:1239] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=1) DEBUG 06-23 12:42:08 [v1/engine/utils.py:1239] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=1) DEBUG 06-23 12:42:18 [v1/engine/utils.py:1239] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=1) DEBUG 06-23 12:42:28 [v1/engine/utils.py:1239] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=1) DEBUG 06-23 12:42:38 [v1/engine/utils.py:1239] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=1) DEBUG 06-23 12:42:48 [v1/engine/utils.py:1239] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=1) DEBUG 06-23 12:42:58 [v1/engine/utils.py:1239] Waiting for 1 local, 0 remote core engine proc(s) to start.

Passing --disable-custom-all-reduce on the vLLM command line works around the problem: vLLM then starts up correctly and everything works.
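
For anyone hitting the same hang, here is a minimal sketch of the workaround invocation. The tensor-parallel size of 4 is an assumption inferred from world_size=4 in the log above, and the MODEL value is a placeholder; keep whatever other flags your original command used.

```shell
# Placeholder: substitute the model ID or local path you are actually serving.
MODEL="<model-id-or-path>"

# --tensor-parallel-size 4 is assumed from world_size=4 in the log above.
# --disable-custom-all-reduce skips vLLM's custom all-reduce kernel and
# falls back to NCCL for all-reduce across the tensor-parallel workers.
vllm serve "$MODEL" \
  --tensor-parallel-size 4 \
  --disable-custom-all-reduce
```

With the flag set, the "allreduce requested" lines from custom_all_reduce.py should no longer show up during startup, which is a quick way to confirm the custom path is really disabled.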

Any idea why this might happen?
