Anyone able to fit this in 32gb vram using 2 cards?

#5
by tasticleeze - opened

Thank you for the quant! Anyone able to fit this in 32gb vram using 2 cards?

Verified working on Ubuntu with 2x RTX 5060 Ti 16 GB:
export NCCL_P2P_DISABLE=1
export NCCL_CUMEM_HOST_ENABLE=0
export MAX_JOBS=1
export NCCL_IB_DISABLE=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export VLLM_SLEEP_WHEN_IDLE=1

vllm serve /h/models/cyankiwi/Qwen3.6-27B-AWQ-BF16-INT4 \
--host :: \
--port 1234 \
--pipeline-parallel-size 1 \
--tensor-parallel-size 2 \
--served-model-name Qwen3.6 \
--quantization compressed-tensors \
--gpu-memory-utilization=0.9 \
--max-model-len=60000 \
--max-num-seqs=2 \
--block-size 32 \
--enable-chunked-prefill \
--max-num-batched-tokens=2048 \
--enable-prefix-caching \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--default-chat-template-kwargs '{"preserve_thinking":true}' \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
--language-model-only \
--skip-mm-profiling \
--compilation-config '{"cudagraph_mode": "NONE"}' \
--kv-cache-dtype fp8_e4m3 \
--cpu-offload-gb 15 \
--offload-backend prefetch \
--offload-group-size 16 \
--offload-num-in-group 3
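
For anyone sizing this against their own cards, here is a rough back-of-the-envelope sketch of the weight budget, not a measurement. It assumes 27e9 parameters (taken from "27B" in the model name), treats every weight as 4-bit (a lower bound, since the "BF16-INT4" name suggests some layers stay in BF16), and treats each card as 16 GiB. It ignores the KV cache, the MTP draft weights, activations, and runtime overhead, which is where the remaining memory goes.

```python
# Rough weight-memory estimate. All inputs are assumptions, not measured values.
GIB = 1024 ** 3


def weight_gib(num_params: float, bits_per_param: float) -> float:
    """Size of the weights in GiB if every parameter used bits_per_param bits."""
    return num_params * bits_per_param / 8 / GIB


params = 27e9            # assumed from "27B" in the model name
bits = 4                 # assumed: all weights INT4 (lower bound for a BF16/INT4 mix)
tensor_parallel = 2      # matches --tensor-parallel-size 2
vram_gib = 16            # assumed usable size of one card
utilization = 0.9        # matches --gpu-memory-utilization=0.9

total_weights = weight_gib(params, bits)
per_gpu_weights = total_weights / tensor_parallel
per_gpu_budget = vram_gib * utilization

print(f"weights total:   {total_weights:.2f} GiB")
print(f"weights per GPU: {per_gpu_weights:.2f} GiB")
print(f"budget per GPU:  {per_gpu_budget:.2f} GiB")
print(f"left per GPU for KV cache and overhead: {per_gpu_budget - per_gpu_weights:.2f} GiB")
```

Under these assumptions the weights alone fit with room to spare, so the tight part is everything else: a 60000-token context, speculative decoding, and any layers kept in BF16. The `--cpu-offload-gb 15` in the command above trades speed for that extra headroom.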
