Will it work on an RTX 5090 (32 GB VRAM)?

#5
by arunsahu44 - opened

Will it work on an RTX 5090 with 32 GB of VRAM?

It will.

It would work even in FP8 (with a pretty low context size).
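For anyone who wants to sanity-check the "fits in FP8 with a small context" claim, here is a rough back-of-the-envelope sketch. It only counts weight storage and rests on assumptions: 27B is taken from the model name, 32 GB is treated as an optimistic 32 GiB, and NVFP4 is approximated as 4.5 bits per weight (4-bit values plus one 8-bit scale per block of 16). Real checkpoints often keep some layers in higher precision, and the KV cache, activations, and CUDA context all need room on top of this.

```python
def weight_gib(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB for a given average bit width."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 2**30


PARAMS_B = 27.0  # assumption: nominal parameter count from the model name
VRAM_GIB = 32.0  # assumption: optimistic upper bound for a "32 GB" card

# Average bits per weight; the NVFP4 figure is an approximation (see above).
formats = {"BF16": 16.0, "FP8": 8.0, "NVFP4 (approx.)": 4.5}

for name, bits in formats.items():
    weights = weight_gib(PARAMS_B, bits)
    headroom = VRAM_GIB - weights  # negative means the weights alone do not fit
    print(f"{name}: ~{weights:.1f} GiB weights, ~{headroom:.1f} GiB headroom")
```

Under these assumptions FP8 leaves only a few GiB for everything else, which matches the "pretty low context size" caveat, while a 4-bit checkpoint should leave considerably more.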

I tried it on an RTX 5090 and it does not work (out of memory). I used the vLLM nightly build and the parameters from the model card:

vllm serve nvidia/Qwen3.6-27B-NVFP4 --port 8000 --quantization modelopt --max-model-len 262144 --reasoning-parser qwen3

Even after reducing --max-model-len to 8192, I still get an OOM. One line in the output looks suspicious:

(EngineCore pid=40152) WARNING 07-01 13:04:02 [marlin.py:34] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.

AFAIK, the RTX 5090 should support FP4. Any ideas why this fails?

Yes, the OG model does fit.

If anybody has a working configuration for the RTX 5090, could you please post it? I cannot get it working; I always get an OOM (even after reading https://www.reddit.com/r/LocalLLaMA/comments/1my3why/rtx_pro_6000_maxq_blackwell_for_llm/)
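This is not a confirmed working configuration, just a sketch of the memory-related knobs one would typically try first. It reuses the model card command from above and adds three standard vLLM options: --gpu-memory-utilization, --max-num-seqs, and --enforce-eager. The helper name build_vllm_command and the specific values (0.85, 1 sequence) are illustrative assumptions, not something tested on an RTX 5090.

```python
import shlex


def build_vllm_command(
    model: str,
    max_model_len: int = 8192,
    gpu_memory_utilization: float = 0.85,  # assumption: leave room for the desktop/driver
    max_num_seqs: int = 1,                 # assumption: single request at a time
    enforce_eager: bool = True,            # skip CUDA graph capture to save memory
) -> list[str]:
    """Build a `vllm serve` argument list with conservative memory settings."""
    args = [
        "vllm", "serve", model,
        "--port", "8000",
        "--quantization", "modelopt",
        "--reasoning-parser", "qwen3",
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-num-seqs", str(max_num_seqs),
    ]
    if enforce_eager:
        args.append("--enforce-eager")
    return args


cmd = build_vllm_command("nvidia/Qwen3.6-27B-NVFP4")
print(shlex.join(cmd))  # copy-pasteable shell command
```

If this still runs out of memory, the Marlin warning quoted earlier is probably the more useful lead, since it suggests the native FP4 path is not being selected for this GPU.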
