Will it work on an RTX 5090 with 32 GB VRAM?

by arunsahu44 - opened 1 day ago

Discussion

arunsahu44

1 day ago

Will it work on an RTX 5090 with 32 GB VRAM?

klimekop6

about 24 hours ago

It will

encryptedoreo

about 18 hours ago

It would work even in FP8 (with a fairly small context size).
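For a rough sense of why FP8 would only leave room for a small context, here is a back-of-the-envelope sketch of the weight footprint. It assumes exactly 27e9 parameters (taken from the "27B" in the model name), 1 byte per parameter for FP8 and 0.5 for FP4, and it ignores quantization scales, the KV cache, activations, and the CUDA context, so real usage will be higher.

```python
# Weight-only memory estimate. All figures are approximations under the
# assumptions stated above, not measurements.
PARAMS = 27e9   # assumed from the "27B" in the model name
VRAM_GB = 32    # RTX 5090, as stated in the thread title

# Bytes per parameter, ignoring per-block scale factors.
BYTES_PER_PARAM = {"FP8": 1.0, "FP4": 0.5}


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Return the weight footprint in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


for name, nbytes in BYTES_PER_PARAM.items():
    weights = weight_gb(PARAMS, nbytes)
    print(f"{name}: weights ~{weights:.1f} GB, headroom ~{VRAM_GB - weights:.1f} GB")
```

Under these assumptions FP8 weights take about 27 GB, leaving roughly 5 GB for everything else, while FP4 weights take about 13.5 GB.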

christian-winkler-th

about 5 hours ago

edited about 5 hours ago

I tried it on an RTX 5090 and it does not work (out of memory). I used the vLLM nightly with the parameters from the model card:

vllm serve nvidia/Qwen3.6-27B-NVFP4 --port 8000 --quantization modelopt --max-model-len 262144 --reasoning-parser qwen3

Even after reducing --max-model-len to 8192, I still get an OOM. One line in the output looks suspicious:

(EngineCore pid=40152) WARNING 07-01 13:04:02 [marlin.py:34] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.

AFAIK, the RTX 5090 should support FP4 natively. Any ideas why this fails?
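One thing that might help narrow this down is checking what the runtime itself reports for the card before launching the server. The sketch below is only a diagnostic idea, not a fix: it assumes PyTorch is installed in the same environment as vLLM, and the helper names `describe_gpu` and `report` are made up for illustration. As far as I know the RTX 5090 should report compute capability 12.0, and the free-memory figure shows whether something else is already holding VRAM.

```python
def describe_gpu(capability, free_bytes, total_bytes):
    """Format compute capability and memory figures as one line."""
    major, minor = capability
    gib = 2**30
    return (f"sm_{major}{minor}, free {free_bytes / gib:.1f} GiB "
            f"of {total_bytes / gib:.1f} GiB")


def report(device=0):
    """Return a summary for the given CUDA device, or None if unavailable."""
    try:
        import torch  # assumed to be installed alongside vLLM
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    capability = torch.cuda.get_device_capability(device)
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return describe_gpu(capability, free_bytes, total_bytes)


if __name__ == "__main__":
    print(report() or "No CUDA device visible to PyTorch")
```

If the reported capability is not what you expect, or a lot of memory is already in use before vLLM starts, that would be worth including in a bug report.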

darkmatter2222

about 2 hours ago

Yes, the OG model does fit.

christian-winkler-th

about 2 hours ago

If anybody has a working configuration for the RTX 5090, could you please post it? I cannot get it working; I always get an OOM, even after reading https://www.reddit.com/r/LocalLLaMA/comments/1my3why/rtx_pro_6000_maxq_blackwell_for_llm/.
