Startup error on vLLM 0.19.x
I get an error when starting this model with the latest vllm-openai Docker image (vLLM 0.19.x). Could you give me some advice?
docker run --gpus all \
  --name qwen3.5-9b \
  -p 8200:8000 \
  -v /opt/models/Qwen/Qwen3.5-9B-GPTQ-INT8:/model \
  vllm/vllm-openai:latest \
  /model \
  --quantization gptq_marlin \
  --dtype bfloat16 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.45 \
  --max-num-seqs 64 \
  --max-model-len 131072 \
  --max-num-batched-tokens 2096 \
  --enable-prefix-caching \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000
(APIServer pid=1) INFO 04-20 07:13:26 [utils.py:299] version 0.19.1
(APIServer pid=1) INFO 04-20 07:13:26 [utils.py:299] model   /model
(APIServer pid=1) INFO 04-20 07:13:26 [utils.py:233] non-default args: {'model_tag': '/model', 'host': '0.0.0.0', 'model': '/model', 'trust_remote_code': True, 'dtype': 'bfloat16', 'max_model_len': 131072, 'quantization': 'gptq_marlin', 'gpu_memory_utilization': 0.45, 'kv_cache_dtype': 'fp8', 'enable_prefix_caching': True, 'max_num_batched_tokens': 2096, 'max_num_seqs': 64}
(APIServer pid=1) INFO 04-20 07:13:26 [model.py:549] Resolved architecture: Qwen3_5ForCausalLM
(APIServer pid=1) WARNING 04-20 07:13:26 [model.py:2016] Casting torch.float16 to torch.bfloat16.
(APIServer pid=1) INFO 04-20 07:13:26 [model.py:1678] Using max model len 131072
(APIServer pid=1) INFO 04-20 07:13:26 [gptq_marlin.py:229] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
(APIServer pid=1) INFO 04-20 07:13:26 [cache.py:227] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
(APIServer pid=1) INFO 04-20 07:13:26 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=2096.
(APIServer pid=1) WARNING 04-20 07:13:26 [config.py:441] Mamba cache mode is set to 'align' for Qwen3_5ForCausalLM by default when prefix caching is enabled
(APIServer pid=1) INFO 04-20 07:13:26 [config.py:461] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
(APIServer pid=1) Qwen2VLImageProcessorFast is deprecated. The Fast suffix for image processors has been removed; use Qwen2VLImageProcessor instead.
(APIServer pid=1) INFO 04-20 07:13:27 [config.py:281] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=1) INFO 04-20 07:13:27 [config.py:312] Padding mamba page size by 1.49% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=1) INFO 04-20 07:13:27 [vllm.py:790] Asynchronous scheduling is enabled.
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1) sys.exit(main())
(APIServer pid=1) ^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=1) args.dispatch_function(args)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1) return __asyncio.run(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1) return runner.run(main)
(APIServer pid=1) ^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 672, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 686, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 135, in __init__
(APIServer pid=1) self.renderer = renderer = renderer_from_config(self.vllm_config)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/renderers/registry.py", line 86, in renderer_from_config
(APIServer pid=1) return RENDERER_REGISTRY.load_renderer(renderer_mode, config, tokenizer)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/renderers/registry.py", line 68, in load_renderer
(APIServer pid=1) return renderer_cls(config, tokenizer)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/renderers/hf.py", line 612, in __init__
(APIServer pid=1) super().__init__(config, tokenizer)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/renderers/base.py", line 118, in __init__
(APIServer pid=1) self.mm_processor = mm_registry.create_processor(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/registry.py", line 214, in create_processor
(APIServer pid=1) return factories.build_processor(ctx, cache=cache)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/registry.py", line 95, in build_processor
(APIServer pid=1) return self.processor(info, dummy_inputs_builder, cache=cache)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/processing/processor.py", line 992, in __init__
(APIServer pid=1) self.data_parser = self.info.get_data_parser()
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_vl.py", line 706, in get_data_parser
(APIServer pid=1) self.get_hf_config().vision_config.spatial_merge_size,
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 110, in get_hf_config
(APIServer pid=1) return self.ctx.get_hf_config(Qwen3_5Config)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/processing/context.py", line 140, in get_hf_config
(APIServer pid=1) raise TypeError(
(APIServer pid=1) TypeError: Invalid type of HuggingFace config. Expected type: <class 'vllm.transformers_utils.configs.qwen3_5.Qwen3_5Config'>, but found type: <class 'transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig'>
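Hi! Thank you for the kind words, and I'm truly sorry for the late reply. I didn't notice your comment until now!
My model is text-only, but vLLM appears to treat it as a multimodal model. The traceback fails while building the multimodal processor, which expects a Qwen3_5Config (it reads vision_config.spatial_merge_size) but finds the model's Qwen3_5TextConfig. That is the type mismatch reported in the final TypeError.
It might be a good idea to upgrade vLLM.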
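To confirm that the checkpoint really is text-only before trying another vLLM version, it can help to look at what its config.json declares. The following is a minimal sketch using only the Python standard library. The helper name summarize_config is hypothetical (it is not part of vLLM or Transformers), and the sample config in the demo, including its model_type value, is made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def summarize_config(model_dir: str) -> dict:
    """Return the config.json fields relevant to the type mismatch.

    Hypothetical helper for illustration; not part of vLLM or Transformers.
    """
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return {
        "architectures": config.get("architectures", []),
        "model_type": config.get("model_type"),
        # The traceback fails while reading vision_config.spatial_merge_size,
        # so a text-only checkpoint is expected to lack this section.
        "has_vision_config": "vision_config" in config,
    }


if __name__ == "__main__":
    # Demo with a made-up, minimal config. For a real check, pass the model
    # directory instead (for example the host path mounted as /model).
    with tempfile.TemporaryDirectory() as tmp:
        sample = {
            "architectures": ["Qwen3_5ForCausalLM"],
            "model_type": "qwen3_5_text",  # assumed value, for the demo only
        }
        (Path(tmp) / "config.json").write_text(json.dumps(sample), encoding="utf-8")
        print(summarize_config(tmp))
```

If has_vision_config is False while vLLM still goes through its multimodal code path, that is consistent with the explanation above. It does not by itself show which vLLM version resolves the problem.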