Qwen3.5-35B-A3B GPTQ 4-bit

GPTQ 4-bit quantization of Qwen/Qwen3.5-35B-A3B, a 35B-parameter Mixture-of-Experts (MoE) multimodal model with 3B activated parameters per token.

Includes the full vision encoder and the MTP (Multi-Token Prediction) module, so the quantized model keeps image understanding and speculative decoding support.

Model Overview

Architecture: Qwen3_5MoeForConditionalGeneration (multimodal: text + vision)
Total parameters: ~35B
Activated parameters: ~3B per token (8 of 256 experts selected per token)
Layers: 40 (30 linear attention + 10 full attention, repeating 3:1 pattern)
Experts: 256 per layer + 1 shared expert per layer
Context length: 262,144 tokens
Vision encoder: 27-block ViT (1152 hidden, 16x16 patches), BF16
MTP module: 1-layer speculative decoding head, BF16
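
The 3:1 layer pattern can be sketched in a few lines. This is a minimal illustration that assumes the full-attention layer closes each group of four; the exact placement and the labels `linear_attention` and `full_attention` are assumptions for illustration, not values read from the model config.

```python
from collections import Counter

NUM_LAYERS = 40           # from the overview above
FULL_ATTENTION_EVERY = 4  # assumption: every 4th layer uses full attention


def layer_types(num_layers: int = NUM_LAYERS,
                interval: int = FULL_ATTENTION_EVERY) -> list[str]:
    """Return one label per layer, with full attention closing each group."""
    return [
        "full_attention" if (i + 1) % interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


counts = Counter(layer_types())
print(counts["linear_attention"], counts["full_attention"])  # 30 10
```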

Quantization Details

All 30,720 MoE expert modules (256 experts x 3 projections x 40 layers) are quantized to INT4 using GPTQ. Non-expert modules, including the full vision encoder and the MTP module, remain in BF16/FP16 to preserve quality.

| Component | Precision | Notes |
| --- | --- | --- |
| MoE experts (`gate_proj`, `up_proj`, `down_proj`) | INT4 (GPTQ) | 30,720 modules quantized |
| Full attention (`q_proj`, `k_proj`, `v_proj`, `o_proj`) | FP16 | Every 4th layer |
| Linear attention (`in_proj_qkv`, `in_proj_z`, `out_proj`) | FP16 | Full precision |
| Shared experts | FP16 | Full precision |
| Vision encoder (`model.visual.*`) | BF16 | 333 tensors, full precision |
| MTP module (`mtp.*`) | BF16 | 785 tensors, full precision |
| Embeddings, LM head, norms | FP16 | Full precision |
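
The quantized-module count follows directly from the architecture figures. The short check below reproduces it, along with the share of modules that fell back to RTN as listed under the GPTQ configuration; the constant names are only illustrative.

```python
NUM_EXPERTS = 256                                     # routed experts per layer
PROJECTIONS = ("gate_proj", "up_proj", "down_proj")   # quantized projections
NUM_LAYERS = 40
RTN_FALLBACK_MODULES = 1350                           # failsafe count from the card

quantized_modules = NUM_EXPERTS * len(PROJECTIONS) * NUM_LAYERS
print(quantized_modules)                                  # 30720
print(f"{RTN_FALLBACK_MODULES / quantized_modules:.1%}")  # 4.4%
```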

GPTQ configuration:

Bits: 4
Group size: 32
Symmetric: Yes
desc_act: No
true_sequential: Yes
act_group_aware: Yes
Failsafe: RTN for poorly-calibrated rare experts (1,350 of 30,720 modules, ~4.4%)
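
To make the group size and the RTN failsafe concrete, here is a minimal pure-Python sketch of symmetric round-to-nearest quantization with groups of 32. It is not GPTQModel's implementation: full GPTQ additionally compensates rounding error using calibration statistics, and the scale convention used here (largest magnitude mapped to 7) is one common choice assumed for illustration.

```python
def quantize_group_rtn(weights: list[float], bits: int = 4) -> tuple[list[int], float]:
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid division by zero
    quantized = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return quantized, scale


def quantize_row(row: list[float], group_size: int = 32,
                 bits: int = 4) -> list[tuple[list[int], float]]:
    """Quantize a weight row group by group; each group gets its own scale."""
    return [
        quantize_group_rtn(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


def dequantize_row(groups: list[tuple[list[int], float]]) -> list[float]:
    """Reconstruct approximate weights from integer codes and per-group scales."""
    return [q * scale for codes, scale in groups for q in codes]


# Toy row of 64 weights, so it splits into two groups of 32.
row = [((i * 37) % 64 - 32) / 100 for i in range(64)]
groups = quantize_row(row)
restored = dequantize_row(groups)
print(len(groups), max(abs(a - b) for a, b in zip(row, restored)))
```

A smaller group size means more scales to store but a tighter fit per group, which is the trade-off behind choosing 32 rather than the more common 128.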

Calibration

Dataset: Mixed - evol-codealpaca-v1 (code) + C4 (general text)
Samples: 2,048
Quantizer: GPTQModel v5.7.1

Model Size

| Version | Size | Compression |
| --- | --- | --- |
| BF16 (original) | 67 GB | - |
| GPTQ 8-bit | 40 GB | 1.7x |
| GPTQ 4-bit | 25 GB | 2.7x |
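
The compression column is the BF16 size divided by the quantized size. A quick check using the sizes from the table (the helper name is illustrative):

```python
SIZES_GB = {"BF16 (original)": 67, "GPTQ 8-bit": 40, "GPTQ 4-bit": 25}


def compression_ratio(size_gb: float, baseline_gb: float = SIZES_GB["BF16 (original)"]) -> float:
    """Return how many times smaller a checkpoint is than the BF16 baseline."""
    return baseline_gb / size_gb


for name in ("GPTQ 8-bit", "GPTQ 4-bit"):
    print(f"{name}: {compression_ratio(SIZES_GB[name]):.1f}x")
# GPTQ 8-bit: 1.7x
# GPTQ 4-bit: 2.7x
```

The ratio stays well below the nominal 4x of a 16-bit to 4-bit conversion because only the expert weights are quantized and every group of 32 weights carries its own scale.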

Perplexity

Evaluated on wikitext-2-raw-v1 (test set), seq_len=2048, stride=512:

| Model | Perplexity | Degradation |
| --- | --- | --- |
| BF16 (original) | 6.0695 | - |
| GPTQ 8-bit | 6.0748 | +0.09% |
| GPTQ 4-bit | 6.1260 | +0.93% |
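
The card does not include the evaluation script, so the following is only a sketch of how a strided evaluation with seq_len=2048 and stride=512 is commonly set up: each window covers up to 2048 tokens, but only the tokens not already scored by the previous window contribute to the loss. The function names are illustrative. The second helper reproduces the degradation column from the raw perplexities.

```python
from typing import Iterator


def sliding_windows(total_tokens: int, seq_len: int = 2048,
                    stride: int = 512) -> Iterator[tuple[int, int, int]]:
    """Yield (begin, end, num_scored) for each evaluation window."""
    prev_end = 0
    for begin in range(0, total_tokens, stride):
        end = min(begin + seq_len, total_tokens)
        # Tokens before prev_end only serve as context and are not rescored.
        yield begin, end, end - prev_end
        prev_end = end
        if end == total_tokens:
            break


def relative_degradation(ppl: float, baseline: float) -> float:
    """Perplexity increase over the baseline, in percent."""
    return (ppl - baseline) / baseline * 100


for begin, end, scored in sliding_windows(5000):
    print(begin, end, scored)

print(f"{relative_degradation(6.0748, 6.0695):+.2f}%")  # +0.09%
print(f"{relative_degradation(6.1260, 6.0695):+.2f}%")  # +0.93%
```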

Usage

vLLM (Recommended for Serving)

vllm serve btbtyler09/Qwen3.5-35B-A3B-GPTQ-4bit \
  --gpu-memory-utilization 0.95 \
  --max-model-len 256000 \
  --tensor-parallel-size 4 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --dtype float16 \
  --skip-mm-profiling \
  --limit-mm-per-prompt '{"image": 2}'

| Parameter | Description |
| --- | --- |
| `--gpu-memory-utilization 0.95` | Use 95% of GPU VRAM for KV cache + weights |
| `--max-model-len 256000` | Full 256K context window support |
| `--tensor-parallel-size 4` | Shard across 4 GPUs (adjust to your setup) |
| `--reasoning-parser qwen3` | Enable thinking/reasoning token parsing |
| `--enable-auto-tool-choice --tool-call-parser qwen3_coder` | Enable tool/function calling |
| `--dtype float16` | Run in FP16 (required for ROCm GPTQ kernels) |
| `--skip-mm-profiling` | Skip multimodal memory profiling at startup |
| `--limit-mm-per-prompt '{"image": 2}'` | Allow up to 2 images per request |
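
With tool calling enabled on the server, a request can carry tool definitions in the OpenAI chat-completions format. The sketch below uses only the standard library and assumes the server from the command above is listening on localhost:8000; the `get_weather` tool and its schema are hypothetical and exist only for illustration.

```python
import json
import urllib.request

MODEL = "btbtyler09/Qwen3.5-35B-A3B-GPTQ-4bit"


def build_tool_request(question: str) -> dict:
    """Build a chat-completions payload that offers one (hypothetical) tool."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": question}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool, not part of the model
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "tool_choice": "auto",
        "max_tokens": 512,
    }


def send(payload: dict,
         url: str = "http://localhost:8000/v1/chat/completions") -> dict:
    """POST the payload to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)
```

Calling `send(build_tool_request("What is the weather in Paris?"))` against a running server should return a message whose `tool_calls` field is populated when the model decides to use the tool.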

vLLM bug workaround: vLLM versions up to at least 0.15.2 have a bug in `Qwen3_5MoeTextConfig` where `ignore_keys_at_rope_validation` is defined as a list instead of a set, causing a `TypeError` during config parsing. Apply this fix before serving, adjusting the `dist-packages` path to match your Python installation:

python3 -c "
for f in [
    '/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/qwen3_5_moe.py',
    '/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/qwen3_5.py',
]:
    t = open(f).read()
    t = t.replace(
        'ignore_keys_at_rope_validation\"] = [\n            \"mrope_section\",\n            \"mrope_interleaved\",\n        ]',
        'ignore_keys_at_rope_validation\"] = {\n            \"mrope_section\",\n            \"mrope_interleaved\",\n        }')
    open(f,'w').write(t)
    print('Patched', f)
"

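The same fix can be written so that it does not depend on a hard-coded install path and reports whether anything actually changed. This is a sketch under assumptions: it expects the two config files to live under `transformers_utils/configs` inside the installed `vllm` package, as in the paths above, and `patch_source` and `patch_installed_vllm` are illustrative names, not part of vLLM.

```python
import importlib.util
import re
from pathlib import Path

# Matches the list literal assigned to ignore_keys_at_rope_validation.
PATTERN = re.compile(
    r'(ignore_keys_at_rope_validation"\]\s*=\s*)\[(.*?)\]',
    re.DOTALL,
)


def patch_source(text: str) -> tuple[str, int]:
    """Turn the list literal into a set literal; return (new_text, replacements)."""
    return PATTERN.subn(lambda m: m.group(1) + "{" + m.group(2) + "}", text)


def patch_installed_vllm() -> None:
    """Patch the config files of the vllm package found in this environment."""
    spec = importlib.util.find_spec("vllm")
    if spec is None or not spec.submodule_search_locations:
        raise RuntimeError("vllm is not installed in this environment")
    config_dir = (Path(list(spec.submodule_search_locations)[0])
                  / "transformers_utils" / "configs")
    for name in ("qwen3_5_moe.py", "qwen3_5.py"):
        path = config_dir / name
        if not path.exists():
            print("Not found, skipping:", path)
            continue
        patched, count = patch_source(path.read_text())
        if count:
            path.write_text(patched)
        print(f"{path}: {count} replacement(s)")
```

Running `patch_installed_vllm()` a second time reports zero replacements, because the set literal no longer matches the pattern.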
Vision Example (via OpenAI API)

import base64

import requests

# Encode the local image as base64 so it can be sent inline as a data URL.
with open("image.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "btbtyler09/Qwen3.5-35B-A3B-GPTQ-4bit",
        "messages": [{"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            {"type": "text", "text": "Describe what you see in this image."},
        ]}],
        "max_tokens": 1024,
    },
    timeout=300,
)
response.raise_for_status()  # surface HTTP errors instead of a KeyError below
print(response.json()["choices"][0]["message"]["content"])
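
For images that are not PNG, or for requests with more than one image, a small helper can derive the MIME type from the file name and build the content parts. This is a convenience sketch; `image_to_data_url` and `image_part` are illustrative names, and the two-image limit applies only if the server was started with the `--limit-mm-per-prompt` value shown earlier.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode an image file as a data URL, guessing the MIME type from its name."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_part(path: str) -> dict:
    """Build one image_url content part for a chat-completions message."""
    return {"type": "image_url", "image_url": {"url": image_to_data_url(path)}}
```

A two-image message would then use `[image_part("a.jpg"), image_part("b.png"), {"type": "text", "text": "Compare these images."}]` as its `content`.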

GPTQModel / transformers

Note: Neither GPTQModel nor transformers can currently load this model directly. GPTQModel's Qwen3_5MoeGPTQ class expects the text-only weight prefix (model.layers.*) and does not support the multimodal architecture (model.language_model.layers.*). The transformers GPTQ path delegates to optimum, which does not handle the fused-expert architecture. Use vLLM for inference.

Technical Notes

Qwen3.5-35B-A3B stores MoE expert weights as fused 3D nn.Parameter tensors rather than individual nn.Linear modules. During quantization, GPTQModel's MODULE_CONVERTER_MAP converts these to individual quantizable nn.Linear layers. This same conversion must also run during model loading for the quantized kernels to be applied correctly.
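
The idea behind the conversion can be illustrated without any framework. The sketch below is not GPTQModel's converter: it uses nested lists in place of tensors and assumes a hypothetical fused layout of `[num_experts][2 * intermediate][hidden]` with the gate rows stored before the up rows. The real tensor names and layout may differ.

```python
Matrix = list[list[float]]


def split_fused_gate_up(fused: list[Matrix]) -> list[dict[str, Matrix]]:
    """Split a fused per-expert gate/up weight into separate 2D matrices.

    Assumed layout (hypothetical): fused[e] has 2 * intermediate rows,
    the first half belonging to gate_proj and the second half to up_proj.
    """
    experts = []
    for expert_weight in fused:  # one 2D slice per expert
        half = len(expert_weight) // 2
        experts.append({
            "gate_proj": expert_weight[:half],
            "up_proj": expert_weight[half:],
        })
    return experts


# Toy example: 2 experts, intermediate size 2, hidden size 3.
fused = [
    [[float(e * 100 + r * 10 + c) for c in range(3)] for r in range(4)]
    for e in range(2)
]
experts = split_fused_gate_up(fused)
print(len(experts), len(experts[0]["gate_proj"]), len(experts[0]["up_proj"]))  # 2 2 2
```

Each resulting 2D matrix plays the role of one linear layer's weight, which is the unit a per-module quantizer operates on.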

The vision encoder (27-block ViT) and MTP speculative decoding module are preserved at full BF16 precision from the original model. Only the text model's MoE expert weights are quantized.

Credits

Base Model: Qwen - Qwen3.5-35B-A3B
Quantization: GPTQ via GPTQModel v5.7.1
Expert Converter: convert_qwen3_5_moe_expert_converter for fused 3D expert weights
Quantized by: btbtyler09

License

This model inherits the Apache 2.0 license from the base model.
