How to use with vLLM
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "shieldstar/Qwen3.5-122B-A10B-int4-AutoRound-EC"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "shieldstar/Qwen3.5-122B-A10B-int4-AutoRound-EC",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
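The same request can be sent from Python. The following is a minimal sketch using only the standard library, assuming the server started by `vllm serve` is listening on localhost:8000 as in the curl example. The helper names (`build_payload`, `build_request`, `describe_image`) and the 300-second timeout are illustrative choices, not part of vLLM.

```python
import json
import urllib.request

# Values copied from the curl example above.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL_ID = "shieldstar/Qwen3.5-122B-A10B-int4-AutoRound-EC"


def build_payload(prompt: str, image_url: str) -> dict:
    """Build the same chat-completions body that the curl example sends."""
    return {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def build_request(payload: dict, url: str = BASE_URL) -> urllib.request.Request:
    """Wrap the payload in a POST request with a JSON content type."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def describe_image(prompt: str, image_url: str, timeout: float = 300.0) -> str:
    """Send the request and return the assistant's reply text.

    Assumes the vLLM server is already running locally (hypothetical helper).
    """
    request = build_request(build_payload(prompt, image_url))
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-compatible responses carry the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

Calling `describe_image("Describe this image in one sentence.", "<image URL>")` issues the request. Nothing is sent until that call, so the payload can be inspected first with `build_payload`.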
Use Docker
docker model run hf.co/shieldstar/Qwen3.5-122B-A10B-int4-AutoRound-EC

Qwen3.5-122B-A10B-int4-AutoRound-EC

Extended Calibration (EC) INT4 AutoRound quantization of Qwen/Qwen3.5-122B-A10B, a 122B MoE (10B active) multimodal model. Drop-in replacement for Intel/Qwen3.5-122B-A10B-int4-AutoRound with wider calibration settings for improved quality on long-context and reasoning-heavy workloads.

Calibration — Extended vs Intel default

| Setting | Intel (v0.12.0) | EC (this model) |
| --- | --- | --- |
| iters | 200 | 400 |
| nsamples | 128 | 256 |
| seqlen | 512 | 4096 |
| batch_size | 1 | 8 (default) |
| grad_accum | 8 | 1 (default) |
| ignore_layers | shared_expert | shared_expert |
| bits / group | 4 / 128 | 4 / 128 |

The effective calibration batch size is 8 in both cases (1×8 vs 8×1): each optimizer step sees a mathematically equivalent signal, and only the memory/latency profile during quantization differs.
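The arithmetic behind the two recipes can be checked with a short sketch. The numbers come from the table above. The `CalibrationSettings` class is a hypothetical container, not an auto-round type, and `calibration_tokens` is an upper bound that assumes every sample fills the full `seqlen`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalibrationSettings:
    """Hypothetical container for the settings listed in the table."""
    iters: int
    nsamples: int
    seqlen: int
    batch_size: int
    grad_accum: int

    @property
    def effective_batch(self) -> int:
        # Samples contributing to one optimizer step.
        return self.batch_size * self.grad_accum

    @property
    def calibration_tokens(self) -> int:
        # Upper bound: assumes each sample is a full seqlen tokens long.
        return self.nsamples * self.seqlen


INTEL = CalibrationSettings(iters=200, nsamples=128, seqlen=512, batch_size=1, grad_accum=8)
EC = CalibrationSettings(iters=400, nsamples=256, seqlen=4096, batch_size=8, grad_accum=1)

for name, cfg in (("Intel", INTEL), ("EC", EC)):
    print(f"{name}: effective batch {cfg.effective_batch}, "
          f"up to {cfg.calibration_tokens:,} calibration tokens")
# Intel: effective batch 8, up to 65,536 calibration tokens
# EC: effective batch 8, up to 1,048,576 calibration tokens
```

Under that assumption the EC recipe calibrates on up to 16× as many tokens as the Intel default, with twice as many iterations and an unchanged effective batch size.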

Environment

| Component | Version |
| --- | --- |
| auto-round | 0.12.2 |
| transformers | 5.5.3 |
| torch | 2.11.0 |
| safetensors | 0.7.0 |
| huggingface_hub | 1.10.1 |
| Hardware | RunPod H200 SXM (1x) |
| Wall time | ~15 hrs |

Files

| Path | What |
| --- | --- |
| model-000{01..13}-of-00013.* | Quantized language-model shards (INT4 GPTQ, w4g128) |
| model_visual.safetensors | Visual encoder (BF16, base-model passthrough) |
| model_extra_tensors.safetensors | MTP (multi-token prediction) weights (BF16 passthrough) |
| config.json | Multimodal config with embedded quantization_config |
| chat_template.jinja | Qwen3 chat template (reasoning-enabled) |
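After downloading, a local copy can be checked against this file list. The sketch below is a hypothetical helper, not something shipped with the checkpoint. It expands the shard pattern into 13 stems and matches any extension, since the list gives the shard extension only as `.*`.

```python
from pathlib import Path

NUM_SHARDS = 13  # from the pattern model-000{01..13}-of-00013.*
OTHER_FILES = [
    "model_visual.safetensors",
    "model_extra_tensors.safetensors",
    "config.json",
    "chat_template.jinja",
]


def expected_shard_stems(num_shards: int = NUM_SHARDS) -> list[str]:
    """Expand the shard pattern into file stems (name without extension)."""
    return [f"model-{i:05d}-of-{num_shards:05d}" for i in range(1, num_shards + 1)]


def missing_files(checkpoint_dir: str) -> list[str]:
    """Return the expected entries that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    missing = []
    for stem in expected_shard_stems():
        # Accept any extension, mirroring the ".*" in the file list.
        if not any(root.glob(f"{stem}.*")):
            missing.append(f"{stem}.*")
    missing.extend(name for name in OTHER_FILES if not (root / name).is_file())
    return missing
```

An empty list from `missing_files("/path/to/checkpoint")` means every file named in the list is present. The helper does not verify file contents or sizes.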

Layout note — building a custom FP8 hybrid

If you build an FP8-dense hybrid on top of this checkpoint (e.g. via albond's build-hybrid-checkpoint.py), the hybrid builder will report more unmatched FP8 tensors than on Intel's checkpoint (≈ 741 vs 408). The cause is layout: our visual encoder lives in a separate model_visual.safetensors file rather than inline in the main shards, and the builder only scans the main shards. The FP8 visual tensors therefore go unmatched, and the visual encoder stays at BF16 in the resulting hybrid. This is functionally harmless: the visual encoder takes ~0.9 GB at BF16 instead of ~0.45 GB at FP8, with no quality difference and no impact on text throughput. Pass --force to the hybrid builder to proceed.

If you only serve text (the vast majority of use cases) or run vLLM against this checkpoint directly without building a hybrid, this note does not apply.
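The size figures in the layout note follow from bytes per parameter. As a rough sketch, assuming 2 bytes per BF16 value and 1 byte per FP8 value and ignoring any per-tensor scale metadata, the ~0.9 GB figure implies a parameter count that can be converted to its FP8 size (the function name and constants are illustrative):

```python
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def size_gb(num_params: float, dtype: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), ignoring metadata."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


VISUAL_BF16_GB = 0.9  # approximate figure from the layout note

# Parameter count implied by the BF16 size (an estimate, not a measured value).
visual_params = VISUAL_BF16_GB * 1e9 / BYTES_PER_PARAM["bf16"]

fp8_gb = size_gb(visual_params, "fp8")
extra_gb = VISUAL_BF16_GB - fp8_gb

print(f"implied visual params: ~{visual_params / 1e9:.2f}B")
print(f"FP8 size: ~{fp8_gb:.2f} GB, extra memory at BF16: ~{extra_gb:.2f} GB")
```

Under these assumptions, keeping the visual encoder at BF16 costs about 0.45 GB more than an FP8 copy would, which matches the figures quoted in the note.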

License

Apache 2.0 (inherits from Qwen/Qwen3.5-122B-A10B).

Shoutouts

  • Qwen team for the base Qwen3.5-122B-A10B model.
  • Intel for the reference AutoRound INT4 recipe and post-quant checkpoint layout we built on.
  • auto-round for the quantization tooling.
  • albond for the DGX Spark SM121 vLLM patches and hybrid INT4+FP8 serving stack.