Qwen3.6-35B-A3B PARO full4096-e5 — packed format
This is the packed ParoQuant export for Qwen/Qwen3.6-35B-A3B, using the full4096-e5 calibration run.
The packed artifact was produced from the legacy/original export with:
python3 scripts/strip_paro_safetensors.py \
--input-dir /models/qwen36-quant/Qwen3.6-35B-A3B-PARO-full4096-e5 \
--output-dir /models/qwen36-quant/Qwen3.6-35B-A3B-PARO-full4096-e5-packed \
--mode packed
Packed changes:
- Removed every duplicate fp16 .weight fallback tensor where the same module has .qweight
- Removed tensors: 250
- Removed tensor bytes: 2,810,183,680
- model.safetensors: 20,474,495,512 bytes
- Actual packed BPW: 4.6799 using a 35B denominator
- Verified duplicate fallback count after stripping: 0
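The duplicate-fallback rule in the list above can be sketched in plain Python. This is an illustration only, not the contents of scripts/strip_paro_safetensors.py: the function name `find_duplicate_fallbacks`, the name-to-byte-count mapping, and the toy tensor names are all hypothetical, and the sketch assumes a fallback is identified purely by a module having both a `.qweight` and a `.weight` entry.

```python
from typing import Dict, List, Tuple


def find_duplicate_fallbacks(tensor_nbytes: Dict[str, int]) -> Tuple[List[str], int]:
    """Return (names, total_bytes) of `.weight` tensors whose module also has `.qweight`.

    `tensor_nbytes` maps tensor names to their size in bytes (hypothetical input
    shape; a real checkpoint index would have to be converted into this form).
    """
    # Module prefixes that carry a quantized weight.
    quantized_modules = {
        name[: -len(".qweight")] for name in tensor_nbytes if name.endswith(".qweight")
    }
    # A `.weight` tensor is a duplicate fallback only if its module is quantized.
    duplicates = sorted(
        name
        for name in tensor_nbytes
        if name.endswith(".weight") and name[: -len(".weight")] in quantized_modules
    )
    return duplicates, sum(tensor_nbytes[name] for name in duplicates)


# Hypothetical toy index, not taken from the real checkpoint.
toy_index = {
    "layers.0.mlp.up_proj.qweight": 1024,
    "layers.0.mlp.up_proj.weight": 4096,     # duplicate fp16 fallback
    "layers.0.input_layernorm.weight": 64,   # unquantized module, must be kept
}

duplicates, removed_bytes = find_duplicate_fallbacks(toy_index)
stripped_index = {k: v for k, v in toy_index.items() if k not in duplicates}
remaining, _ = find_duplicate_fallbacks(stripped_index)

print(duplicates, removed_bytes, len(remaining))
```

Running the same check on the stripped index mirrors the "verified duplicate fallback count after stripping" line: a packed artifact should report zero remaining duplicates, while unquantized tensors such as layer norms stay in place.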
The legacy/original-format export is available separately at:
Quality and size comparison
Canonical tx4/quality3 evaluation compares each candidate directly against the original BF16 HF model on the same scored token positions.
| Model | Format | Size GiB ↓ | BPW ↓ | PPL ↓ | ΔNLL ↓ | KL nats ↓ | Top-1 % ↑ |
|---|---|---|---|---|---|---|---|
| Original BF16 HF | HF safetensors | 66.966 | 16.435 | 6.5590 | +0.000000 | 0.000000 | 100.000 |
| PARO full4096-e5 packed (this repo) | packed safetensors | 19.068 | 4.680 | 6.6216 | +0.009506 | 0.034684 | 92.000 |
| PARO full4096-e5 unpacked/original-format | legacy safetensors | 21.686 | 5.322 | 6.6216 | +0.009506 | 0.034684 | 92.000 |
| GGUF UD-Q4_K_S | GGUF | 19.458 | 4.776 | 6.5783 | +0.003842 | 0.012800 | 94.999 |
| GGUF UD-Q4_K_M | GGUF | 20.614 | 5.059 | 6.5643 | +0.001718 | 0.010849 | 95.354 |
Column calculations:
- Size GiB: active model weight artifact bytes divided by 2^30; for HF/PARO this sums .safetensors weight files only, and for GGUF this is the .gguf file size.
- BPW: active model weight artifact bytes × 8 / 35,000,000,000; the fixed 35B denominator is used only to make formats directly comparable.
- PPL: exp(mean NLL) over the canonical scored target positions only. Evaluation source: tx4/quality3 validation set, ctx=2048, stride=1023, 127 windows, 129,921 scored tokens.
- ΔNLL: candidate mean NLL minus original BF16 HF mean NLL; positive values mean worse true-token likelihood than the original.
- KL nats: mean KL(P_original_BF16_HF || P_candidate) over the full next-token distribution; lower is better.
- Top-1 %: percentage of scored positions where the candidate and original BF16 HF choose the same argmax next token; higher is better.
- Packed and unpacked PARO quality metrics are identical because the packed artifact removes duplicate fp16 fallback tensors only; the quantized tensors are unchanged.
- GGUF rows were evaluated with llama.cpp using BF16 HF reference log-probabilities streamed in llama.cpp KLD format.
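The column definitions above can be written out as a minimal Python sketch. It is not the tx4/quality3 evaluation code: the function names and the toy distributions are hypothetical, and the sketch assumes per-position NLLs and full next-token probability vectors are already available. Only the two byte counts come from this card.

```python
import math
from typing import Sequence

GIB = 2 ** 30
PARAM_DENOMINATOR = 35_000_000_000  # fixed 35B denominator from the card


def size_gib(artifact_bytes: int) -> float:
    return artifact_bytes / GIB


def bits_per_weight(artifact_bytes: int) -> float:
    return artifact_bytes * 8 / PARAM_DENOMINATOR


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def perplexity(nlls: Sequence[float]) -> float:
    """exp(mean NLL) over the scored positions."""
    return math.exp(mean(nlls))


def delta_nll(candidate_nlls: Sequence[float], reference_nlls: Sequence[float]) -> float:
    """Candidate mean NLL minus reference mean NLL; positive means worse."""
    return mean(candidate_nlls) - mean(reference_nlls)


def kl_nats(p_ref: Sequence[float], p_cand: Sequence[float]) -> float:
    """KL(P_ref || P_cand) in nats for one position; assumes p_cand > 0 wherever p_ref > 0."""
    return sum(p * math.log(p / q) for p, q in zip(p_ref, p_cand) if p > 0.0)


def mean_kl(ref_dists, cand_dists) -> float:
    return mean([kl_nats(p, q) for p, q in zip(ref_dists, cand_dists)])


def top1_agreement(ref_dists, cand_dists) -> float:
    """Percentage of positions where both models pick the same argmax token."""
    def argmax(dist):
        return max(range(len(dist)), key=dist.__getitem__)
    hits = sum(argmax(p) == argmax(q) for p, q in zip(ref_dists, cand_dists))
    return 100.0 * hits / len(ref_dists)


# Byte counts stated in this card: packed file, and packed plus removed bytes.
packed_bytes = 20_474_495_512
unpacked_bytes = packed_bytes + 2_810_183_680
print(round(size_gib(packed_bytes), 3), round(bits_per_weight(packed_bytes), 4))
print(round(size_gib(unpacked_bytes), 3), round(bits_per_weight(unpacked_bytes), 3))

# Hypothetical toy distributions: two scored positions, vocabulary of three.
ref = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
cand = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]
print(mean_kl(ref, cand), top1_agreement(ref, cand))
```

With the stated byte counts, the size and BPW helpers reproduce the packed and unpacked rows of the table (19.068 GiB at 4.680 BPW, 21.686 GiB at 5.322 BPW). The scored-token count is also consistent with 127 windows × 1023 positions = 129,921, although that per-window reading is an inference rather than something the card states.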
Notes
This artifact requires a packed-aware ParoQuant-compatible loader/runtime; legacy loaders that expect duplicate fp16 fallback .weight tensors will not load this format.
See strip_paro_safetensors_report.json for the exact stripping report.