Original ninja.template provides better results
No idea why everyone is using a custom Jinja chat template. I benchmarked your model using https://benchlocal.com/ and it is currently the best fine-tuned Qwen3.5 9B I have come across, but only when using the original template file.
With the template file in this repo, its performance is worse.
All benchmarks were performed on 2x 3090 with a 250 W power limit, using stock vLLM (v0.21.0) with thinking disabled and no MTP. Your model is fast thanks to the quant, but also because it uses fewer tokens.
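For anyone trying to reproduce the "thinking disabled" setting, here is a minimal sketch of a request body for vLLM's OpenAI-compatible chat endpoint. It assumes the model's chat template honors an `enable_thinking` switch passed through `chat_template_kwargs`, as Qwen3-family templates do; whether this particular fine-tune does is an assumption, and `build_chat_request` is a hypothetical helper, not part of any library.

```python
import json


def build_chat_request(model: str, user_text: str, thinking: bool = False) -> dict:
    """Build an OpenAI-compatible chat completion body for a vLLM server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        # Assumption: the chat template reads `enable_thinking`
        # (true for Qwen3-family templates, unverified for this model).
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


body = build_chat_request("ytgui/Qwen3.5-Sonnet-9B", "Hello")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the `--data` payload of a POST to `/v1/chat/completions` on the running server.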
Hi,
Thanks so much for taking the time to bench the model and share your findings. It's great to hear it's performing well.
To clarify the Jinja template situation: the only change I made was adding a default system prompt, "You are a helpful AI assistant.", to the template. No other modifications. I felt the default system prompt was worth keeping for non-technical users who may not think to set one themselves. As for why this causes a performance difference, models at this scale can sometimes overfit to the context they were trained with, so even a small change to the prompt can shift the results.
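To make the difference concrete, here is a rough Python sketch of what the two templates do to the same conversation. It only mimics the behaviour described above (inject a default system prompt when none is given); the function names are hypothetical and this is not the actual template code.

```python
DEFAULT_SYSTEM_PROMPT = "You are a helpful AI assistant."


def with_default_system(messages, default=DEFAULT_SYSTEM_PROMPT):
    """Mimic the modified template: add a system prompt only if none exists."""
    if messages and messages[0]["role"] == "system":
        return list(messages)
    return [{"role": "system", "content": default}] + list(messages)


def without_default_system(messages):
    """Mimic the original template: messages pass through unchanged."""
    return list(messages)


user_only = [{"role": "user", "content": "What is 2 + 2?"}]
modified = with_default_system(user_only)
original = without_default_system(user_only)

# The benchmark sends the same messages, but the model sees different prompts.
print([m["role"] for m in modified])
print([m["role"] for m in original])
```

Under this sketch, a benchmark that sends no system message ends up with an extra system turn in front of every prompt when the modified template is used, which is the only difference between the two runs.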
As for the score difference, a range of 74.0–75.2 honestly looks good to me either way.
That said, this is a genuinely useful discussion and I'd love to keep it open. I will look into whether there's a clean workaround.
Thanks.
btw bro, to the best of my knowledge the community still lacks a solid agentic coding benchmark, would that be something you'd be interested in designing?
My rough idea: pack a real git repo (e.g., sqlite, redis) into a container, strip the git history, and define realistic coding tasks like what you'd throw at claude code or opencode. would love to hear your thoughts!
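A very rough sketch of that idea in Python, just to show the shape: a task record plus a helper that strips the git history from a checked-out repo. `CodingTask`, its fields, and the `make test` check command are all hypothetical placeholders, and the container packaging step is left out.

```python
import shutil
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CodingTask:
    repo: str           # name of the packed repository
    prompt: str         # instruction handed to the coding agent
    check_command: str  # command that decides pass/fail afterwards


def strip_git_history(repo_dir: Path) -> bool:
    """Delete the .git directory so the agent cannot look up the real fix."""
    git_dir = repo_dir / ".git"
    if git_dir.is_dir():
        shutil.rmtree(git_dir)
        return True
    return False


# Demo on a throwaway directory standing in for a real checkout.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "demo-repo"
    (repo / ".git").mkdir(parents=True)
    (repo / "main.c").write_text("int main(void) { return 0; }\n")
    removed = strip_git_history(repo)
    remaining = sorted(p.name for p in repo.iterdir())

task = CodingTask(
    repo="demo-repo",
    prompt="Fix the failing unit test in the parser module.",
    check_command="make test",
)
print(removed, remaining, task)
```

Each task would then run inside its own container, with the check command deciding whether the agent's changes count as a pass.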
Yeah, both numbers, 74.0 and 75.2, are great for a finetune, as it is very difficult to improve in one area without becoming worse in another. While benchlocal has these 7 nice bench packs, you are totally right that it's missing an agentic coding benchmark. Designing a coding benchmark is pretty difficult, as there are too many programming languages, and including several of them would make the benchmark too big. Also, Docker and containers in general are not my expertise; I avoid them when possible.