How to use from
llama.cpp
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf TirGun/Qwen3.5-4B-GGUF:
# Run inference directly in the terminal:
llama-cli -hf TirGun/Qwen3.5-4B-GGUF:
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf TirGun/Qwen3.5-4B-GGUF:
# Run inference directly in the terminal:
llama-cli -hf TirGun/Qwen3.5-4B-GGUF:
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf TirGun/Qwen3.5-4B-GGUF:
# Run inference directly in the terminal:
./llama-cli -hf TirGun/Qwen3.5-4B-GGUF:
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf TirGun/Qwen3.5-4B-GGUF:
# Run inference directly in the terminal:
./build/bin/llama-cli -hf TirGun/Qwen3.5-4B-GGUF:
Use Docker
docker model run hf.co/TirGun/Qwen3.5-4B-GGUF:
Quick Links

Qwen 3.5 4B GGUF

Description

This repository contains GGUF weights for the Qwen/Qwen3.5-4B model. The files were converted and quantized using llama.cpp.

Provided Files

  • Q6_K: High quality, recommended for best performance if you have enough VRAM.
  • Q5_K_M: Balanced quality and speed.
  • Q4_K_M: Optimal for most users, fast and lightweight.

Usage

You can run these models using llama.cpp or any GGUF-compatible software like LM Studio, Ollama, or KoboldCPP.

Example command for llama-cli:

./llama-cli -m qwen3.5-4b-Q4_K_M.gguf -ngl 32

Example PowerShell command for llama-cli:

.\llama-cli.exe -m qwen3.5-4b-Q4_K_M.gguf -ngl 32 -fa 0 --no-mmap --reasoning off 

Parameter Quick Reference (CLI Flags)

When running this model via llama-cli, you can use the following flags to optimize performance:

Flash Attention (-fa)

An optimization technique for the attention mechanism.

  • -fa 1: Enable. Significantly speeds up processing for long contexts (requires model and hardware support).
  • -fa 0: Disable. More stable, but slower when dealing with large contexts.

Memory Mapping (--no-mmap)

Controls how the model file is loaded into the system.

  • Without this flag: The model uses mmap (memory-mapped files) by default. It provides faster loading but may occasionally conflict with specific systems or GPU drivers.
  • With --no-mmap: The model is fully read into system RAM. This is more reliable for troubleshooting but results in slower startup times and higher RAM consumption.

Reasoning Process (--reasoning)

Controls the output of the model's internal "thinking" (for models trained with reasoning capabilities like Qwen 3.5).

  • --reasoning on: Allows the model to display its internal thought process (usually enclosed within <thought> tags).
  • --reasoning off: Disables the thought process output, forcing the model to provide a direct answer immediately.
Downloads last month
711
GGUF
Model size
4B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

4-bit

5-bit

6-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for TirGun/Qwen3.5-4B-GGUF

Finetuned
Qwen/Qwen3.5-4B
Quantized
(241)
this model