Qwen3.5-0.8B
This version of Qwen3.5-0.8B has been converted to run on the Axera NPU using w4a16 quantization.
Compatible with Pulsar2 version: 5.0
Conversion tool links:
If you are interested in model conversion, you can export the axmodel yourself from the original repo:
Pulsar2 Link, How to Convert LLM from Huggingface to axmodel
Supported Platforms
- AX650
- AX650N DEMO Board
- M4N-Dock(爱芯派Pro)
- M.2 Accelerator card
Image Processing
| Chips | Input Size | Image Num | TTFT (168 tokens) | Throughput (w4a16) | CMM Memory | Flash Memory |
|---|---|---|---|---|---|---|
| AX650 | 384×384 | 1 | 250 ms | 23.8 tokens/sec | 1.09 GiB | 1.36 GiB |
Video Processing
| Chips | Input Size | Image Num | TTFT (600 tokens) | Throughput (w4a16) | CMM Memory | Flash Memory |
|---|---|---|---|---|---|---|
| AX650 | 384×384 | 8 | 630 ms | 24.1 tokens/sec | 1.09 GiB | 1.36 GiB |
The CMM Memory column is the amount of CMM memory the model consumes at run time. Ensure that the CMM memory allocated on the development board is greater than this value.
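As a rough reading of the two tables above, the latency of a reply can be approximated as TTFT plus the number of generated tokens divided by the decode throughput. The sketch below is illustrative only: the function name and the 200-token reply length are assumptions, and it ignores any cost the tables do not measure.

```python
def estimate_latency_s(ttft_ms: float, tokens_per_s: float, new_tokens: int) -> float:
    """Approximate reply latency: time to first token plus decode time."""
    return ttft_ms / 1000.0 + new_tokens / tokens_per_s


# Figures taken from the AX650 rows of the tables above.
# The reply length of 200 tokens is an arbitrary illustration.
image_latency = estimate_latency_s(ttft_ms=250, tokens_per_s=23.8, new_tokens=200)
video_latency = estimate_latency_s(ttft_ms=630, tokens_per_s=24.1, new_tokens=200)

print(f"image: {image_latency:.2f} s, video: {video_latency:.2f} s")
```

For replies of this length the decode phase dominates, so the throughput column matters more than TTFT.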
How to use
Install axllm
Option 1: clone the repository and run the install script:
git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh
Option 2: install with a single command (default branch: axllm):
curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash
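Option 3: download the executable exported by GitHub Actions CI (suitable for users without a build environment). Go to:
https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm
download the executable (axllm) exported by the latest CI run, then: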
chmod +x axllm
sudo mv axllm /usr/bin/axllm
Download the model (Hugging Face)
Create the model directory, enter it, and download the files into it:
mkdir -p AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C256-P6K-CTX8K
cd AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C256-P6K-CTX8K
hf download AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C256-P6K-CTX8K --local-dir .
# structure of the downloaded files
tree -L 3
`-- AXERA-TECH
    `-- Qwen3.5-0.8B-AX650-GPTQ-Int4-C256-P6K-CTX8K
        |-- qwen3_5_vision.axmodel
        |-- README.md
        |-- config.json
        |-- image.png
        |-- model.embed_tokens.weight.bfloat16.bin
        |-- post_config.json
        |-- qwen3_5_tokenizer.txt
        |-- qwen3_5_text_p128_l0_together.axmodel
        ...
        |-- qwen3_5_text_p128_l23_together.axmodel
        |-- qwen3_5_text_post.axmodel
        `-- vision_cache
3 directories, 39 files
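A quick way to confirm that a download is complete is to compare the directory against the listing above. The following is a minimal sketch, not part of axllm: the helper names are hypothetical, and it assumes the layer files elided by "..." are numbered l0 through l23 without gaps.

```python
from pathlib import Path

# Assumption: the layer files elided by "..." in the listing are numbered
# l0 through l23 without gaps. Adjust NUM_LAYERS if your download differs.
NUM_LAYERS = 24


def expected_files(num_layers: int = NUM_LAYERS) -> list[str]:
    """File names taken from the directory listing, plus the assumed layer files."""
    names = [
        "qwen3_5_vision.axmodel",
        "config.json",
        "post_config.json",
        "qwen3_5_tokenizer.txt",
        "model.embed_tokens.weight.bfloat16.bin",
        "qwen3_5_text_post.axmodel",
    ]
    names += [f"qwen3_5_text_p128_l{i}_together.axmodel" for i in range(num_layers)]
    return names


def missing_files(model_dir: str) -> list[str]:
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in expected_files() if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    print("all expected files present" if not missing else f"missing: {missing}")
```

Run it from inside the model directory; an empty result means every file named in the listing was found.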
Inference with AX650 Host, such as M4N-Dock(爱芯派Pro) or AX650N DEMO Board
Run (CLI)
This model was quantized using English data, so English is recommended for questions and answers.
Image understanding
root@ax650 ~/yongqiang/lhj/Qwen3_5.AXERA/ax-llm # axllm run Qwen3.5-0.8B-AX650-GPTQ-Int4-C256-P6K-CTX8K/
19:14:47.144 INF Init:218 | LLM init start
19:14:47.144 INF Init:226 | mixed attention enabled: full_attention_interval=4 ref_full_layer_idx=3
tokenizer_type = 3
96% | ############################## | 26 / 27 [28.70s<29.80s, 0.91 count/s] init post axmodel ok,remain_cmm(5497 MB)
19:15:15.845 INF Init:368 | max_token_len : 2047
19:15:15.845 INF Init:371 | kv_cache_size : 512, kv_cache_num: 2047
19:15:15.845 INF Init:374 | prefill_token_num : 128
19:15:15.845 INF Init:379 | grp: 1, prefill_max_kv_cache_num : 1
19:15:15.845 INF Init:379 | grp: 2, prefill_max_kv_cache_num : 128
19:15:15.845 INF Init:379 | grp: 3, prefill_max_kv_cache_num : 256
19:15:15.845 INF Init:379 | grp: 4, prefill_max_kv_cache_num : 384
19:15:15.845 INF Init:379 | grp: 5, prefill_max_kv_cache_num : 512
19:15:15.845 INF Init:379 | grp: 6, prefill_max_kv_cache_num : 768
19:15:15.845 INF Init:379 | grp: 7, prefill_max_kv_cache_num : 896
19:15:15.845 INF Init:379 | grp: 8, prefill_max_kv_cache_num : 1024
19:15:15.845 INF Init:379 | grp: 9, prefill_max_kv_cache_num : 1152
19:15:15.845 INF Init:384 | prefill_max_token_num : 1152
19:15:15.845 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ | 27 / 27 [28.71s<28.71s, 0.94 count/s] embed_selector init ok
19:15:17.168 INF Init:643 | Qwen-VL token ids: vision_start=248053 image_pad=248056 video_pad=248057
19:15:17.168 INF Init:668 | VisionModule init ok: type=Qwen3VL, tokens_per_block=144, embed_size=1024, out_dtype=fp32
19:15:17.168 WRN Init:677 | Vision preprocess backend: SimpleCV (OpenCV not found at build time; minor differences vs OpenCV are possible)
19:15:17.173 INF load_config:282 | load config:
19:15:17.173 INF load_config:282 | {
19:15:17.173 INF load_config:282 | "enable_repetition_penalty": false,
19:15:17.173 INF load_config:282 | "enable_temperature": false,
19:15:17.173 INF load_config:282 | "enable_top_k_sampling": true,
19:15:17.173 INF load_config:282 | "enable_top_p_sampling": false,
19:15:17.173 INF load_config:282 | "penalty_window": 20,
19:15:17.173 INF load_config:282 | "repetition_penalty": 1.2,
19:15:17.173 INF load_config:282 | "temperature": 0.9,
19:15:17.173 INF load_config:282 | "top_k": 10,
19:15:17.173 INF load_config:282 | "top_p": 0.8
19:15:17.173 INF load_config:282 | }
19:15:17.173 INF Init:448 | LLM init ok
Commands:
/q, /exit Quit
/reset Reset the kvcache
/dd Delete one conversation turn
/pp Print the conversation history
Ctrl+C: Stop the current generation
VLM enabled: after each prompt, input image path (empty = text-only). Use "video:<frames_dir>" for video.
----------------------------------------
prompt >> describe this image
image >> image.png
15:52:39.666 INF EncodeForContent:919 | vision cache hit (disk): image.png
15:52:39.666 INF EncodeForContent:928 | vision cache hit (mem): image.png
15:52:39.669 INF SetKVCache:747 | prefill_grpid:3 kv_cache_num:256 precompute_len:0 input_num_token:168
15:52:39.669 INF SetKVCache:749 | current prefill_max_token_num:1152
15:52:39.669 INF SetKVCache:750 | first run
15:52:39.718 INF Run:805 | input token num : 168, prefill_split_num : 2
15:52:39.718 INF Run:845 | prefill chunk p=0 history_len=0 grpid=1 kv_cache_num=0 input_tokens=128
15:52:39.718 INF Run:868 | prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=3
15:52:39.833 INF Run:845 | prefill chunk p=1 history_len=128 grpid=2 kv_cache_num=128 input_tokens=40
15:52:39.833 INF Run:868 | prefill indices shape: p=1 idx_elems=128 idx_rows=1 pos_rows=3
15:52:39.968 INF Run:1010 | ttft: 249.97 ms
<think>
</think>
This image captures three astronauts in space, set against a forest backdrop that resembles a bamboo grove. The lighting is stark and monochromatic, giving the scene a surreal or high-contrast aesthetic.
- **The Astronaut:** In the foreground sits a tall astronaut, viewed from the front. This individual has long blonde hair and is wearing an all-white, puffy space suit with a dark helmet. Their stance is upright, suggesting they are standing in front of the camera or in the scene itself.
- **Background:** Behind the astronaut, the scene transitions into a dense forest. To the immediate left, the foliage is out of focus, creating a sense of depth. The background is filled with tall, thin bamboo stalks and their feathery fronds, which dominate the upper half of the frame. The right half of the background is also mostly obscured by the dense foliage. The entire image is rendered in grayscale with some bright white highlights on the left edge of the frame and a dark, shadowy left edge of the astronaut's head.
Overall, the composition creates a sense of vastness and scale through the large amount of foliage. The bright white highlights are likely from the sunlight hitting the left edge of the frame, which contrasts with the dark shadow on the astronaut.
15:52:51.081 NTC Run:1132 | hit eos,avg 23.76 token/s
15:52:51.081 INF GetKVCache:721 | precompute_len:300, remaining:852
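The configuration printed at startup in the session above enables only top-k sampling, with top_k set to 10. The sketch below shows the general top-k technique in plain Python for readers unfamiliar with it. It is not axllm's implementation, and the function name and the toy logits are assumptions for illustration.

```python
import math
import random


def top_k_sample(logits: list[float], k: int, rng: random.Random) -> int:
    """Sample a token id from the k highest-scoring logits."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only (subtract the max for numerical stability).
    peak = logits[top[0]]
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
logits = [0.1, 2.5, -1.0, 3.0, 0.7]  # toy scores for a 5-token vocabulary
token_id = top_k_sample(logits, k=2, rng=rng)
print(token_id)  # always 1 or 3, the two highest-scoring ids
```

With k = 1 the procedure reduces to greedy decoding; larger k admits more candidates and therefore more varied output.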
Video understanding
prompt >> 视频中有几个进球
media >> video:assets/football.mp4:1
17:10:33.416 INF extract_video_frames_ffmpeg:299 | Extracting raw video container to frames: football.mp4 -> /tmp/axllm_video_frames/video_74679077807123_0 fps=0.99900099900099892
17:10:49.837 INF collect_video_frame_paths:379 | Video fps sampling: path=football.mp4 fps=1 duration=60.060s target_frames=60 selected=60
17:11:00.438 INF SetKVCache:2543 | decode_grpid:2 prefill_grpid:4 history_cap:0 total_cap:256 symbolic_cap:1 precompute_len:0 input_num_token:4345 prefer_symbolic_group:0
17:11:00.438 INF SetKVCache:2565 | current prefill_max_token_num:6400
17:11:00.537 INF SetKVCache:2581 | first run
17:11:00.557 INF Run:2738 | input token num : 4345, prefill_split_num : 17
17:11:00.557 INF Run:2818 | prefill chunk p=0 history_len=0 grpid=4 kv_cache_num=0 input_tokens=256
17:11:00.783 INF Run:2818 | prefill chunk p=1 history_len=256 grpid=6 kv_cache_num=512 input_tokens=256
17:11:01.034 INF Run:2818 | prefill chunk p=2 history_len=512 grpid=7 kv_cache_num=768 input_tokens=256
17:11:01.293 INF Run:2818 | prefill chunk p=3 history_len=768 grpid=8 kv_cache_num=1024 input_tokens=256
17:11:01.564 INF Run:2818 | prefill chunk p=4 history_len=1024 grpid=9 kv_cache_num=1280 input_tokens=256
17:11:01.837 INF Run:2818 | prefill chunk p=5 history_len=1280 grpid=10 kv_cache_num=1536 input_tokens=256
17:11:02.128 INF Run:2818 | prefill chunk p=6 history_len=1536 grpid=11 kv_cache_num=1792 input_tokens=256
17:11:02.431 INF Run:2818 | prefill chunk p=7 history_len=1792 grpid=12 kv_cache_num=2048 input_tokens=256
17:11:02.746 INF Run:2818 | prefill chunk p=8 history_len=2048 grpid=13 kv_cache_num=2304 input_tokens=256
17:11:03.057 INF Run:2818 | prefill chunk p=9 history_len=2304 grpid=14 kv_cache_num=2560 input_tokens=256
17:11:03.371 INF Run:2818 | prefill chunk p=10 history_len=2560 grpid=15 kv_cache_num=2816 input_tokens=256
17:11:03.688 INF Run:2818 | prefill chunk p=11 history_len=2816 grpid=16 kv_cache_num=3072 input_tokens=256
17:11:04.009 INF Run:2818 | prefill chunk p=12 history_len=3072 grpid=17 kv_cache_num=3328 input_tokens=256
17:11:04.331 INF Run:2818 | prefill chunk p=13 history_len=3328 grpid=18 kv_cache_num=3584 input_tokens=256
17:11:04.661 INF Run:2818 | prefill chunk p=14 history_len=3584 grpid=19 kv_cache_num=3840 input_tokens=256
17:11:04.998 INF Run:2818 | prefill chunk p=15 history_len=3840 grpid=20 kv_cache_num=4096 input_tokens=256
17:11:05.343 INF Run:2818 | prefill chunk p=16 history_len=4096 grpid=21 kv_cache_num=4352 input_tokens=249
17:11:05.734 INF Run:3045 | ttft: 5176.28 ms
<think>17:11:05.734 INF Run:3076 | VLM decode positions: rope_start=4213 dense_kv_start=4345
</think>
根据提供的视频画面,共有 **2** 个进球。
- **进球 1**:在视频前半段,一名身穿白色球衣的球员带球突破,将球踢向对方球门,但被守门员扑出。
- **进球 2**:在视频后半段,一名身穿白色球衣的球员在禁区外完成射门,球入网。
17:11:09.920 NTC Run:3445 | hit eos,decode avg 19.35 token/s
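The prefill lines in the two logs above are consistent with the prompt being split into fixed-size chunks, with any remainder placed in a shorter final chunk: 168 tokens in chunks of 128 for the image run, and 4345 tokens in chunks of 256 for the video run. The sketch below reproduces only those chunk counts and sizes. The function name is hypothetical, and this is not a description of how axllm implements prefill internally.

```python
def split_prefill(num_tokens: int, chunk_size: int) -> list[int]:
    """Return the number of input tokens in each prefill chunk."""
    full, rest = divmod(num_tokens, chunk_size)
    return [chunk_size] * full + ([rest] if rest else [])


image_chunks = split_prefill(168, 128)   # image run: chunks of 128 tokens
video_chunks = split_prefill(4345, 256)  # video run: chunks of 256 tokens

print(image_chunks)                         # [128, 40]
print(len(video_chunks), video_chunks[-1])  # 17 249
```

These values match `prefill_split_num : 2` and `prefill_split_num : 17` in the logs, as well as the `input_tokens` of the last chunk in each run.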
Start the server (OpenAI-compatible)
root@ax650:~# axllm serve AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C256-P6K-CTX8K
[I][ Init][ 138]: LLM init start
tokenizer_type = 1
96% | ███████████████████████████████ | 30 / 31 [4.63s<4.79s, 6.47 count/s] init post axmodel ok,remain_cmm(9563 MB)
[I][ Init][ 199]: max_token_len : 2047
[I][ Init][ 202]: kv_cache_size : 1024, kv_cache_num: 2047
[I][ Init][ 205]: prefill_token_num : 128
[I][ Init][ 209]: grp: 1, prefill_max_kv_cache_num : 1
[I][ Init][ 209]: grp: 2, prefill_max_kv_cache_num : 128
[I][ Init][ 209]: grp: 3, prefill_max_kv_cache_num : 256
[I][ Init][ 209]: grp: 4, prefill_max_kv_cache_num : 384
[I][ Init][ 209]: grp: 5, prefill_max_kv_cache_num : 512
[I][ Init][ 209]: grp: 6, prefill_max_kv_cache_num : 640
[I][ Init][ 209]: grp: 7, prefill_max_kv_cache_num : 768
[I][ Init][ 209]: grp: 8, prefill_max_kv_cache_num : 896
[I][ Init][ 209]: grp: 9, prefill_max_kv_cache_num : 1024
[I][ Init][ 209]: grp: 10, prefill_max_kv_cache_num : 1152
[I][ Init][ 214]: prefill_max_token_num : 1152
[I][ Init][ 27]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ | 31 / 31 [4.64s<4.64s, 6.69 count/s] embed_selector init ok
[W][ Init][ 457]: Qwen-VL vision size override: cfg=448x448 bytes=1204224, model_input_bytes=884736 -> 384x384 (square).
[I][ Init][ 641]: Qwen-VL token ids: vision_start=151652 image_pad=151655 video_pad=151656
[I][ Init][ 666]: VisionModule init ok: type=Qwen3VL, tokens_per_block=144, embed_size=2048, out_dtype=fp32
[I][ Init][ 672]: VisionModule deepstack enabled: layers=3
[I][ load_config][ 282]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": false,
"enable_top_k_sampling": false,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 10,
"top_p": 0.8
}
[I][ Init][ 272]: LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C256-P6K-CTX8K'...
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C256-P6K-CTX8K
OpenAI client example
from openai import OpenAI

# Address of the local axllm server started with `axllm serve`.
API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C256-P6K-CTX8K"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

# The client library requires an API key argument; the value here is a placeholder.
client = OpenAI(api_key="not-needed", base_url=API_URL)
completion = client.chat.completions.create(
    model=MODEL,
    messages=messages,
)
print(completion.choices[0].message.content)
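The example above sends text only. This card does not state whether `axllm serve` accepts images over the API, so the following is a sketch under an explicit assumption: that the server understands the OpenAI-style `image_url` content part with a base64 data URL. The helper name `image_message` is hypothetical.

```python
import base64


def image_message(prompt: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build a user message in the OpenAI vision-style content format."""
    # Assumption: the server accepts a base64 data URL in an image_url part.
    data_url = f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }


# Placeholder bytes for illustration; in practice read them from a file such as image.png.
msg = image_message("describe this image", b"\x89PNG placeholder")
print(msg["content"][1]["image_url"]["url"][:22])
```

If the server does support this format, the returned dictionary can be appended to `messages` in place of the plain-text user message. Otherwise use the CLI image prompt shown earlier.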
OpenAI streaming example
from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C256-P6K-CTX8K"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
stream = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    stream=True,
)

print("assistant:")
for ev in stream:
    # Each streamed event carries an incremental delta; print only non-empty text.
    delta = getattr(ev.choices[0], "delta", None)
    if delta and getattr(delta, "content", None):
        print(delta.content, end="", flush=True)
print("\n")