
Qwen3.5-0.8B

This version of Qwen3.5-0.8B has been converted to run on the Axera NPU using w4a16 quantization.

Compatible with Pulsar2 version: 5.0

Conversion tool links:

If you are interested in model conversion, you can export the axmodel yourself from the original repo:

Pulsar2 Link, How to Convert LLM from Huggingface to axmodel

AXera NPU HOST LLM Runtime

Supported Platforms

Image Process

| Chips | Input Size | Image Num | TTFT (168 tokens) | Throughput (w4a16) | CMM Memory | Flash Memory |
|---|---|---|---|---|---|---|
| AX650 | 384×384 | 1 | 250 ms | 23.8 tokens/sec | 0.94 GiB | 1.31 GiB |

Video Process

| Chips | Input Size | Image Num | TTFT (600 tokens) | Throughput (w4a16) | CMM Memory | Flash Memory |
|---|---|---|---|---|---|---|
| AX650 | 384×384 | 8 | 630 ms | 24.1 tokens/sec | 0.94 GiB | 1.31 GiB |
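
As a rough reading of the two tables, end-to-end response time can be approximated as TTFT plus the number of generated tokens divided by the decode throughput. The sketch below is only an estimate under that assumption; the function name and the 200-token answer length are illustrative and not part of the benchmark.

```python
def estimate_response_seconds(ttft_ms: float, tokens_per_sec: float, new_tokens: int) -> float:
    """Approximate wall-clock time: time to first token plus decode time."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return ttft_ms / 1000.0 + new_tokens / tokens_per_sec


# TTFT and throughput come from the tables above.
# 200 generated tokens is a hypothetical answer length chosen for illustration.
image_s = estimate_response_seconds(ttft_ms=250, tokens_per_sec=23.8, new_tokens=200)
video_s = estimate_response_seconds(ttft_ms=630, tokens_per_sec=24.1, new_tokens=200)
print(f"image: {image_s:.2f} s, video: {video_s:.2f} s")
```

Because decode throughput is nearly identical in both cases, the difference between image and video requests comes almost entirely from the longer prefill.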

The CMM Memory column is the amount of CMM memory the model consumes. Ensure that the CMM memory allocated on the development board is greater than this value.
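
A minimal sketch of that check, assuming you already know how much CMM is free on the board. It also reads the `remain_cmm(... MB)` field that the runtime prints during startup (visible in the logs later in this document). The helper names are hypothetical, and treating the logged "MB" as MiB is an assumption.

```python
import re

REQUIRED_CMM_GIB = 0.94  # "CMM Memory" column in the tables above


def remaining_cmm_mb(log_line: str):
    """Return the value inside remain_cmm(... MB), or None if the field is absent."""
    match = re.search(r"remain_cmm\((\d+)\s*MB\)", log_line)
    return int(match.group(1)) if match else None


def fits_in_cmm(free_cmm_mb: float, required_gib: float = REQUIRED_CMM_GIB) -> bool:
    """True if the free CMM (assumed to be in MiB) exceeds the model's requirement."""
    return free_cmm_mb > required_gib * 1024


line = "init post axmodel ok,remain_cmm(5497 MB)"  # copied from the startup log
print(remaining_cmm_mb(line))
print(fits_in_cmm(free_cmm_mb=2048))  # 2048 is a hypothetical free-CMM figure
```

Note that `remain_cmm` is printed after the model has been loaded, so it reports the headroom left over, not the amount required.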

How to use

Install axllm

Option 1: clone the repository and run the install script:

git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh

Option 2: one-line install (default branch: axllm):

curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash

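Option 3: download the executable built by GitHub Actions CI (for users without a build environment):

If you do not have a build environment, go to https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm and download the latest CI-built executable (axllm), then: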

chmod +x axllm
sudo mv axllm /usr/bin/axllm

Model download (Hugging Face)

Create the model directory, enter it, then download the model into it:

mkdir -p AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047
cd AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047
hf download AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047 --local-dir .

# structure of the downloaded files
tree -L 3
`-- AXERA-TECH
    `-- Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047
        |-- qwen3_5_vision.axmodel
        |-- README.md
        |-- config.json
        |-- image.png
        |-- model.embed_tokens.weight.bfloat16.bin
        |-- post_config.json
        |-- qwen3_5_tokenizer.txt
        |-- qwen3_5_text_p128_l0_together.axmodel
        ...
        |-- qwen3_5_text_p128_l23_together.axmodel
        |-- qwen3_5_text_post.axmodel
        `-- vision_cache

3 directories, 39 files
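
After the download finishes, it can be worth confirming that the files named in the listing are actually present. The sketch below is one way to do that with the standard library. The helper names are hypothetical, and the 24-layer count (l0 through l23) is an assumption read from the first and last layer file names in the listing, since the middle entries are elided there.

```python
from pathlib import Path

NUM_TEXT_LAYERS = 24  # assumption: layers l0..l23, per the listing above


def expected_files(num_layers: int = NUM_TEXT_LAYERS) -> list[str]:
    """File names taken from the directory listing above."""
    names = [
        "config.json",
        "post_config.json",
        "qwen3_5_tokenizer.txt",
        "model.embed_tokens.weight.bfloat16.bin",
        "qwen3_5_vision.axmodel",
        "qwen3_5_text_post.axmodel",
    ]
    names += [f"qwen3_5_text_p128_l{i}_together.axmodel" for i in range(num_layers)]
    return names


def missing_files(model_dir: str) -> list[str]:
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in expected_files() if not (root / name).is_file()]


missing = missing_files(".")  # run from inside the model directory
print("all expected files present" if not missing else f"missing: {missing}")
```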

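Inference on an AX650 host, such as the M4N-Dock (AXera-Pi Pro) or the AX650N demo board

Run (CLI)

This model was quantized on English data, so English is recommended for prompts and questions.

Image understanding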

root@ax650 ~/yongqiang/lhj/Qwen3_5.AXERA/ax-llm # axllm run Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047/
19:14:47.144 INF Init:218 | LLM init start
19:14:47.144 INF Init:226 | mixed attention enabled: full_attention_interval=4 ref_full_layer_idx=3
tokenizer_type = 3
 96% | ##############################   |  26 /  27 [28.70s<29.80s, 0.91 count/s] init post axmodel ok,remain_cmm(5497 MB)
19:15:15.845 INF Init:368 | max_token_len : 2047
19:15:15.845 INF Init:371 | kv_cache_size : 512, kv_cache_num: 2047
19:15:15.845 INF Init:374 | prefill_token_num : 128
19:15:15.845 INF Init:379 | grp: 1, prefill_max_kv_cache_num : 1
19:15:15.845 INF Init:379 | grp: 2, prefill_max_kv_cache_num : 128
19:15:15.845 INF Init:379 | grp: 3, prefill_max_kv_cache_num : 256
19:15:15.845 INF Init:379 | grp: 4, prefill_max_kv_cache_num : 384
19:15:15.845 INF Init:379 | grp: 5, prefill_max_kv_cache_num : 512
19:15:15.845 INF Init:379 | grp: 6, prefill_max_kv_cache_num : 768
19:15:15.845 INF Init:379 | grp: 7, prefill_max_kv_cache_num : 896
19:15:15.845 INF Init:379 | grp: 8, prefill_max_kv_cache_num : 1024
19:15:15.845 INF Init:379 | grp: 9, prefill_max_kv_cache_num : 1152
19:15:15.845 INF Init:384 | prefill_max_token_num : 1152
19:15:15.845 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ |  27 /  27 [28.71s<28.71s, 0.94 count/s] embed_selector init ok
19:15:17.168 INF Init:643 | Qwen-VL token ids: vision_start=248053 image_pad=248056 video_pad=248057
19:15:17.168 INF Init:668 | VisionModule init ok: type=Qwen3VL, tokens_per_block=144, embed_size=1024, out_dtype=fp32
19:15:17.168 WRN Init:677 | Vision preprocess backend: SimpleCV (OpenCV not found at build time; minor differences vs OpenCV are possible)
19:15:17.173 INF load_config:282 | load config: 
19:15:17.173 INF load_config:282 | {
19:15:17.173 INF load_config:282 |     "enable_repetition_penalty": false,
19:15:17.173 INF load_config:282 |     "enable_temperature": false,
19:15:17.173 INF load_config:282 |     "enable_top_k_sampling": true,
19:15:17.173 INF load_config:282 |     "enable_top_p_sampling": false,
19:15:17.173 INF load_config:282 |     "penalty_window": 20,
19:15:17.173 INF load_config:282 |     "repetition_penalty": 1.2,
19:15:17.173 INF load_config:282 |     "temperature": 0.9,
19:15:17.173 INF load_config:282 |     "top_k": 10,
19:15:17.173 INF load_config:282 |     "top_p": 0.8
19:15:17.173 INF load_config:282 | }
19:15:17.173 INF Init:448 | LLM init ok
Commands:
  /q, /exit  退出
  /reset     重置 kvcache
  /dd        删除一轮对话
  /pp        打印历史对话
Ctrl+C: 停止当前生成
VLM enabled: after each prompt, input image path (empty = text-only). Use "video:<frames_dir>" for video.
----------------------------------------
prompt >> describe this image
image >> image.png
15:52:39.666 INF EncodeForContent:919 | vision cache hit (disk): image.png
15:52:39.666 INF EncodeForContent:928 | vision cache hit (mem): image.png
15:52:39.669 INF SetKVCache:747 | prefill_grpid:3 kv_cache_num:256 precompute_len:0 input_num_token:168
15:52:39.669 INF SetKVCache:749 | current prefill_max_token_num:1152
15:52:39.669 INF SetKVCache:750 | first run
15:52:39.718 INF Run:805 | input token num : 168, prefill_split_num : 2
15:52:39.718 INF Run:845 | prefill chunk p=0 history_len=0 grpid=1 kv_cache_num=0 input_tokens=128
15:52:39.718 INF Run:868 | prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=3
15:52:39.833 INF Run:845 | prefill chunk p=1 history_len=128 grpid=2 kv_cache_num=128 input_tokens=40
15:52:39.833 INF Run:868 | prefill indices shape: p=1 idx_elems=128 idx_rows=1 pos_rows=3
15:52:39.968 INF Run:1010 | ttft: 249.97 ms
<think>

</think>

This image captures three astronauts in space, set against a forest backdrop that resembles a bamboo grove. The lighting is stark and monochromatic, giving the scene a surreal or high-contrast aesthetic.

- **The Astronaut:** In the foreground sits a tall astronaut, viewed from the front. This individual has long blonde hair and is wearing an all-white, puffy space suit with a dark helmet. Their stance is upright, suggesting they are standing in front of the camera or in the scene itself.
- **Background:** Behind the astronaut, the scene transitions into a dense forest. To the immediate left, the foliage is out of focus, creating a sense of depth. The background is filled with tall, thin bamboo stalks and their feathery fronds, which dominate the upper half of the frame. The right half of the background is also mostly obscured by the dense foliage. The entire image is rendered in grayscale with some bright white highlights on the left edge of the frame and a dark, shadowy left edge of the astronaut's head.

Overall, the composition creates a sense of vastness and scale through the large amount of foliage. The bright white highlights are likely from the sunlight hitting the left edge of the frame, which contrasts with the dark shadow on the astronaut.

15:52:51.081 NTC Run:1132 | hit eos,avg 23.76 token/s
15:52:51.081 INF GetKVCache:721 | precompute_len:300, remaining:852
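
The two figures reported in the benchmark table, TTFT and average decode speed, both appear in this log. A small sketch for pulling them out of captured output follows; the regular expressions are written against the line formats shown above and are an assumption about how stable that format is across axllm versions.

```python
import re

TTFT_RE = re.compile(r"ttft:\s*([\d.]+)\s*ms")
AVG_RE = re.compile(r"avg\s+([\d.]+)\s+token/s")


def parse_metrics(log_text: str) -> dict:
    """Extract TTFT (ms) and average decode speed (tokens/s) if present."""
    metrics = {}
    ttft = TTFT_RE.search(log_text)
    if ttft:
        metrics["ttft_ms"] = float(ttft.group(1))
    avg = AVG_RE.search(log_text)
    if avg:
        metrics["tokens_per_sec"] = float(avg.group(1))
    return metrics


# Two lines copied from the log above.
sample = """15:52:39.968 INF Run:1010 | ttft: 249.97 ms
15:52:51.081 NTC Run:1132 | hit eos,avg 23.76 token/s"""
print(parse_metrics(sample))
```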

Video understanding

root@ax650 ~/yongqiang/lhj/huggingface/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047 # ./axllm run ./
15:57:09.147 INF Init:218 | LLM init start
15:57:09.147 INF Init:226 | mixed attention enabled: full_attention_interval=4 ref_full_layer_idx=3
tokenizer_type = 3
 96% | ##############################   |  26 /  27 [3.99s<4.15s, 6.51 count/s] init post axmodel ok,remain_cmm(5877 MB)
15:57:13.141 INF Init:368 | max_token_len : 2047
15:57:13.141 INF Init:371 | kv_cache_size : 512, kv_cache_num: 2047
15:57:13.141 INF Init:374 | prefill_token_num : 128
15:57:13.141 INF Init:379 | grp: 1, prefill_max_kv_cache_num : 1
15:57:13.141 INF Init:379 | grp: 2, prefill_max_kv_cache_num : 128
15:57:13.141 INF Init:379 | grp: 3, prefill_max_kv_cache_num : 256
15:57:13.141 INF Init:379 | grp: 4, prefill_max_kv_cache_num : 384
15:57:13.141 INF Init:379 | grp: 5, prefill_max_kv_cache_num : 512
15:57:13.141 INF Init:379 | grp: 6, prefill_max_kv_cache_num : 640
15:57:13.141 INF Init:379 | grp: 7, prefill_max_kv_cache_num : 768
15:57:13.141 INF Init:379 | grp: 8, prefill_max_kv_cache_num : 896
15:57:13.141 INF Init:379 | grp: 9, prefill_max_kv_cache_num : 1024
15:57:13.141 INF Init:379 | grp: 10, prefill_max_kv_cache_num : 1152
15:57:13.141 INF Init:384 | prefill_max_token_num : 1152
15:57:13.141 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ |  27 /  27 [4.00s<4.00s, 6.76 count/s] embed_selector init ok
15:57:13.241 INF Init:643 | Qwen-VL token ids: vision_start=248053 image_pad=248056 video_pad=248057
15:57:13.241 INF Init:668 | VisionModule init ok: type=Qwen3VL, tokens_per_block=144, embed_size=1024, out_dtype=fp32
15:57:13.241 WRN Init:677 | Vision preprocess backend: SimpleCV (OpenCV not found at build time; minor differences vs OpenCV are possible)
15:57:13.243 INF load_config:282 | load config: 
15:57:13.243 INF load_config:282 | {
15:57:13.243 INF load_config:282 |     "enable_repetition_penalty": false,
15:57:13.243 INF load_config:282 |     "enable_temperature": false,
15:57:13.243 INF load_config:282 |     "enable_top_k_sampling": true,
15:57:13.243 INF load_config:282 |     "enable_top_p_sampling": false,
15:57:13.243 INF load_config:282 |     "penalty_window": 20,
15:57:13.243 INF load_config:282 |     "repetition_penalty": 1.2,
15:57:13.243 INF load_config:282 |     "temperature": 0.9,
15:57:13.243 INF load_config:282 |     "top_k": 10,
15:57:13.243 INF load_config:282 |     "top_p": 0.8
15:57:13.243 INF load_config:282 | }
15:57:13.243 INF Init:448 | LLM init ok
Commands:
  /q, /exit  退出
  /reset     重置 kvcache
  /dd        删除一轮对话
  /pp        打印历史对话
Ctrl+C: 停止当前生成
VLM enabled: after each prompt, input image path (empty = text-only). Use "video:<frames_dir>" for video.
----------------------------------------
prompt >> describe the video
image >> video:video-test-03
15:57:30.531 INF SetKVCache:747 | prefill_grpid:6 kv_cache_num:640 precompute_len:0 input_num_token:600
15:57:30.531 INF SetKVCache:749 | current prefill_max_token_num:1152
15:57:30.531 INF SetKVCache:750 | first run
15:57:30.532 INF Run:805 | input token num : 600, prefill_split_num : 5
15:57:30.532 INF Run:845 | prefill chunk p=0 history_len=0 grpid=1 kv_cache_num=0 input_tokens=128
15:57:30.532 INF Run:868 | prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=3
15:57:30.652 INF Run:845 | prefill chunk p=1 history_len=128 grpid=2 kv_cache_num=128 input_tokens=128
15:57:30.653 INF Run:868 | prefill indices shape: p=1 idx_elems=128 idx_rows=1 pos_rows=3
15:57:30.776 INF Run:845 | prefill chunk p=2 history_len=256 grpid=3 kv_cache_num=256 input_tokens=128
15:57:30.777 INF Run:868 | prefill indices shape: p=2 idx_elems=128 idx_rows=1 pos_rows=3
15:57:30.900 INF Run:845 | prefill chunk p=3 history_len=384 grpid=4 kv_cache_num=384 input_tokens=128
15:57:30.900 INF Run:868 | prefill indices shape: p=3 idx_elems=128 idx_rows=1 pos_rows=3
15:57:31.024 INF Run:845 | prefill chunk p=4 history_len=512 grpid=5 kv_cache_num=512 input_tokens=88
15:57:31.024 INF Run:868 | prefill indices shape: p=4 idx_elems=128 idx_rows=1 pos_rows=3
15:57:31.162 INF Run:1010 | ttft: 630.76 ms
<think>

</think>

The video opens with a woman standing in a modern kitchen. She's wearing an apron with bird designs in pink and orange, suggesting she is either an intern or part-time worker. She is facing the camera with her back to the viewer, standing near a stove.

The focus then shifts to the oven, where a person is using an oven handle to open the door. Smoke or steam is visibly rising from the oven. The scene is slightly obscured by smoke, with some text overlays in pink and white at the top and bottom right of the frame, including social media handles and the "INSIDE EDITION" logo, which suggests this is promotional content. The overall atmosphere conveys a sense of activity inside a home kitchen, possibly preparing a meal. The scene is bright, with the oven being the central point of attention, indicating an active cooking scenario.

15:57:38.505 NTC Run:1132 | hit eos,avg 24.11 token/s
15:57:38.505 INF GetKVCache:721 | precompute_len:645, remaining:507
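
The prefill lines in the logs suggest that the prompt is processed in chunks of `prefill_token_num` tokens, and that the runtime selects the smallest kv-cache group able to hold the whole prompt. The sketch below reproduces that arithmetic. It is inferred from the two runs shown in this document, not taken from the axllm source, so treat the selection rule as an assumption.

```python
PREFILL_TOKEN_NUM = 128  # "prefill_token_num : 128" in the init log
# prefill_max_kv_cache_num for grp 1..10, copied from the init log of the video run
GROUP_KV_CAPACITY = [1, 128, 256, 384, 512, 640, 768, 896, 1024, 1152]


def split_prefill(num_tokens: int, chunk: int = PREFILL_TOKEN_NUM) -> list[int]:
    """Split a prompt into prefill chunks of at most `chunk` tokens."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    full, rest = divmod(num_tokens, chunk)
    return [chunk] * full + ([rest] if rest else [])


def pick_group(num_tokens: int, capacities=GROUP_KV_CAPACITY) -> int:
    """1-based id of the smallest group whose capacity covers the prompt (assumed rule)."""
    for group_id, capacity in enumerate(capacities, start=1):
        if capacity >= num_tokens:
            return group_id
    raise ValueError("prompt exceeds prefill_max_token_num")


for n in (168, 600):  # prompt lengths of the image and video runs
    print(n, split_prefill(n), pick_group(n))
```

For the 600-token video prompt this gives five chunks with a final chunk of 88 tokens and group 6, which matches `prefill_split_num : 5`, `input_tokens=88` and `prefill_grpid:6` in the log. The 168-token image prompt gives two chunks and group 3, matching the earlier run.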

Start the server (OpenAI-compatible)

root@ax650:~# axllm serve AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047
[I][                            Init][ 138]: LLM init start
tokenizer_type = 1
 96% | ███████████████████████████████   |  30 /  31 [4.63s<4.79s, 6.47 count/s] init post axmodel ok,remain_cmm(9563 MB)
[I][                            Init][ 199]: max_token_len : 2047
[I][                            Init][ 202]: kv_cache_size : 1024, kv_cache_num: 2047
[I][                            Init][ 205]: prefill_token_num : 128
[I][                            Init][ 209]: grp: 1, prefill_max_kv_cache_num : 1
[I][                            Init][ 209]: grp: 2, prefill_max_kv_cache_num : 128
[I][                            Init][ 209]: grp: 3, prefill_max_kv_cache_num : 256
[I][                            Init][ 209]: grp: 4, prefill_max_kv_cache_num : 384
[I][                            Init][ 209]: grp: 5, prefill_max_kv_cache_num : 512
[I][                            Init][ 209]: grp: 6, prefill_max_kv_cache_num : 640
[I][                            Init][ 209]: grp: 7, prefill_max_kv_cache_num : 768
[I][                            Init][ 209]: grp: 8, prefill_max_kv_cache_num : 896
[I][                            Init][ 209]: grp: 9, prefill_max_kv_cache_num : 1024
[I][                            Init][ 209]: grp: 10, prefill_max_kv_cache_num : 1152
[I][                            Init][ 214]: prefill_max_token_num : 1152
[I][                            Init][  27]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ |  31 /  31 [4.64s<4.64s, 6.69 count/s] embed_selector init ok
[W][                            Init][ 457]: Qwen-VL vision size override: cfg=448x448 bytes=1204224, model_input_bytes=884736 -> 384x384 (square).
[I][                            Init][ 641]: Qwen-VL token ids: vision_start=151652 image_pad=151655 video_pad=151656
[I][                            Init][ 666]: VisionModule init ok: type=Qwen3VL, tokens_per_block=144, embed_size=2048, out_dtype=fp32
[I][                            Init][ 672]: VisionModule deepstack enabled: layers=3
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": false,
    "enable_top_k_sampling": false,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 272]: LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047'...
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047

OpenAI API call example

from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
completion = client.chat.completions.create(
    model=MODEL,
    messages=messages,
)

print(completion.choices[0].message.content)
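
If the `openai` package is not available on the board, the same request can be sent with the Python standard library alone. The sketch below posts to `/chat/completions` under the base URL, which is the path the OpenAI client uses for `chat.completions.create`. It assumes the server does not require an `Authorization` header and returns the standard OpenAI response shape; the helper names are hypothetical.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047"


def build_chat_request(prompt: str, base_url: str = API_URL, model: str = MODEL) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Requires a running `axllm serve` instance:
# print(chat("hello"))
```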

OpenAI streaming call example

from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
stream = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    stream=True,
)

print("assistant:")
for ev in stream:
    delta = getattr(ev.choices[0], "delta", None)
    if delta and getattr(delta, "content", None):
        print(delta.content, end="", flush=True)
print("\n")
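
The streaming loop above prints pieces as they arrive. If the full reply is also needed as a single string, the deltas can be accumulated instead. This is a minimal sketch: `collect_stream` is a hypothetical helper, and the fake events below only stand in for a real stream so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(stream) -> str:
    """Join the text deltas of a chat completion stream into one string."""
    parts = []
    for ev in stream:
        if not ev.choices:  # skip events that carry no choices
            continue
        delta = getattr(ev.choices[0], "delta", None)
        text = getattr(delta, "content", None) if delta else None
        if text:
            parts.append(text)
    return "".join(parts)


# Offline stand-in for the object returned by create(..., stream=True).
fake_stream = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content="Hel"))]),
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content="lo"))]),
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=None))]),
    SimpleNamespace(choices=[]),
]
print(collect_stream(fake_stream))
```

With a real server, pass the `stream` object from the example above in place of `fake_stream`.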