Qwen3.5-2B

This version of Qwen3.5-2B has been converted to run on the Axera NPU using w8a16 quantization.

Compatible with Pulsar2 version: 5.0

Conversion tool links:

If you are interested in model conversion, you can try exporting the axmodel yourself from the original repo:

Pulsar2 Link, How to Convert LLM from Huggingface to axmodel

AXera NPU HOST LLM Runtime

Support Platform

Image Process

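| Chips | Input size | Image num | TTFT (168 tokens) | Decode speed (w8a16) | CMM | Flash |
|-------|------------|-----------|-------------------|----------------------|-----|-------|
| AX650 | 384*384 | 1 | 460 ms | 9.0 tokens/sec | 2.4 GiB | 3.46 GiB |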

Video Process

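| Chips | Input size | Image num | TTFT (600 tokens) | Decode speed (w8a16) | CMM | Flash |
|-------|------------|-----------|-------------------|----------------------|-----|-------|
| AX650 | 384*384 | 8 | 1130 ms | 9.0 tokens/sec | 2.4 GiB | 3.46 GiB |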
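As a rough way to read the two benchmark rows, the end-to-end time of a reply can be approximated as the time to first token plus the generated tokens divided by the decode speed. The sketch below is only that arithmetic; the function name and the 300-token reply length are illustrative assumptions, not measurements from this model card.

```python
def estimate_response_seconds(ttft_ms: float, new_tokens: int, tokens_per_sec: float) -> float:
    """Approximate wall time of one reply: prefill (TTFT) plus decode time."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return ttft_ms / 1000.0 + new_tokens / tokens_per_sec


if __name__ == "__main__":
    # TTFT and decode speed come from the tables above; 300 new tokens is a hypothetical reply length.
    for label, ttft_ms in (("image (1 x 384*384)", 460), ("video (8 x 384*384)", 1130)):
        seconds = estimate_response_seconds(ttft_ms, new_tokens=300, tokens_per_sec=9.0)
        print(f"{label}: about {seconds:.1f} s for 300 generated tokens")
```

Decode time dominates for anything longer than a few tokens, so the difference between the image and video TTFT matters mostly for short answers.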

The CMM column is the DDR capacity that the model consumes as CMM memory. Ensure that the CMM memory allocated on the development board is greater than this value.
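The memory requirement can be turned into a small check. The sketch below compares a free-CMM figure (in MiB) against the 2.4 GiB listed in the tables, and also extracts the `remain_cmm(... MB)` value that the runtime log further down prints after loading. The helper names are hypothetical, and treating the log's "MB" as MiB is an assumption.

```python
import re

GIB_IN_MIB = 1024  # 1 GiB = 1024 MiB


def parse_remain_cmm_mb(log_line: str):
    """Return the integer inside 'remain_cmm(<n> MB)', or None if the line has no such field."""
    match = re.search(r"remain_cmm\((\d+)\s*MB\)", log_line)
    return int(match.group(1)) if match else None


def cmm_is_sufficient(free_cmm_mib: float, required_gib: float = 2.4) -> bool:
    """True if the free CMM (before loading the model) exceeds the requirement from the table."""
    return free_cmm_mib > required_gib * GIB_IN_MIB


if __name__ == "__main__":
    line = "init post axmodel ok,remain_cmm(5470 MB)"
    print("remaining after load:", parse_remain_cmm_mb(line), "MB")
    print("4096 MiB free is enough:", cmm_is_sufficient(4096))
```

Note that `remain_cmm` in the log is what is left after the model has been loaded, whereas `cmm_is_sufficient` is meant to be applied to the free CMM before loading.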

How to use

Install axllm

Option 1: clone the repository and run the install script:

git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh

Option 2: one-line install (default branch: axllm):

curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash

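Option 3: download the executable built by GitHub Actions CI (for users without a build environment):

If you do not have a build environment, download the latest CI-built executable (axllm) from https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm, then: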

chmod +x axllm
sudo mv axllm /usr/bin/axllm
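Whichever option is used, it can help to confirm that the `axllm` executable is actually reachable before continuing. A minimal sketch using only the Python standard library follows; the helper name `find_axllm` is hypothetical, and it only checks the PATH, not that the binary runs on your board.

```python
import shutil


def find_axllm(binary_name: str = "axllm"):
    """Return the absolute path of the executable if it is on PATH, otherwise None."""
    return shutil.which(binary_name)


if __name__ == "__main__":
    path = find_axllm()
    print(path if path else "axllm not found on PATH")
```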

Model download (Hugging Face)

First create the model directory and enter it, then download the model into that directory:

mkdir -p AXERA-TECH/Qwen3.5-2B-AX650-C128-P1152-CTX2047
cd AXERA-TECH/Qwen3.5-2B-AX650-C128-P1152-CTX2047
hf download AXERA-TECH/Qwen3.5-2B-AX650-C128-P1152-CTX2047 --local-dir .

# structure of the downloaded files
tree -L 3
`-- AXERA-TECH
    `-- Qwen3.5-2B-AX650-C128-P1152-CTX2047
        |-- qwen3_5_vision.axmodel
        |-- README.md
        |-- config.json
        |-- image.png
        |-- model.embed_tokens.weight.bfloat16.bin
        |-- post_config.json
        |-- qwen3_5_tokenizer.txt
        |-- qwen3_5_text_p128_l0_together.axmodel
        ...
        |-- qwen3_5_text_p128_l23_together.axmodel
        |-- qwen3_5_text_post.axmodel
        `-- vision_cache

3 directories, 39 files
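To catch an incomplete download early, the listing above can be turned into a quick presence check. The sketch below only looks for files that are explicitly named in the listing; the layer files elided by `...` are deliberately not checked, and the function name is a hypothetical helper rather than part of axllm.

```python
from pathlib import Path

# Only the files named explicitly in the listing; elided layer files are not included.
EXPECTED_FILES = [
    "qwen3_5_vision.axmodel",
    "config.json",
    "post_config.json",
    "qwen3_5_tokenizer.txt",
    "model.embed_tokens.weight.bfloat16.bin",
    "qwen3_5_text_p128_l0_together.axmodel",
    "qwen3_5_text_p128_l23_together.axmodel",
    "qwen3_5_text_post.axmodel",
]


def missing_model_files(model_dir):
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files(".")
    print("all listed files present" if not missing else f"missing: {missing}")
```

Run it from inside the model directory, or pass the directory path to `missing_model_files`.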

Inference on an AX650 host, such as the M4N-Dock (爱芯派Pro) or the AX650N DEMO Board

Run (CLI)

root@ax650 ~/Qwen3.5-2B-AX650-C128-P1152-CTX2047 # ./axllm run ./
12:09:43.753 INF Init:218 | LLM init start
12:09:43.753 INF Init:226 | mixed attention enabled: full_attention_interval=4 ref_full_layer_idx=3
tokenizer_type = 3
 96% | ##############################   |  26 /  27 [33.16s<34.43s, 0.78 count/s] init post axmodel ok,remain_cmm(5470 MB)
12:10:16.913 INF Init:368 | max_token_len : 2047
12:10:16.913 INF Init:371 | kv_cache_size : 512, kv_cache_num: 2047
12:10:16.913 INF Init:374 | prefill_token_num : 128
12:10:16.913 INF Init:379 | grp: 1, prefill_max_kv_cache_num : 1
12:10:16.913 INF Init:379 | grp: 2, prefill_max_kv_cache_num : 128
12:10:16.913 INF Init:379 | grp: 3, prefill_max_kv_cache_num : 256
12:10:16.913 INF Init:379 | grp: 4, prefill_max_kv_cache_num : 384
12:10:16.913 INF Init:379 | grp: 5, prefill_max_kv_cache_num : 512
12:10:16.913 INF Init:379 | grp: 6, prefill_max_kv_cache_num : 640
12:10:16.913 INF Init:379 | grp: 7, prefill_max_kv_cache_num : 768
12:10:16.913 INF Init:379 | grp: 8, prefill_max_kv_cache_num : 896
12:10:16.913 INF Init:379 | grp: 9, prefill_max_kv_cache_num : 1024
12:10:16.913 INF Init:379 | grp: 10, prefill_max_kv_cache_num : 1152
12:10:16.913 INF Init:384 | prefill_max_token_num : 1152
12:10:16.913 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ |  27 /  27 [33.16s<33.16s, 0.81 count/s] embed_selector init ok
12:10:21.317 INF Init:643 | Qwen-VL token ids: vision_start=248053 image_pad=248056 video_pad=248057
12:10:21.317 INF Init:668 | VisionModule init ok: type=Qwen3VL, tokens_per_block=144, embed_size=2048, out_dtype=fp32
12:10:21.317 WRN Init:677 | Vision preprocess backend: SimpleCV (OpenCV not found at build time; minor differences vs OpenCV are possible)
12:10:21.322 INF load_config:282 | load config: 
12:10:21.322 INF load_config:282 | {
12:10:21.322 INF load_config:282 |     "enable_repetition_penalty": false,
12:10:21.322 INF load_config:282 |     "enable_temperature": false,
12:10:21.322 INF load_config:282 |     "enable_top_k_sampling": true,
12:10:21.322 INF load_config:282 |     "enable_top_p_sampling": false,
12:10:21.322 INF load_config:282 |     "penalty_window": 20,
12:10:21.322 INF load_config:282 |     "repetition_penalty": 1.2,
12:10:21.322 INF load_config:282 |     "temperature": 0.9,
12:10:21.322 INF load_config:282 |     "top_k": 10,
12:10:21.322 INF load_config:282 |     "top_p": 0.8
12:10:21.322 INF load_config:282 | }
12:10:21.322 INF Init:448 | LLM init ok
Commands:
  /q, /exit  quit
  /reset     reset the kvcache
  /dd        delete the last conversation turn
  /pp        print the conversation history
Ctrl+C: stop the current generation
VLM enabled: after each prompt, input image path (empty = text-only). Use "video:<frames_dir>" for video.
----------------------------------------
prompt >> describe the image
image >> image.png
12:11:48.361 INF EncodeForContent:973 | Qwen-VL pixel_values[0] bytes=884736 min=0 max=238 (w=384 h=384 tp=2 ps=16 sm=2)
12:11:48.513 INF EncodeForContent:996 | vision cache store: image.png
12:11:48.548 INF SetKVCache:747 | prefill_grpid:3 kv_cache_num:256 precompute_len:0 input_num_token:168
12:11:48.548 INF SetKVCache:749 | current prefill_max_token_num:1152
12:11:48.548 INF SetKVCache:750 | first run
12:11:48.591 INF Run:805 | input token num : 168, prefill_split_num : 2
12:11:48.591 INF Run:845 | prefill chunk p=0 history_len=0 grpid=1 kv_cache_num=0 input_tokens=128
12:11:48.591 INF Run:868 | prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=3
12:11:48.807 INF Run:845 | prefill chunk p=1 history_len=128 grpid=2 kv_cache_num=128 input_tokens=40
12:11:48.807 INF Run:868 | prefill indices shape: p=1 idx_elems=128 idx_rows=1 pos_rows=3
12:11:49.052 INF Run:1010 | ttft: 460.28 ms
<think>

</think>

Here is a detailed description of the image:

**Subject and Setting**
The image depicts three figures dressed in full, pristine white spacesuits, standing amidst a dense, overgrown forest. The scene evokes the surreal imagery of an astronaut from a classic comic book or film, exploring a hidden, magical world rather than a barren moon. The trees are thick and leafy, and the ground is covered in what looks like low-lying grasses or ferns that sway slightly, as if caught in a gentle breeze.

**Atmosphere and Mood**
The atmosphere is one of wonder, quiet exploration, and perhaps a hint of the bizarre. The lighting is soft and diffused, suggesting either an overcast sky or the filtered light through the canopy of the forest. The colors are natural greens, browns, and the stark white of the uniforms, which stand out against the earthy tones.

**Details and Composition**
The suits are detailed with zippers, reflective visors, and helmet seals, emphasizing their futuristic and utilitarian nature. The central figure appears to be the focus of the scene, facing slightly right with hands near the suit's side. The other two figures are positioned slightly to the left and right of the central one, forming a balanced group. The framing includes large, out-of-focus elements in the foreground on both sides (likely foliage or a window), which creates a sense of depth and suggests the viewer is looking through a window or observing the scene from a distance.

12:12:23.463 NTC Run:1132 | hit eos,avg 8.98 token/s
12:12:23.466 INF GetKVCache:721 | precompute_len:345, remaining:807
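In the log above, the 168-token prompt is reported with `prefill_split_num : 2`, chunks of 128 and 40 input tokens, and `kv_cache_num:256` in `SetKVCache`. Those numbers are consistent with splitting the prompt into chunks of `prefill_token_num` (128) and picking the smallest listed `prefill_max_kv_cache_num` that fits the prompt. The sketch below reproduces only that arithmetic; it is inferred from the log and is not the runtime's actual implementation.

```python
import math

# prefill_max_kv_cache_num values for grp 1..10, copied from the init log.
PREFILL_CAPACITIES = [1, 128, 256, 384, 512, 640, 768, 896, 1024, 1152]


def split_prefill(input_tokens: int, prefill_token_num: int = 128):
    """Return the number of tokens in each prefill chunk."""
    n_chunks = math.ceil(input_tokens / prefill_token_num)
    return [
        min(prefill_token_num, input_tokens - i * prefill_token_num)
        for i in range(n_chunks)
    ]


def smallest_fitting_capacity(input_tokens: int, capacities):
    """Return the smallest capacity that can hold input_tokens, or None if none can."""
    for capacity in sorted(capacities):
        if capacity >= input_tokens:
            return capacity
    return None


if __name__ == "__main__":
    print(split_prefill(168))
    print(smallest_fitting_capacity(168, PREFILL_CAPACITIES))
```

A prompt longer than the largest capacity (1152, the `prefill_max_token_num` in the log) has no fitting group in this sketch, which returns `None` in that case.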

Start the server (OpenAI-compatible)

root@ax650:~# axllm serve AXERA-TECH/Qwen3.5-2B-AX650-C128-P1152-CTX2047
[I][                            Init][ 138]: LLM init start
tokenizer_type = 1
 96% | ███████████████████████████████   |  30 /  31 [4.63s<4.79s, 6.47 count/s] init post axmodel ok,remain_cmm(9563 MB)
[I][                            Init][ 199]: max_token_len : 2047
[I][                            Init][ 202]: kv_cache_size : 1024, kv_cache_num: 2047
[I][                            Init][ 205]: prefill_token_num : 128
[I][                            Init][ 209]: grp: 1, prefill_max_kv_cache_num : 1
[I][                            Init][ 209]: grp: 2, prefill_max_kv_cache_num : 128
[I][                            Init][ 209]: grp: 3, prefill_max_kv_cache_num : 256
[I][                            Init][ 209]: grp: 4, prefill_max_kv_cache_num : 384
[I][                            Init][ 209]: grp: 5, prefill_max_kv_cache_num : 512
[I][                            Init][ 209]: grp: 6, prefill_max_kv_cache_num : 640
[I][                            Init][ 209]: grp: 7, prefill_max_kv_cache_num : 768
[I][                            Init][ 209]: grp: 8, prefill_max_kv_cache_num : 896
[I][                            Init][ 209]: grp: 9, prefill_max_kv_cache_num : 1024
[I][                            Init][ 209]: grp: 10, prefill_max_kv_cache_num : 1152
[I][                            Init][ 214]: prefill_max_token_num : 1152
[I][                            Init][  27]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ |  31 /  31 [4.64s<4.64s, 6.69 count/s] embed_selector init ok
[W][                            Init][ 457]: Qwen-VL vision size override: cfg=448x448 bytes=1204224, model_input_bytes=884736 -> 384x384 (square).
[I][                            Init][ 641]: Qwen-VL token ids: vision_start=151652 image_pad=151655 video_pad=151656
[I][                            Init][ 666]: VisionModule init ok: type=Qwen3VL, tokens_per_block=144, embed_size=2048, out_dtype=fp32
[I][                            Init][ 672]: VisionModule deepstack enabled: layers=3
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": false,
    "enable_top_k_sampling": false,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 272]: LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/Qwen3.5-2B-AX650-C128-P1152-CTX2047'...
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/Qwen3.5-2B-AX650-C128-P1152-CTX2047

OpenAI API call example

from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3.5-2B-AX650-C128-P1152-CTX2047"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
completion = client.chat.completions.create(
    model=MODEL,
    messages=messages,
)

print(completion.choices[0].message.content)
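Since the model is a VLM, a natural follow-up is sending an image through the same endpoint. The sketch below builds a user message with the standard OpenAI content-part layout (a `text` part plus an `image_url` part carrying a base64 data URL). Whether `axllm serve` accepts `image_url` parts is an assumption that this model card does not confirm, and both helper names are hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path) -> str:
    """Encode a local image file as a base64 data URL."""
    mime = mimetypes.guess_type(str(path))[0] or "application/octet-stream"
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_image_messages(prompt: str, image_path):
    """Build a chat 'messages' list with one text part and one image part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_to_data_url(image_path)}},
            ],
        }
    ]

# Usage with the client from the example above (assumes the server accepts image parts):
# completion = client.chat.completions.create(
#     model=MODEL,
#     messages=build_image_messages("describe the image", "image.png"),
# )
```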

OpenAI streaming call example

from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3.5-2B-AX650-C128-P1152-CTX2047"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
stream = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    stream=True,
)

print("assistant:")
for ev in stream:
    delta = getattr(ev.choices[0], "delta", None)
    if delta and getattr(delta, "content", None):
        print(delta.content, end="", flush=True)
print("\n")
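The streaming loop can also be used to measure latency from the client side, for comparison with the `ttft` and `token/s` figures the runtime logs. The sketch below wraps the same loop in a hypothetical helper, `consume_stream`, that collects the text and records when the first content chunk arrives. Streamed chunks are not guaranteed to be single tokens, so the chunk count is only a rough proxy for the token count.

```python
import time


def consume_stream(stream, started_at=None, clock=time.perf_counter):
    """Collect streamed text plus client-side timing.

    started_at: clock() value taken just before the request was sent; if omitted,
    timing starts when iteration begins, which under-reports the real TTFT.
    """
    start = clock() if started_at is None else started_at
    first_chunk_at = None
    parts = []
    for ev in stream:
        if not ev.choices:
            continue
        delta = getattr(ev.choices[0], "delta", None)
        text = getattr(delta, "content", None) if delta else None
        if text:
            if first_chunk_at is None:
                first_chunk_at = clock()
            parts.append(text)
    end = clock()
    return {
        "text": "".join(parts),
        "chunks": len(parts),
        "ttft_s": None if first_chunk_at is None else first_chunk_at - start,
        "total_s": end - start,
    }

# Usage with the client from the example above:
# t0 = time.perf_counter()
# stream = client.chat.completions.create(model=MODEL, messages=messages, stream=True)
# stats = consume_stream(stream, started_at=t0)
```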