Qwen3.5-2B
This version of Qwen3.5-2B has been converted to run on the Axera NPU using w8a16 quantization.
Compatible with Pulsar2 version 5.0.
Conversion tool links:
If you are interested in model conversion, you can export the axmodel yourself from the original repo:
Pulsar2 Link, How to Convert LLM from Huggingface to axmodel
Supported Platforms
- AX650
- AX650N DEMO Board
- M4N-Dock(爱芯派Pro)
- M.2 Accelerator card
Image Processing
| Chip | Input size | Image num | TTFT (168 tokens) | Decode speed (w8a16) | CMM | Flash |
|---|---|---|---|---|---|---|
| AX650 | 384*384 | 1 | 460 ms | 9.0 tokens/sec | 2.4GiB | 3.46GiB |
Video Processing
| Chip | Input size | Image num | TTFT (600 tokens) | Decode speed (w8a16) | CMM | Flash |
|---|---|---|---|---|---|---|
| AX650 | 384*384 | 8 | 1130 ms | 9.0 tokens/sec | 2.4GiB | 3.46GiB |
The required DDR capacity is given by the CMM column, which is the CMM memory the model consumes. Ensure that the CMM memory allocation on the development board is greater than this value.
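As a rough way to read the two tables above, the sketch below turns TTFT and decode speed into an end-to-end latency estimate and a prefill throughput figure. The helper names and the 300-token reply length are illustrative assumptions, and the model treats generation as "TTFT plus a constant decode rate", which is a simplification.

```python
def estimate_latency_s(ttft_ms, decode_tokens_per_s, output_tokens):
    """Approximate wall-clock time for one reply: prefill (TTFT) plus decoding."""
    return ttft_ms / 1000.0 + output_tokens / decode_tokens_per_s


def prefill_tokens_per_s(prompt_tokens, ttft_ms):
    """Approximate prefill throughput implied by a TTFT measurement."""
    return prompt_tokens / (ttft_ms / 1000.0)


# Figures taken from the tables above; 300 output tokens is an arbitrary example.
image_latency = estimate_latency_s(ttft_ms=460, decode_tokens_per_s=9.0, output_tokens=300)
video_latency = estimate_latency_s(ttft_ms=1130, decode_tokens_per_s=9.0, output_tokens=300)

print(f"image: ~{image_latency:.1f} s, prefill ~{prefill_tokens_per_s(168, 460):.0f} tokens/s")
print(f"video: ~{video_latency:.1f} s, prefill ~{prefill_tokens_per_s(600, 1130):.0f} tokens/s")
```

For replies of a few hundred tokens the decode phase dominates, so the difference in TTFT between the image and video cases has little effect on total latency.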
How to use
Install axllm
Option 1: clone the repository and run the install script:
git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh
Option 2: install with a single command (default branch: axllm):
curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash
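Option 3: download the executable built by GitHub Actions CI (suitable for users without a build environment):
If you do not have a build environment, go to:
https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm
Download the latest CI-built executable (axllm), then: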
chmod +x axllm
sudo mv axllm /usr/bin/axllm
Model download (Hugging Face)
First create the model directory and enter it, then download into that directory:
mkdir -p AXERA-TECH/Qwen3.5-2B-AX650-C128-P1152-CTX2047
cd AXERA-TECH/Qwen3.5-2B-AX650-C128-P1152-CTX2047
hf download AXERA-TECH/Qwen3.5-2B-AX650-C128-P1152-CTX2047 --local-dir .
# structure of the downloaded files
tree -L 3
`-- AXERA-TECH
    `-- Qwen3.5-2B-AX650-C128-P1152-CTX2047
        |-- qwen3_5_vision.axmodel
        |-- README.md
        |-- config.json
        |-- image.png
        |-- model.embed_tokens.weight.bfloat16.bin
        |-- post_config.json
        |-- qwen3_5_tokenizer.txt
        |-- qwen3_5_text_p128_l0_together.axmodel
        ...
        |-- qwen3_5_text_p128_l23_together.axmodel
        |-- qwen3_5_text_post.axmodel
        `-- vision_cache
3 directories, 39 files
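A quick sanity check after downloading can catch an interrupted transfer before the model is loaded on the board. The sketch below only looks for the files that are named explicitly in the listing above; the per-layer files hidden behind the `...` are deliberately not checked, and the helper name `missing_files` is an illustrative choice, not part of axllm.

```python
from pathlib import Path

# Files named explicitly in the listing above. The elided per-layer
# .axmodel files are not included because the listing does not name them.
EXPECTED_FILES = [
    "config.json",
    "post_config.json",
    "qwen3_5_tokenizer.txt",
    "model.embed_tokens.weight.bfloat16.bin",
    "qwen3_5_vision.axmodel",
    "qwen3_5_text_p128_l0_together.axmodel",
    "qwen3_5_text_p128_l23_together.axmodel",
    "qwen3_5_text_post.axmodel",
]


def missing_files(model_dir):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Run from inside the model directory; an empty list means nothing is missing.
print(missing_files("."))
```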
Inference with AX650 Host, such as M4N-Dock(爱芯派Pro) or AX650N DEMO Board
Run (CLI)
root@ax650 ~/Qwen3.5-2B-AX650-C128-P1152-CTX2047 # ./axllm run ./
12:09:43.753 INF Init:218 | LLM init start
12:09:43.753 INF Init:226 | mixed attention enabled: full_attention_interval=4 ref_full_layer_idx=3
tokenizer_type = 3
96% | ############################## | 26 / 27 [33.16s<34.43s, 0.78 count/s] init post axmodel ok,remain_cmm(5470 MB)
12:10:16.913 INF Init:368 | max_token_len : 2047
12:10:16.913 INF Init:371 | kv_cache_size : 512, kv_cache_num: 2047
12:10:16.913 INF Init:374 | prefill_token_num : 128
12:10:16.913 INF Init:379 | grp: 1, prefill_max_kv_cache_num : 1
12:10:16.913 INF Init:379 | grp: 2, prefill_max_kv_cache_num : 128
12:10:16.913 INF Init:379 | grp: 3, prefill_max_kv_cache_num : 256
12:10:16.913 INF Init:379 | grp: 4, prefill_max_kv_cache_num : 384
12:10:16.913 INF Init:379 | grp: 5, prefill_max_kv_cache_num : 512
12:10:16.913 INF Init:379 | grp: 6, prefill_max_kv_cache_num : 640
12:10:16.913 INF Init:379 | grp: 7, prefill_max_kv_cache_num : 768
12:10:16.913 INF Init:379 | grp: 8, prefill_max_kv_cache_num : 896
12:10:16.913 INF Init:379 | grp: 9, prefill_max_kv_cache_num : 1024
12:10:16.913 INF Init:379 | grp: 10, prefill_max_kv_cache_num : 1152
12:10:16.913 INF Init:384 | prefill_max_token_num : 1152
12:10:16.913 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ | 27 / 27 [33.16s<33.16s, 0.81 count/s] embed_selector init ok
12:10:21.317 INF Init:643 | Qwen-VL token ids: vision_start=248053 image_pad=248056 video_pad=248057
12:10:21.317 INF Init:668 | VisionModule init ok: type=Qwen3VL, tokens_per_block=144, embed_size=2048, out_dtype=fp32
12:10:21.317 WRN Init:677 | Vision preprocess backend: SimpleCV (OpenCV not found at build time; minor differences vs OpenCV are possible)
12:10:21.322 INF load_config:282 | load config:
12:10:21.322 INF load_config:282 | {
12:10:21.322 INF load_config:282 | "enable_repetition_penalty": false,
12:10:21.322 INF load_config:282 | "enable_temperature": false,
12:10:21.322 INF load_config:282 | "enable_top_k_sampling": true,
12:10:21.322 INF load_config:282 | "enable_top_p_sampling": false,
12:10:21.322 INF load_config:282 | "penalty_window": 20,
12:10:21.322 INF load_config:282 | "repetition_penalty": 1.2,
12:10:21.322 INF load_config:282 | "temperature": 0.9,
12:10:21.322 INF load_config:282 | "top_k": 10,
12:10:21.322 INF load_config:282 | "top_p": 0.8
12:10:21.322 INF load_config:282 | }
12:10:21.322 INF Init:448 | LLM init ok
Commands:
/q, /exit quit
/reset reset the kvcache
/dd delete one round of conversation
/pp print the conversation history
Ctrl+C: stop the current generation
VLM enabled: after each prompt, input image path (empty = text-only). Use "video:<frames_dir>" for video.
----------------------------------------
prompt >> describe the image
image >> image.png
12:11:48.361 INF EncodeForContent:973 | Qwen-VL pixel_values[0] bytes=884736 min=0 max=238 (w=384 h=384 tp=2 ps=16 sm=2)
12:11:48.513 INF EncodeForContent:996 | vision cache store: image.png
12:11:48.548 INF SetKVCache:747 | prefill_grpid:3 kv_cache_num:256 precompute_len:0 input_num_token:168
12:11:48.548 INF SetKVCache:749 | current prefill_max_token_num:1152
12:11:48.548 INF SetKVCache:750 | first run
12:11:48.591 INF Run:805 | input token num : 168, prefill_split_num : 2
12:11:48.591 INF Run:845 | prefill chunk p=0 history_len=0 grpid=1 kv_cache_num=0 input_tokens=128
12:11:48.591 INF Run:868 | prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=3
12:11:48.807 INF Run:845 | prefill chunk p=1 history_len=128 grpid=2 kv_cache_num=128 input_tokens=40
12:11:48.807 INF Run:868 | prefill indices shape: p=1 idx_elems=128 idx_rows=1 pos_rows=3
12:11:49.052 INF Run:1010 | ttft: 460.28 ms
<think>
</think>
Here is a detailed description of the image:
**Subject and Setting**
The image depicts three figures dressed in full, pristine white spacesuits, standing amidst a dense, overgrown forest. The scene evokes the surreal imagery of an astronaut from a classic comic book or film, exploring a hidden, magical world rather than a barren moon. The trees are thick and leafy, and the ground is covered in what looks like low-lying grasses or ferns that sway slightly, as if caught in a gentle breeze.
**Atmosphere and Mood**
The atmosphere is one of wonder, quiet exploration, and perhaps a hint of the bizarre. The lighting is soft and diffused, suggesting either an overcast sky or the filtered light through the canopy of the forest. The colors are natural greens, browns, and the stark white of the uniforms, which stand out against the earthy tones.
**Details and Composition**
The suits are detailed with zippers, reflective visors, and helmet seals, emphasizing their futuristic and utilitarian nature. The central figure appears to be the focus of the scene, facing slightly right with hands near the suit's side. The other two figures are positioned slightly to the left and right of the central one, forming a balanced group. The framing includes large, out-of-focus elements in the foreground on both sides (likely foliage or a window), which creates a sense of depth and suggests the viewer is looking through a window or observing the scene from a distance.
12:12:23.463 NTC Run:1132 | hit eos,avg 8.98 token/s
12:12:23.466 INF GetKVCache:721 | precompute_len:345, remaining:807
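The numbers in the log above fit a simple chunked-prefill picture: the 168-token prompt is split into pieces of at most `prefill_token_num` (128) tokens, and the `remaining` value equals `prefill_max_token_num` minus `precompute_len`. The sketch below reproduces that arithmetic. It is an illustration inferred from the log lines, not the axllm implementation, and the function names are hypothetical.

```python
PREFILL_TOKEN_NUM = 128       # from "prefill_token_num : 128"
PREFILL_MAX_TOKEN_NUM = 1152  # from "prefill_max_token_num : 1152"


def plan_prefill(input_tokens, history_len=0):
    """Split a prompt into prefill chunks of at most PREFILL_TOKEN_NUM tokens."""
    if history_len + input_tokens > PREFILL_MAX_TOKEN_NUM:
        raise ValueError("prompt does not fit in the prefill window")
    chunks = []
    done = 0
    while done < input_tokens:
        size = min(PREFILL_TOKEN_NUM, input_tokens - done)
        chunks.append(
            {"p": len(chunks), "history_len": history_len + done, "input_tokens": size}
        )
        done += size
    return chunks


def remaining_after(precompute_len):
    """Tokens still available in the prefill window after precompute_len tokens."""
    return PREFILL_MAX_TOKEN_NUM - precompute_len


for chunk in plan_prefill(168):
    print(chunk)
# {'p': 0, 'history_len': 0, 'input_tokens': 128}
# {'p': 1, 'history_len': 128, 'input_tokens': 40}

print(remaining_after(345))  # 807, matching "precompute_len:345, remaining:807"
```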
Start the server (OpenAI-compatible)
root@ax650:~# axllm serve AXERA-TECH/Qwen3.5-2B-AX650-C128-P1152-CTX2047
[I][ Init][ 138]: LLM init start
tokenizer_type = 1
96% | ███████████████████████████████ | 30 / 31 [4.63s<4.79s, 6.47 count/s] init post axmodel ok,remain_cmm(9563 MB)
[I][ Init][ 199]: max_token_len : 2047
[I][ Init][ 202]: kv_cache_size : 1024, kv_cache_num: 2047
[I][ Init][ 205]: prefill_token_num : 128
[I][ Init][ 209]: grp: 1, prefill_max_kv_cache_num : 1
[I][ Init][ 209]: grp: 2, prefill_max_kv_cache_num : 128
[I][ Init][ 209]: grp: 3, prefill_max_kv_cache_num : 256
[I][ Init][ 209]: grp: 4, prefill_max_kv_cache_num : 384
[I][ Init][ 209]: grp: 5, prefill_max_kv_cache_num : 512
[I][ Init][ 209]: grp: 6, prefill_max_kv_cache_num : 640
[I][ Init][ 209]: grp: 7, prefill_max_kv_cache_num : 768
[I][ Init][ 209]: grp: 8, prefill_max_kv_cache_num : 896
[I][ Init][ 209]: grp: 9, prefill_max_kv_cache_num : 1024
[I][ Init][ 209]: grp: 10, prefill_max_kv_cache_num : 1152
[I][ Init][ 214]: prefill_max_token_num : 1152
[I][ Init][ 27]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ | 31 / 31 [4.64s<4.64s, 6.69 count/s] embed_selector init ok
[W][ Init][ 457]: Qwen-VL vision size override: cfg=448x448 bytes=1204224, model_input_bytes=884736 -> 384x384 (square).
[I][ Init][ 641]: Qwen-VL token ids: vision_start=151652 image_pad=151655 video_pad=151656
[I][ Init][ 666]: VisionModule init ok: type=Qwen3VL, tokens_per_block=144, embed_size=2048, out_dtype=fp32
[I][ Init][ 672]: VisionModule deepstack enabled: layers=3
[I][ load_config][ 282]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": false,
"enable_top_k_sampling": false,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 10,
"top_p": 0.8
}
[I][ Init][ 272]: LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/Qwen3.5-2B-AX650-C128-P1152-CTX2047'...
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/Qwen3.5-2B-AX650-C128-P1152-CTX2047
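The two logs above print different sampling configs: the CLI run has `enable_top_k_sampling` set to true with `top_k` 10, while the server run has every `enable_*` flag set to false. How axllm consumes these flags is not documented here, so the sketch below only shows a common interpretation: greedy decoding when nothing is enabled, otherwise sampling from the `top_k` highest logits. It ignores `top_p` and the repetition penalty, and `sample_next_token` is a hypothetical name.

```python
import math
import random

# Values copied from the logs above; only the top-k flag differs between them.
SERVER_CONFIG = {
    "enable_temperature": False,
    "enable_top_k_sampling": False,
    "temperature": 0.9,
    "top_k": 10,
}
CLI_CONFIG = dict(SERVER_CONFIG, enable_top_k_sampling=True)


def sample_next_token(logits, config, rng=random):
    """Pick a token id from raw logits using the enabled options."""
    if config["enable_temperature"]:
        logits = [x / config["temperature"] for x in logits]
    # Rank token ids by logit, highest first.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if not config["enable_top_k_sampling"]:
        # Assumption: with no sampling option enabled, decoding is greedy.
        return ranked[0]
    candidates = ranked[: config["top_k"]]
    # Softmax weights over the surviving candidates, shifted by the max for stability.
    peak = logits[candidates[0]]
    weights = [math.exp(logits[i] - peak) for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0, 0.5]
print(sample_next_token(logits, SERVER_CONFIG))  # 1 (greedy picks the largest logit)
```

Under this reading, the server configuration would return the same completion for the same prompt on every call, while the CLI configuration could vary between runs.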
OpenAI API call example
from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3.5-2B-AX650-C128-P1152-CTX2047"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

# The local server does not check the API key, but the client requires one.
client = OpenAI(api_key="not-needed", base_url=API_URL)
completion = client.chat.completions.create(
    model=MODEL,
    messages=messages,
)
print(completion.choices[0].message.content)
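On a board where installing the `openai` package is inconvenient, the same request can be sent with the Python standard library. The sketch below posts to `/chat/completions` under the base URL used above, which is the path the OpenAI client calls. It assumes the server's response follows the standard chat-completions JSON shape, which has not been verified here, and the helper names are illustrative.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3.5-2B-AX650-C128-P1152-CTX2047"


def build_chat_request(messages, stream=False):
    """Build the HTTP request that a chat-completions call would send."""
    payload = {"model": MODEL, "messages": messages, "stream": stream}
    return urllib.request.Request(
        API_URL + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(messages, timeout=120):
    """Send the request and return the assistant text (assumes OpenAI-style JSON)."""
    with urllib.request.urlopen(build_chat_request(messages), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example call, with the server from the previous section running:
# print(chat([{"role": "user", "content": "hello"}]))
```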
OpenAI streaming call example
from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3.5-2B-AX650-C128-P1152-CTX2047"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
stream = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    stream=True,
)

print("assistant:")
# Print each text fragment as soon as it arrives.
for ev in stream:
    delta = getattr(ev.choices[0], "delta", None)
    if delta and getattr(delta, "content", None):
        print(delta.content, end="", flush=True)
print("\n")
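When the full reply is needed as a string, or a client-side time to first token is of interest, the streaming loop can be wrapped in a small helper. The sketch below accepts any iterable of chunk-like objects with the `choices[0].delta.content` shape used in the streaming example, so it can be exercised without a server. `collect_stream` and `fake_chunk` are hypothetical names, and the timer starts when iteration begins, not when the request was sent.

```python
import time
from types import SimpleNamespace


def collect_stream(stream, clock=time.perf_counter):
    """Join streamed text fragments and record the delay before the first one."""
    start = clock()
    first_token_s = None
    parts = []
    for ev in stream:
        if not ev.choices:
            continue  # skip chunks that carry no choices
        delta = getattr(ev.choices[0], "delta", None)
        text = getattr(delta, "content", None) if delta is not None else None
        if text:
            if first_token_s is None:
                first_token_s = clock() - start
            parts.append(text)
    return "".join(parts), first_token_s


def fake_chunk(text):
    """Stand-in for a streamed chunk, used only to demonstrate the helper."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


text, ttft_s = collect_stream([fake_chunk("Hel"), fake_chunk(None), fake_chunk("lo")])
print(text)  # Hello
```

With a real server, pass the `stream` object returned by `client.chat.completions.create(..., stream=True)` in place of the fake chunks.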