---
library_name: rkllm
license: mit
language:
- en
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Llama-8B
tags:
  - rkllm
  - rknn-llm
  - rk3588
  - rockchip
  - edge-ai
  - llm
  - deepseek
pipeline_tag: text-generation
---
# DeepSeek-R1-Distill-Llama-8B — RKLLM build for RK3588 boards

### Built with DeepSeek

**Author:** @jamescallander  
**Source model:** [deepseek-ai/DeepSeek-R1-Distill-Llama-8B · Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B)  
**Target:** Rockchip RK3588 NPU via **RKNN-LLM Runtime**

> This repository hosts a **conversion** of `DeepSeek-R1-Distill-Llama-8B` for use on Rockchip RK3588 single-board computers (Orange Pi 5 Plus, Radxa Rock 5B+, Banana Pi M7, etc.). The conversion was performed with the [RKNN-LLM toolkit](https://github.com/airockchip/rknn-llm).

#### Conversion details

- RKLLM-Toolkit version: v1.2.1
- NPU driver: v0.9.8
- Python: 3.12
- Quantization: `w8a8_g128`
- Output: single-file `.rkllm` artifact
- Tokenizer: not required at runtime (UI handles prompt I/O)
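
To compare your board against the driver version listed above, you can usually read the RKNPU driver version from debugfs. The sketch below assumes that your image exposes `/sys/kernel/debug/rknpu/version` (reading it typically requires root) and that the file contains a line like `RKNPU driver: v0.9.8`. The function name `npu_driver_version` is hypothetical and is not part of the toolkit.

```bash
#!/usr/bin/env bash
# Print the RKNPU driver version, e.g. "v0.9.8".
# Assumption: the kernel exposes it at /sys/kernel/debug/rknpu/version as
# "RKNPU driver: vX.Y.Z". A different file can be passed as $1 (e.g. for testing).
npu_driver_version() {
    local version_file="${1:-/sys/kernel/debug/rknpu/version}"
    # Keep only the "vX.Y.Z" part of the matching line.
    sed -n 's/.*RKNPU driver: *\(v[0-9.]*\).*/\1/p' "$version_file"
}

# Usage (from a root shell): npu_driver_version
```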

## Intended use

- On-device inference on RK3588 SBCs.
- **Reasoning-focused** model — designed to handle multi-step thinking, problem-solving, and structured explanations.
- Well-suited for tasks that need **step-by-step reasoning** or more careful breakdowns than typical instruction models.

## Limitations

- Requires 9 GB of free memory.
- Quantized build (`w8a8_g128`) may show small quality differences vs. full-precision upstream.
- Tested on Radxa Rock 5B+; other devices may require different drivers/toolkit versions.
- While strong at reasoning, the model runs considerably slower on the RK3588 NPU than on high-end GPUs.
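
It can help to confirm that the board has enough free memory before you start the server. The sketch below assumes that the 9 GB requirement applies to the `MemAvailable` figure in `/proc/meminfo` and treats it as 9 GiB. The helper name `has_enough_memory` and its threshold are illustrative and are not part of the toolkit.

```bash
#!/usr/bin/env bash
# Return 0 if MemAvailable is at least 9 GiB, 1 otherwise.
# Assumption: the 9 GB requirement is compared against MemAvailable (reported in kB).
# An alternative meminfo file can be passed as $1 (e.g. for testing).
has_enough_memory() {
    local meminfo="${1:-/proc/meminfo}"
    local required_kb=$((9 * 1024 * 1024))   # 9 GiB expressed in kB
    local available_kb
    available_kb=$(awk '/^MemAvailable:/ {print $2}' "$meminfo")
    [ -n "$available_kb" ] && [ "$available_kb" -ge "$required_kb" ]
}

# Usage: has_enough_memory || echo "Less than 9 GiB available; the model may fail to load." >&2
```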

## Quick start (RK3588)

### 1) Install runtime

The RKNN-LLM toolkit and installation instructions are available from your board manufacturer's website or from [airockchip's GitHub page](https://github.com/airockchip).

Download and install the required packages as per the toolkit's instructions.

### 2) Simple Flask server deployment

The simplest way to deploy the converted `.rkllm` model is with the example Flask server included in the toolkit at `rknn-llm/examples/rkllm_server_demo`:

```bash
python3 <TOOLKIT_PATH>/rknn-llm/examples/rkllm_server_demo/flask_server.py \
  --rkllm_model_path <MODEL_PATH>/DeepSeek-R1-Distill-Llama-8B_w8a8_g128_rk3588.rkllm \
  --target_platform rk3588
```
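
Loading an 8B model can take a while, so a client script may need to wait until the endpoint accepts connections. The minimal polling sketch below assumes that the server listens on port 8080 and that any HTTP reply means the server is up, including an error status returned for a plain GET. `wait_for_server` is a hypothetical helper and is not part of the demo.

```bash
#!/usr/bin/env bash
# Poll a URL once per second until it answers, or give up after max_tries attempts.
# Without -f, curl exits 0 for any HTTP response, so e.g. a 405 on GET still counts as "up".
wait_for_server() {
    local url="$1"
    local max_tries="${2:-120}"
    local i=1
    while [ "$i" -le "$max_tries" ]; do
        if curl -s -o /dev/null --max-time 2 "$url"; then
            return 0
        fi
        sleep 1
        i=$((i + 1))
    done
    return 1
}

# Usage: wait_for_server "http://<SERVER_IP_ADDRESS>:8080/rkllm_chat" && echo "server is up"
```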

### 3) Sending a request

A basic request body has the following format:

```json
{
    "model":"DeepSeek-R1-Distill-Llama-8B",
    "messages":[{
        "role":"user",
        "content":"<YOUR_PROMPT_HERE>"}],
    "stream":false
}
```
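
A hand-written JSON body breaks as soon as a prompt contains quotes or newlines. One workaround, sketched below, is to let Python's `json` module handle the escaping (Python 3 is already required to run the server). `build_request` is a hypothetical helper name, and the model name comes from the example above.

```bash
#!/usr/bin/env bash
# Print a /rkllm_chat request body for a single user prompt ($1).
# python3's json module escapes quotes, backslashes and newlines in the prompt.
build_request() {
    python3 -c '
import json, sys
body = {
    "model": "DeepSeek-R1-Distill-Llama-8B",
    "messages": [{"role": "user", "content": sys.argv[1]}],
    "stream": False,
}
print(json.dumps(body))
' "$1"
}

# Usage:
#   curl -s -X POST "http://<SERVER_IP_ADDRESS>:8080/rkllm_chat" \
#       -H 'Content-Type: application/json' \
#       -d "$(build_request 'In 2 or 3 sentences, who was Napoleon Bonaparte?')"
```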

Example request using `curl`:

```bash
curl -s -X POST <SERVER_IP_ADDRESS>:8080/rkllm_chat \
    -H 'Content-Type: application/json' \
    -d '{"model":"DeepSeek-R1-Distill-Llama-8B","messages":[{"role":"user","content":"In 2 or 3 sentences, who was Napoleon Bonaparte?"}],"stream":false}'
```

The response is formatted as follows:

```json
{
    "choices":[{
        "finish_reason":"stop",
        "index":0,
        "logprobs":null,
        "message":{
            "content":"<MODEL_REPLY_HERE>",
            "role":"assistant"}}],
    "created":null,
    "id":"rkllm_chat",
    "object":"rkllm_chat",
    "usage":{
        "completion_tokens":null,
        "prompt_tokens":null,
        "total_tokens":null}
}
```
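
The reply text is at `choices[0].message.content`, so a client script can extract it directly from the response. The minimal sketch below also assumes that `python3` is available. `extract_reply` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Read a /rkllm_chat JSON response on stdin and print the assistant's reply text.
extract_reply() {
    python3 -c '
import json, sys
response = json.load(sys.stdin)
print(response["choices"][0]["message"]["content"])
'
}

# Usage ($BODY holds a request body like the one shown earlier):
#   curl -s -X POST "http://<SERVER_IP_ADDRESS>:8080/rkllm_chat" \
#       -H 'Content-Type: application/json' -d "$BODY" | extract_reply
```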

Example response:

```json
{"choices":[{"finish_reason":"stop","index":0,"logprobs":null,"message":{"content":"","role":"assistant"}}],"created":null,"id":"rkllm_chat","object":"rkllm_chat","usage":{"completion_tokens":null,"prompt_tokens":null,"total_tokens":null}} 
```

#### Note on reasoning traces

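This model outputs **intermediate reasoning text** (e.g., chains of thought) before its final response. This text is enclosed between `<think>` and `</think>` tags. Depending on the chat template, only the closing `</think>` tag may appear in the output.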

- Many OpenAI-compatible UIs automatically **suppress or hide this internal reasoning**.
- If your client does not, you may see the reasoning steps along with the final answer.
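
If your client displays the reasoning trace and you only want the final answer, you can drop everything up to the last `</think>` tag. The minimal sketch below assumes that the tag appears verbatim in the reply text. `strip_reasoning` is a hypothetical helper. It leaves replies without the tag unchanged, apart from trimming surrounding whitespace.

```bash
#!/usr/bin/env bash
# Read a model reply on stdin and print only the text after the last </think> tag.
# If no tag is present, the whole reply is printed (surrounding whitespace trimmed).
strip_reasoning() {
    python3 -c '
import sys
text = sys.stdin.read()
# rpartition returns ("", "", text) when the tag is missing, so [2] is the answer part.
print(text.rpartition("</think>")[2].strip())
'
}

# Usage: printf '%s' "$REPLY_TEXT" | strip_reasoning   # REPLY_TEXT holds the model reply
```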

### 4) UI compatibility

This server exposes an **OpenAI-compatible Chat Completions API**.

You can connect it to any OpenAI-compatible client or UI (for example, [Open WebUI](https://github.com/open-webui/open-webui)).

- Configure your client with the API base `http://<SERVER_IP_ADDRESS>:8080` and the endpoint `/rkllm_chat`.
- Make sure the `model` field matches the converted model’s name, for example:

```json
{
 "model": "DeepSeek-R1-Distill-Llama-8B",
 "messages": [{"role":"user","content":"Hello!"}],
 "stream": false
}
```

## License

This conversion is released under the [MIT License](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/mit.md).

- Attribution: **Built with DeepSeek-R1-Distill-Llama-8B (DeepSeek-AI)**
- Required notice: see [`NOTICE`](NOTICE)
- Modifications: quantization (w8a8_g128), export to `.rkllm` format for RK3588 SBCs