| library_name: rkllm | |
| license: mit | |
| language: | |
| - en | |
| base_model: | |
| - deepseek-ai/DeepSeek-R1-Distill-Llama-8B | |
| tags: | |
| - rkllm | |
| - rknn-llm | |
| - rk3588 | |
| - rockchip | |
| - edge-ai | |
| - llm | |
| - deepseek | |
| pipeline_tag: text-generation | |
| # DeepSeek-R1-Distill-Llama-8B — RKLLM build for RK3588 boards | |
| ### Built with DeepSeek | |
| **Author:** @jamescallander | |
| **Source model:** [deepseek-ai/DeepSeek-R1-Distill-Llama-8B · Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B) | |
| **Target:** Rockchip RK3588 NPU via **RKNN-LLM Runtime** | |
| > This repository hosts a **conversion** of `DeepSeek-R1-Distill-Llama-8B` for use on Rockchip RK3588 single-board computers (Orange Pi 5 Plus, Radxa Rock 5B+, Banana Pi M7, etc.). Conversion was performed using the [RKNN-LLM toolkit](https://github.com/airockchip/rknn-llm?utm_source=chatgpt.com). | |
| #### Conversion details | |
| - RKLLM-Toolkit version: v1.2.1 | |
| - NPU driver: v0.9.8 | |
| - Python: 3.12 | |
| - Quantization: `w8a8_g128` | |
| - Output: single-file `.rkllm` artifact | |
| - Tokenizer: not required at runtime (UI handles prompt I/O) | |
| ## Intended use | |
| - On-device inference on RK3588 SBCs. | |
| - **Reasoning-focused** model — designed to handle multi-step thinking, problem-solving, and structured explanations. | |
| - Well-suited for tasks that need **step-by-step reasoning** or more careful breakdowns than typical instruction models. | |
| ## Limitations | |
| - Requires 9 GB of free memory. | |
| - Quantized build (`w8a8_g128`) may show small quality differences vs. full-precision upstream. | |
| - Tested on Radxa Rock 5B+; other devices may require different drivers/toolkit versions. | |
| - While strong at reasoning, performance is limited by RK3588’s NPU compared to high-end GPUs. | |
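Before loading the model, it can help to confirm that the board has enough memory available. The sketch below parses `/proc/meminfo`-style text and compares `MemAvailable` against the 9 GB figure from the list above. The helper name `available_memory_gib` and the constant `REQUIRED_GIB` are illustrative, and reading "9GB" as 9 GiB is an assumption.

```python
REQUIRED_GIB = 9.0  # from the limitation above; treating "9GB" as GiB is an assumption


def available_memory_gib(meminfo_text: str) -> float:
    """Return the MemAvailable value from /proc/meminfo-style text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])  # the kernel reports this value in kB (KiB)
            return kib / (1024 ** 2)
    raise ValueError("MemAvailable not found in meminfo text")


def has_enough_memory(meminfo_text: str, required_gib: float = REQUIRED_GIB) -> bool:
    """True if the reported available memory meets the requirement."""
    return available_memory_gib(meminfo_text) >= required_gib
```

On the board itself, this could be called as `has_enough_memory(open("/proc/meminfo").read())`.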
| ## Quick start (RK3588) | |
| ### 1) Install runtime | |
| The RKNN-LLM toolkit and its instructions are available from your development board manufacturer's website or from [airockchip's GitHub page](https://github.com/airockchip). | |
| Download and install the required packages as per the toolkit's instructions. | |
| ### 2) Simple Flask server deployment | |
| The simplest way to deploy the converted `.rkllm` model is to use the example script provided in the toolkit, in the directory `rknn-llm/examples/rkllm_server_demo`: | |
| ```bash | |
| python3 <TOOLKIT_PATH>/rknn-llm/examples/rkllm_server_demo/flask_server.py \ | |
| --rkllm_model_path <MODEL_PATH>/DeepSeek-R1-Distill-Llama-8B_w8a8_g128_rk3588.rkllm \ | |
| --target_platform rk3588 | |
| ``` | |
| ### 3) Sending a request | |
| A basic message request has the following format: | |
| ```json | |
| { | |
| "model":"DeepSeek-R1-Distill-Llama-8B", | |
| "messages":[{ | |
| "role":"user", | |
| "content":"<YOUR_PROMPT_HERE>"}], | |
| "stream":false | |
| } | |
| ``` | |
| Example request using `curl`: | |
| ```bash | |
| curl -s -X POST <SERVER_IP_ADDRESS>:8080/rkllm_chat \ | |
| -H 'Content-Type: application/json' \ | |
| -d '{"model":"DeepSeek-R1-Distill-Llama-8B","messages":[{"role":"user","content":"In 2 or 3 sentences, who was Napoleon Bonaparte?"}],"stream":false}' | |
| ``` | |
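The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming the server is reachable over plain HTTP on port 8080 as in the `curl` example. The function names `build_payload` and `send_chat` and the 300-second timeout are illustrative choices, not part of the toolkit.

```python
import json
import urllib.request

MODEL_NAME = "DeepSeek-R1-Distill-Llama-8B"


def build_payload(prompt: str, model: str = MODEL_NAME) -> dict:
    """Build the request body in the format shown above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }


def send_chat(server: str, prompt: str, timeout: float = 300.0) -> dict:
    """POST a prompt to the /rkllm_chat endpoint and return the parsed JSON reply.

    The generous timeout is an assumption: on-device generation can be slow.
    """
    request = urllib.request.Request(
        f"http://{server}:8080/rkllm_chat",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `send_chat("<SERVER_IP_ADDRESS>", "In 2 or 3 sentences, who was Napoleon Bonaparte?")` mirrors the `curl` call.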
| The response is formatted as follows: | |
| ```json | |
| { | |
| "choices":[{ | |
| "finish_reason":"stop", | |
| "index":0, | |
| "logprobs":null, | |
| "message":{ | |
| "content":"<MODEL_REPLY_HERE>", | |
| "role":"assistant"}}], | |
| "created":null, | |
| "id":"rkllm_chat", | |
| "object":"rkllm_chat", | |
| "usage":{ | |
| "completion_tokens":null, | |
| "prompt_tokens":null, | |
| "total_tokens":null} | |
| } | |
| ``` | |
| Example response: | |
| ```json | |
| {"choices":[{"finish_reason":"stop","index":0,"logprobs":null,"message":{"content":"","role":"assistant"}}],"created":null,"id":"rkllm_chat","object":"rkllm_chat","usage":{"completion_tokens":null,"prompt_tokens":null,"total_tokens":null}} | |
| ``` | |
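Given the response layout above, the model's reply sits at `choices[0].message.content`. The following sketch pulls it out of a parsed response. The helper name `extract_reply` is hypothetical, and the empty-`choices` check is a defensive assumption rather than documented server behavior.

```python
def extract_reply(response: dict) -> str:
    """Return the assistant's text from a parsed /rkllm_chat response."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]
```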
| #### Note on reasoning traces | |
| This model outputs **intermediate reasoning text** (e.g., chains of thought) before its final response, delimited by `<think>` and `</think>` tags. | |
| - Many OpenAI-compatible UIs automatically **suppress or hide this internal reasoning**. | |
| - If your client does not, you may see the reasoning steps along with the final answer. | |
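If your client shows the raw output, the reasoning can be separated from the final answer by splitting on the closing `</think>` tag. This is a minimal sketch, assuming the reasoning always precedes the answer; the function name `split_reasoning` is illustrative. It also tolerates output where the opening `<think>` tag is missing, which is an assumption about how some chat templates behave.

```python
def split_reasoning(content: str) -> tuple[str, str]:
    """Split model output into (reasoning, answer).

    Everything before the last `</think>` is treated as reasoning.
    If no closing tag is present, the whole text is treated as the answer.
    """
    reasoning, separator, answer = content.rpartition("</think>")
    if not separator:
        return "", content.strip()
    # Drop the opening tag if the server included it.
    reasoning = reasoning.replace("<think>", "", 1).strip()
    return reasoning, answer.strip()
```

Displaying only the second element of the returned tuple hides the reasoning, much as the UIs mentioned above do.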
| ### 4) UI compatibility | |
| This server exposes an **OpenAI-compatible Chat Completions API**. | |
| You can connect it to any OpenAI-compatible client or UI (for example, [Open WebUI](https://github.com/open-webui/open-webui?utm_source=chatgpt.com)). | |
| - Configure your client with the API base: `http://<SERVER_IP_ADDRESS>:8080` and use the endpoint: `/rkllm_chat` | |
| - Make sure the `model` field matches the converted model’s name, for example: | |
| ```json | |
| { | |
| "model": "DeepSeek-R1-Distill-Llama-8B", | |
| "messages": [{"role":"user","content":"Hello!"}], | |
| "stream": false | |
| } | |
| ``` | |
| # License | |
| This conversion follows the [MIT License](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/mit.md). | |
| - Attribution: **Built with DeepSeek-R1-Distill-Llama-8B (DeepSeek-AI)** | |
| - Required notice: see [`NOTICE`](NOTICE) | |
| - Modifications: quantization (w8a8_g128), export to `.rkllm` format for RK3588 SBCs |