Instructions for using RthItalia/NanoLLM-Qwen2.5-3B-v3.1 with libraries, inference servers, and local apps.

Libraries

How to use RthItalia/NanoLLM-Qwen2.5-3B-v3.1 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="RthItalia/NanoLLM-Qwen2.5-3B-v3.1")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

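# Load model directly
# Qwen2.5 is a text-only causal LM, so AutoModelForCausalLM is the matching class.
# The model card for this repo loads with subfolder="full_single"; add that
# argument to both from_pretrained calls if loading from the repo root fails.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("RthItalia/NanoLLM-Qwen2.5-3B-v3.1")
model = AutoModelForCausalLM.from_pretrained("RthItalia/NanoLLM-Qwen2.5-3B-v3.1")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Render the chat template to token IDs and move them to the model's device
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))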

Local Apps

vLLM

How to use RthItalia/NanoLLM-Qwen2.5-3B-v3.1 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "RthItalia/NanoLLM-Qwen2.5-3B-v3.1"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RthItalia/NanoLLM-Qwen2.5-3B-v3.1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
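
The same request can be sent from Python. The following is a minimal sketch using only the standard library, assuming the server started above is listening on `localhost:8000` and returns the usual OpenAI-style response shape. The helper names `build_chat_request` and `chat` are hypothetical, introduced here for illustration only.

```python
import json
import urllib.request

MODEL_ID = "RthItalia/NanoLLM-Qwen2.5-3B-v3.1"


def build_chat_request(prompt, base_url="http://localhost:8000", model=MODEL_ID):
    """Build a POST request for the OpenAI-compatible chat completions route.

    Hypothetical helper: mirrors the JSON body of the curl call above.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Requires a running server; assumes an OpenAI-style response body.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("What is the capital of France?")` should return the reply text. For a server on another port, such as the SGLang default of 30000, pass `base_url="http://localhost:30000"`.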

Use Docker

docker model run hf.co/RthItalia/NanoLLM-Qwen2.5-3B-v3.1

SGLang

How to use RthItalia/NanoLLM-Qwen2.5-3B-v3.1 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "RthItalia/NanoLLM-Qwen2.5-3B-v3.1" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RthItalia/NanoLLM-Qwen2.5-3B-v3.1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "RthItalia/NanoLLM-Qwen2.5-3B-v3.1" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RthItalia/NanoLLM-Qwen2.5-3B-v3.1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use RthItalia/NanoLLM-Qwen2.5-3B-v3.1 with Docker Model Runner:
```
docker model run hf.co/RthItalia/NanoLLM-Qwen2.5-3B-v3.1
```

RthItalia committed on May 1

Commit a008b2c (verified), 1 parent: 69b9998

Update model card for full single safetensors

Files changed (1)

README.md +17 -47

README.md CHANGED

@@ -1,58 +1,28 @@
 ---
-language:
-  - en
-  - zh
-  - it
 license: other
-tags:
-  - quantization
-  - qwen
-  - qwen2.5
-  - mixed-precision
-  - inference
 library_name: transformers
-pipeline_tag: text-generation
 ---
-# NanoLLM Qwen v3.1
-NanoLLM v3.1 artifacts are compact overlay artifacts for Qwen2.5 models. The loader starts from the base model in bitsandbytes 8-bit mode, then replaces the modules that passed the NanoLLM cascade with `TrueQuantLinear` modules.
-## Validated Artifacts
-| Model | Artifact | Zip size | Gate | Avg cosine | Min cosine | Locked / 8-bit pending |
-| --- | --- | ---: | --- | ---: | ---: | ---: |
-| Qwen2.5-3B-Instruct | `final_artifact_3B.zip` | 799,189,680 bytes | PASS | 0.990625 | 0.984375 | 143 / 109 |
-| Qwen2.5-7B-Instruct | `final_artifact_7B.zip` | 891,419,698 bytes | PASS | 0.990625 | 0.98046875 | 66 / 130 |
-| Qwen2.5-14B-Instruct | `final_artifact_Qwen2.5-14B-Instruct_pruned_pass.zip` | 1,482,019,132 bytes | PASS | 0.990625 | 0.98046875 | 76 / 260 |
-The current release gate checks average next-token-logit cosine similarity against the 8-bit reference: `avg >= 0.99`. Minimum cosine is reported as a diagnostic.
-## Quick Start
 ```python
-from load_artifact import load_artifact
-model, tokenizer, spec = load_artifact("final_artifact_Qwen2.5-14B-Instruct")
-prompt = "Write a Python function to sort a list using bubble sort."
-inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
-outputs = model.generate(**inputs, max_new_tokens=160, do_sample=False)
-print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
-Requirements:
-```bash
-pip install torch transformers accelerate bitsandbytes safetensors
-```
-## Runtime Notes
-- `build_reference_mode`: `8bit`
-- `reference_scope`: `original_baseline`
-- `pending_policy`: `leave_in_base_8bit`
-- `NANO_LOAD_4BIT=1` can be used experimentally to load the base model in 4-bit, but the release tests use 8-bit.
-## License
-The NanoLLM quantization pipeline is proprietary/internal. Generated artifacts are published for research and evaluation subject to the repository license terms.

 ---
 license: other
 library_name: transformers
+base_model: Qwen/Qwen2.5-3B-Instruct
+tags:
+- nanollm
+- qwen2.5
+- safetensors
+- text-generation
 ---
+# NanoLLM Qwen2.5-3B-Instruct v3.1
+Self-contained full NanoLLM model is in `full_single/`.
+Usage:
 ```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+repo_id = "RthItalia/NanoLLM-Qwen2.5-3B-v3.1"
+tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="full_single", use_fast=True)
+model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="full_single", device_map="auto")
 ```
+Validation against 8-bit reference:
+- avg cosine: 0.98984375
+- min cosine: 0.984375
+- gate: avg >= 0.985
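
The model card does not say how the validation figures are computed. As a rough illustration only, a gate of this kind could compare next-token logit vectors from the candidate model and the 8-bit reference over a set of prompts, then threshold the average cosine similarity. The function names, the input layout (one logit vector per prompt), and the choice of prompts are all assumptions here, not the author's pipeline.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length logit vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_gate(candidate_logits, reference_logits, threshold=0.985):
    """Hypothetical release gate: pass when the average cosine meets the threshold.

    Each argument is a list with one next-token logit vector per prompt.
    The minimum is returned as a diagnostic and does not affect the verdict.
    """
    sims = [cosine(c, r) for c, r in zip(candidate_logits, reference_logits)]
    avg = sum(sims) / len(sims)
    return {"avg": avg, "min": min(sims), "passed": avg >= threshold}
```

Under this reading, the reported average of 0.98984375 clears the 0.985 threshold, and the minimum of 0.984375 is reported without being gated on.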