Nano Compact 3B QKV-FP16
RthItalia/nano_compact_3b_qkvfp16 is the validated compact self-contained variant derived from Qwen/Qwen2.5-3B-Instruct.
This release is not the original overlay artifact. It is the final exported self-contained folder that loads directly with transformers plus trust_remote_code=True.
What This Variant Is
This model uses a mixed runtime policy:
- q_proj, k_proj, v_proj: stored and loaded in fp16
- o_proj and most of the remaining transformer body: stored in Nano compact format
- model.embed_tokens: stored as a single quantized copy
- lm_head: tied custom head over the quantized embeddings
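As a rough illustration of the policy above, the sketch below maps module names to a storage class. The function `storage_policy` and its labels are hypothetical and are not part of the released runtime. The module names follow the usual Qwen2 layout in transformers, and the fallback case over-simplifies, since the card says "most of" the remaining body is compact rather than all of it.

```python
# Hypothetical helper: maps a module name to the storage class described in
# the policy list. Labels and matching rules are assumptions for illustration.
FP16_PROJECTIONS = ("q_proj", "k_proj", "v_proj")


def storage_policy(module_name: str) -> str:
    """Return an illustrative storage label for a Qwen2-style module name."""
    leaf = module_name.split(".")[-1]
    if leaf in FP16_PROJECTIONS:
        return "fp16"
    if leaf == "embed_tokens":
        return "quantized_embedding"
    if leaf == "lm_head":
        return "tied_head"
    # Simplification: everything else is treated as Nano compact format.
    return "nano_compact"


for name in (
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.down_proj",
    "model.embed_tokens",
    "lm_head",
):
    print(f"{name}: {storage_policy(name)}")
```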
The objective of this policy is not maximum compression at any cost. It is the best validated tradeoff found between:
- disk size
- VRAM usage
- quality relative to the true 8bit baseline
Validated Runtime Envelope
Measured on the validated 3B run:
- model size: 2.3432 GB
- allocated after load: 2.3432 GB
- peak generation VRAM: ~2.44 GB
True 8bit baseline used for comparison:
- allocated after load: 3.1703 GB
- peak generation VRAM: ~3.21 GB
So this winner variant preserved a meaningful VRAM advantage over the 8bit baseline while recovering enough quality to pass the smoke comparison used during validation.
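To put a number on that advantage, the short calculation below derives the absolute and relative savings from the figures reported above. The helper `saving` is only an illustration. The inputs are the card's own measurements, and the peak values are approximate in the source.

```python
# Figures copied from the measurements above (GB); peak values are approximate.
NANO_LOAD, NANO_PEAK = 2.3432, 2.44
BASE_LOAD, BASE_PEAK = 3.1703, 3.21


def saving(baseline: float, compact: float) -> tuple[float, float]:
    """Return (absolute saving in GB, saving as a percentage of the baseline)."""
    delta = baseline - compact
    return delta, 100.0 * delta / baseline


load_gb, load_pct = saving(BASE_LOAD, NANO_LOAD)
peak_gb, peak_pct = saving(BASE_PEAK, NANO_PEAK)
print(f"after load: {load_gb:.2f} GB saved ({load_pct:.1f}%)")
print(f"peak generation: {peak_gb:.2f} GB saved ({peak_pct:.1f}%)")
```

This works out to about 0.83 GB (roughly 26%) after load and about 0.77 GB (roughly 24%) at peak generation.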
Quality Claim
The quality claim for this release is intentionally narrow:
- it was compared against the true 8bit baseline on a small internal prompt suite
- it is not claimed to match the full original model in all tasks
- it is not claimed to outperform the base model
During development, more aggressive variants such as:
- fully tied quantized head (tiedq)
- fully quantized attention
reached better size and VRAM numbers but failed the quality gate against the true 8bit reference. qkvfp16 was the first variant that restored acceptable behavior on the reference prompt set while keeping a substantial memory advantage.
How To Load
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
repo_id = "RthItalia/nano_compact_3b_qkvfp16"
tok = AutoTokenizer.from_pretrained(
    repo_id,
    use_fast=True,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    device_map="cuda",
    dtype=torch.float16,
).eval()
Example Generation
messages = [
    {"role": "user", "content": "Explain what a neural network is in exactly 3 simple sentences."}
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inp = tok(text, return_tensors="pt").to(next(model.parameters()).device)
with torch.no_grad():
    out = model.generate(
        **inp,
        max_new_tokens=120,
        do_sample=False,
        repetition_penalty=1.08,
        eos_token_id=tok.eos_token_id,
        pad_token_id=tok.eos_token_id,
    )
print(tok.decode(out[0][inp["input_ids"].shape[-1]:], skip_special_tokens=True))
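The card does not say how its VRAM figures were collected. One plausible way to reproduce "allocated after load" and "peak generation VRAM" is with PyTorch's CUDA memory counters, as sketched below. The helper names `bytes_to_gib` and `measure_peak_gib` are hypothetical, and whether the card's "GB" means 1024**3 or 10**9 bytes is an assumption.

```python
import torch

GIB = 1024 ** 3  # assumption: the card's "GB" may instead mean 10**9 bytes


def bytes_to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / GIB


def measure_peak_gib(fn):
    """Run fn() and return (result, peak allocated GiB); peak is None without CUDA."""
    if not torch.cuda.is_available():
        return fn(), None
    torch.cuda.reset_peak_memory_stats()
    result = fn()
    torch.cuda.synchronize()
    return result, bytes_to_gib(torch.cuda.max_memory_allocated())


# Usage with the `model` and `inp` objects from the example above
# (uncomment after the model is loaded on a CUDA device):
# print("allocated after load:", bytes_to_gib(torch.cuda.memory_allocated()))
# out, peak = measure_peak_gib(lambda: model.generate(**inp, max_new_tokens=120))
# print("peak generation VRAM:", peak)
```

These counters track memory allocated by PyTorch tensors only, so tools such as nvidia-smi will typically report a higher figure.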
Requirements
pip install torch transformers accelerate safetensors
bitsandbytes is not required for this exported winner variant at runtime.
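Before loading the model, it can help to confirm that the packages listed above are importable. The snippet below is a small convenience sketch using only the standard library. `missing_packages` is a hypothetical helper, and it checks for presence only, not for versions.

```python
from importlib.util import find_spec

# Top-level package names from the pip command above.
REQUIRED = ("torch", "transformers", "accelerate", "safetensors")


def missing_packages(names):
    """Return the top-level packages from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("missing:", ", ".join(missing) if missing else "none")
```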
Important Notes
- trust_remote_code=True is required.
- The custom runtime uses a NanoTiedHead implementation that ties output logits to the quantized embedding table without registering the embedding module twice.
- The custom linear layers use chunked forward paths to keep peak VRAM under control.
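The two techniques named in these notes can be sketched in plain PyTorch. The code below is not the released runtime: `chunked_linear` and `TiedHeadSketch` are hypothetical stand-ins, the embedding here is unquantized, and the chunk size is arbitrary. It only shows the general idea of computing a linear layer in slices and of reusing an embedding table as the output head without registering it a second time.

```python
import torch
import torch.nn.functional as F


def chunked_linear(x: torch.Tensor, weight: torch.Tensor, chunk_rows: int = 1024) -> torch.Tensor:
    """Compute x @ weight.T in slices of output rows.

    In a quantized layer each slice would be dequantized just before use,
    so only `chunk_rows` rows need to exist in full precision at a time.
    """
    pieces = [
        F.linear(x, weight[start:start + chunk_rows])
        for start in range(0, weight.shape[0], chunk_rows)
    ]
    return torch.cat(pieces, dim=-1)


class TiedHeadSketch(torch.nn.Module):
    """Hypothetical stand-in for a tied output head (not the real NanoTiedHead)."""

    def __init__(self, embedding: torch.nn.Embedding, chunk_rows: int = 1024):
        super().__init__()
        # Keep the embedding inside a plain list so nn.Module does not
        # register it as a submodule of this head a second time.
        self._embedding_ref = [embedding]
        self.chunk_rows = chunk_rows

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return chunked_linear(hidden, self._embedding_ref[0].weight, self.chunk_rows)


torch.manual_seed(0)
embed = torch.nn.Embedding(50, 8)
head = TiedHeadSketch(embed, chunk_rows=16)
hidden = torch.randn(2, 3, 8)
logits = head(hidden)
print(logits.shape)  # torch.Size([2, 3, 50])
```

Because the embedding is held by reference rather than as a registered submodule, the head contributes no parameters of its own, which is one way to avoid the table appearing twice in the module tree.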
Limitations
- Validation was narrow and engineering-driven, not a full benchmark suite.
- This release is specifically tuned around Qwen/Qwen2.5-3B-Instruct.
- It should be treated as a compact experimental runtime artifact, not as a drop-in scientific proof of broader architectural claims.
License Note
This release is derived from the base model Qwen/Qwen2.5-3B-Instruct, but it should follow the licensing and distribution terms chosen for this Nano release repository.
For that reason the model card metadata uses license: other instead of asserting Apache coverage for the full release package.
Provenance
- base model: Qwen/Qwen2.5-3B-Instruct
- winner policy name: qkvfp16
- published repo: RthItalia/nano_compact_3b_qkvfp16
Evaluation results
Self-reported, on the internal 4-prompt smoke suite:
- model_size_gb: 2.343
- vram_load_gb: 2.343
- vram_peak_generate_gb: 2.440
- baseline_true_8bit_load_gb: 3.170
- baseline_true_8bit_peak_gb: 3.210