---
language:
- en
license: other
library_name: transformers
pipeline_tag: text-generation
base_model: Qwen/Qwen2.5-3B-Instruct
tags:
- qwen2.5
- quantization
- mixed-precision
- custom-code
- text-generation
- nanollm
model-index:
- name: nano_compact_3b_qkvfp16
  results:
  - task:
      type: text-generation
    dataset:
      name: Internal 4-prompt smoke suite
      type: internal
    metrics:
    - type: model_size_gb
      value: 2.3432
    - type: vram_load_gb
      value: 2.3432
    - type: vram_peak_generate_gb
      value: 2.44
    - type: baseline_true_8bit_load_gb
      value: 3.1703
    - type: baseline_true_8bit_peak_gb
      value: 3.21
---
# Nano Compact 3B QKV-FP16
`RthItalia/nano_compact_3b_qkvfp16` is the validated compact self-contained variant derived from `Qwen/Qwen2.5-3B-Instruct`.
This release is not the original overlay artifact. It is the final exported self-contained folder that loads directly with `transformers` plus `trust_remote_code=True`.
## What This Variant Is
This model uses a mixed runtime policy:
- `q_proj`, `k_proj`, `v_proj`: stored and loaded in `fp16`
- `o_proj` and most of the remaining transformer body: stored in Nano compact format
- `model.embed_tokens`: stored as a single quantized copy
- `lm_head`: tied custom head over the quantized embeddings
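
As an illustration only, the policy above can be read as a lookup from module path to storage format. The function name `storage_policy`, the returned labels, and the fallback rule are hypothetical and are not taken from the release's code. The module paths assume the usual Qwen2 naming in `transformers`. Because the list says "most" of the remaining body is compact, treating every other module as compact is a simplification.

```python
# Hypothetical sketch: maps a module path to the storage format described above.
# None of these names come from the release's actual runtime code.
FP16_PROJECTIONS = {"q_proj", "k_proj", "v_proj"}


def storage_policy(module_name: str) -> str:
    """Return an illustrative storage label for a module path."""
    leaf = module_name.rsplit(".", 1)[-1]
    if leaf in FP16_PROJECTIONS:
        return "fp16"
    if leaf == "embed_tokens":
        return "quantized_single_copy"
    if leaf == "lm_head":
        return "tied_to_quantized_embeddings"
    # Simplification: the card says "most" of the remaining body is compact.
    return "nano_compact"


for name in (
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.o_proj",
    "model.embed_tokens",
    "lm_head",
):
    print(f"{name:34s} -> {storage_policy(name)}")
```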
The objective of this policy is not maximum compression at any cost. It is the best validated tradeoff found between:
- disk size
- VRAM usage
- quality relative to the true `8bit` baseline
## Validated Runtime Envelope
Measured on the validated `3B` run:
- model size: `2.3432 GB`
- allocated after load: `2.3432 GB`
- peak generation VRAM: `~2.44 GB`
True `8bit` baseline used for comparison:
- allocated after load: `3.1703 GB`
- peak generation VRAM: `~3.21 GB`
So this winner variant preserved a meaningful VRAM advantage over the `8bit` baseline while recovering enough quality to pass the smoke comparison used during validation.
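
The size of that advantage can be recomputed from the figures listed above. The snippet below is plain arithmetic on the published numbers; the helper name `saving` is only for illustration.

```python
# Figures copied from the lists above, in GB.
nano = {"load": 2.3432, "peak": 2.44}
baseline_8bit = {"load": 3.1703, "peak": 3.21}


def saving(nano_gb: float, baseline_gb: float) -> tuple[float, float]:
    """Return (absolute saving in GB, saving as a percentage of the baseline)."""
    delta = baseline_gb - nano_gb
    return delta, 100.0 * delta / baseline_gb


for key in ("load", "peak"):
    delta, pct = saving(nano[key], baseline_8bit[key])
    print(f"{key}: {delta:.4f} GB below the 8bit baseline ({pct:.1f}%)")
```

With the published figures this works out to roughly 0.83 GB (about 26%) after load and 0.77 GB (about 24%) at peak generation.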
## Quality Claim
The quality claim for this release is intentionally narrow:
- it was compared against the true `8bit` baseline on a small internal prompt suite
- it is not claimed to match the full original model in all tasks
- it is not claimed to outperform the base model
During development, more aggressive variants such as:
- fully tied quantized head (`tiedq`)
- fully quantized attention
reached better size and VRAM numbers but failed the quality gate against the true `8bit` reference.
`qkvfp16` was the first variant that restored acceptable behavior on the reference prompt set while keeping a substantial memory advantage.
## How To Load
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "RthItalia/nano_compact_3b_qkvfp16"

# trust_remote_code=True is required because the repo ships custom runtime code.
tok = AutoTokenizer.from_pretrained(
    repo_id,
    use_fast=True,
    trust_remote_code=True,
)

# Older transformers releases expect `torch_dtype` instead of `dtype`.
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    device_map="cuda",
    dtype=torch.float16,
).eval()
```
## Example Generation
```python
messages = [
    {"role": "user", "content": "Explain what a neural network is in exactly 3 simple sentences."}
]

# Render the chat template to a prompt string, then tokenize onto the model's device.
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inp = tok(text, return_tensors="pt").to(next(model.parameters()).device)

# Greedy decoding with a mild repetition penalty; gradients are not needed.
with torch.no_grad():
    out = model.generate(
        **inp,
        max_new_tokens=120,
        do_sample=False,
        repetition_penalty=1.08,
        eos_token_id=tok.eos_token_id,
        pad_token_id=tok.eos_token_id,
    )

# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inp["input_ids"].shape[-1]:], skip_special_tokens=True))
```
## Requirements
```bash
pip install torch transformers accelerate safetensors
```
`bitsandbytes` is not required for this exported winner variant at runtime.
## Important Notes
- `trust_remote_code=True` is required.
- The custom runtime uses a `NanoTiedHead` implementation that ties output logits to the quantized embedding table without registering the embedding module twice.
- The custom linear layers use chunked forward paths to keep peak VRAM under control.
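
To make the chunked-forward idea concrete, here is a minimal NumPy sketch, not the release's implementation. It assumes a hypothetical storage layout of `int8` weight rows with one `float32` scale per output row; the actual Nano compact format is not documented in this card. The function name `chunked_quantized_linear` and the `chunk_size` parameter are likewise illustrative. The point is that only one slice of the weight matrix is dequantized at a time, so the temporary floating-point buffer stays small.

```python
import numpy as np


def chunked_quantized_linear(x, q_weight, scales, chunk_size=1024):
    """Compute x @ W.T where W is stored as int8 rows with per-row scales.

    Only `chunk_size` output rows are dequantized at once, so the temporary
    float buffer has shape (chunk_size, in_features), not the full matrix.
    The int8-plus-scale layout is an assumption made for this sketch.
    """
    out_features = q_weight.shape[0]
    out = np.empty((x.shape[0], out_features), dtype=np.float32)
    for start in range(0, out_features, chunk_size):
        stop = min(start + chunk_size, out_features)
        # Dequantize one slice of output rows, use it, then let it be freed.
        w_chunk = q_weight[start:stop].astype(np.float32) * scales[start:stop, None]
        out[:, start:stop] = x @ w_chunk.T
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 16)).astype(np.float32)
q_weight = rng.integers(-127, 128, size=(10, 16), dtype=np.int8)
scales = rng.uniform(0.01, 0.02, size=10).astype(np.float32)

# Reference: dequantize the whole matrix at once.
full = x @ (q_weight.astype(np.float32) * scales[:, None]).T
chunked = chunked_quantized_linear(x, q_weight, scales, chunk_size=4)
print(np.allclose(full, chunked, atol=1e-4))
```

Smaller chunks lower the peak size of the temporary buffer at the cost of more, smaller matrix multiplications.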
## Limitations
- Validation was narrow and engineering-driven, not a full benchmark suite.
- This release is specifically tuned around `Qwen/Qwen2.5-3B-Instruct`.
- It should be treated as a compact experimental runtime artifact, not as scientific proof of broader architectural claims.
## License Note
This compact release is derived from `Qwen/Qwen2.5-3B-Instruct`, but it should follow the licensing and distribution terms chosen for this Nano release repository.
For that reason the model card metadata uses `license: other` instead of asserting Apache coverage for the full release package.
## Provenance
- base model: `Qwen/Qwen2.5-3B-Instruct`
- winner policy name: `qkvfp16`
- published repo: `RthItalia/nano_compact_3b_qkvfp16`
---