---
language:
  - en
license: other
library_name: transformers
pipeline_tag: text-generation
base_model: Qwen/Qwen2.5-3B-Instruct
tags:
  - qwen2.5
  - quantization
  - mixed-precision
  - custom-code
  - text-generation
  - nanollm
model-index:
  - name: nano_compact_3b_qkvfp16
    results:
      - task:
          type: text-generation
        dataset:
          name: Internal 4-prompt smoke suite
          type: internal
        metrics:
          - type: model_size_gb
            value: 2.3432
          - type: vram_load_gb
            value: 2.3432
          - type: vram_peak_generate_gb
            value: 2.44
          - type: baseline_true_8bit_load_gb
            value: 3.1703
          - type: baseline_true_8bit_peak_gb
            value: 3.21
---

# Nano Compact 3B QKV-FP16

`RthItalia/nano_compact_3b_qkvfp16` is the validated compact self-contained variant derived from `Qwen/Qwen2.5-3B-Instruct`.

This release is not the original overlay artifact. It is the final exported self-contained folder that loads directly with `transformers` plus `trust_remote_code=True`.

## What This Variant Is

This model uses a mixed runtime policy:

- `q_proj`, `k_proj`, `v_proj`: stored and loaded in `fp16`
- `o_proj` and most of the remaining transformer body: stored in Nano compact format
- `model.embed_tokens`: stored as a single quantized copy
- `lm_head`: tied custom head over the quantized embeddings
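The policy above can be read as a mapping from module name to storage format. The following is a minimal sketch of that mapping, not the exported runtime code: the function name `storage_format` and the returned labels are hypothetical, and the catch-all branch simplifies "most of the remaining transformer body" to "everything else".

```python
# Hypothetical illustration of the mixed policy; names and labels are assumptions.
FP16_PROJECTIONS = ("q_proj", "k_proj", "v_proj")


def storage_format(module_name: str) -> str:
    """Return an illustrative storage label for a dotted module name."""
    leaf = module_name.split(".")[-1]
    if leaf in FP16_PROJECTIONS:
        return "fp16"
    if leaf == "embed_tokens":
        return "quantized-single-copy"
    if leaf == "lm_head":
        return "tied-to-quantized-embeddings"
    # Simplification: the card says "most of" the remaining body, not all of it.
    return "nano-compact"


for name in (
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.o_proj",
    "model.embed_tokens",
    "lm_head",
):
    print(f"{name}: {storage_format(name)}")
```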

The objective of this policy is not maximum compression at any cost. It is the best validated tradeoff found between:

- disk size
- VRAM usage
- quality relative to the true `8bit` baseline

## Validated Runtime Envelope

Measured on the validated `3B` run:

- model size: `2.3432 GB`
- allocated after load: `2.3432 GB`
- peak generation VRAM: `~2.44 GB`

True `8bit` baseline used for comparison:

- allocated after load: `3.1703 GB`
- peak generation VRAM: `~3.21 GB`

So this winner variant allocates about `0.83 GB` less after load and about `0.77 GB` less at peak than the `8bit` baseline, while recovering enough quality to pass the smoke comparison used during validation.
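The size of that advantage follows directly from the figures listed above. This short sketch only redoes the arithmetic; the dictionary keys and the `saving` helper are illustrative names, not part of the release.

```python
# Figures copied from the measurements in this card (GB).
nano = {"after_load": 2.3432, "peak_generate": 2.44}
baseline_8bit = {"after_load": 3.1703, "peak_generate": 3.21}


def saving(base: float, new: float) -> tuple[float, float]:
    """Return (absolute saving in GB, relative saving in percent)."""
    diff = base - new
    return diff, 100.0 * diff / base


for key in nano:
    diff_gb, pct = saving(baseline_8bit[key], nano[key])
    print(f"{key}: {diff_gb:.2f} GB less than 8bit ({pct:.1f}% lower)")
```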

## Quality Claim

The quality claim for this release is intentionally narrow:

- it was compared against the true `8bit` baseline on a small internal prompt suite
- it is not claimed to match the full original model in all tasks
- it is not claimed to outperform the base model

During development, more aggressive variants such as:

- fully tied quantized head (`tiedq`)
- fully quantized attention

reached better size and VRAM numbers but failed the quality gate against the true `8bit` reference.  
`qkvfp16` was the first variant that restored acceptable behavior on the reference prompt set while keeping a substantial memory advantage.

## How To Load

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "RthItalia/nano_compact_3b_qkvfp16"

tok = AutoTokenizer.from_pretrained(
    repo_id,
    use_fast=True,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    device_map="cuda",
    dtype=torch.float16,
).eval()
```

## Example Generation

```python
messages = [
    {"role": "user", "content": "Explain what a neural network is in exactly 3 simple sentences."}
]

text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inp = tok(text, return_tensors="pt").to(next(model.parameters()).device)

with torch.no_grad():
    out = model.generate(
        **inp,
        max_new_tokens=120,
        do_sample=False,
        repetition_penalty=1.08,
        eos_token_id=tok.eos_token_id,
        pad_token_id=tok.eos_token_id,
    )

print(tok.decode(out[0][inp["input_ids"].shape[-1]:], skip_special_tokens=True))
```

## Requirements

```bash
pip install torch transformers accelerate safetensors
```

`bitsandbytes` is not required for this exported winner variant at runtime.

## Important Notes

- `trust_remote_code=True` is required.
- The custom runtime uses a `NanoTiedHead` implementation that ties output logits to the quantized embedding table without registering the embedding module twice.
- The custom linear layers use chunked forward paths to keep peak VRAM under control.
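To make the last two notes concrete, here is a minimal sketch of a tied, chunked logit computation. It is not the `NanoTiedHead` source: the function name `chunked_tied_logits`, the `chunk_size` value, and the use of a plain float tensor in place of a quantized table are all assumptions. The idea shown is that logits are computed against slices of the one embedding table, so a quantized runtime would only need to materialize one dequantized slice at a time.

```python
import torch


def chunked_tied_logits(hidden: torch.Tensor,
                        embed_weight: torch.Tensor,
                        chunk_size: int = 1024) -> torch.Tensor:
    """Compute hidden @ embed_weight.T one vocabulary slice at a time.

    embed_weight has shape (vocab, dim) and is the same table used for
    input embeddings, so no second copy of the weights is stored.
    """
    pieces = []
    for start in range(0, embed_weight.shape[0], chunk_size):
        # Assumption: a quantized runtime would dequantize only this slice here.
        w_slice = embed_weight[start:start + chunk_size]
        pieces.append(hidden @ w_slice.T)
    return torch.cat(pieces, dim=-1)


torch.manual_seed(0)
embed = torch.randn(5000, 64)    # toy (vocab, dim) table
hidden = torch.randn(2, 3, 64)   # toy (batch, seq, dim) activations

logits = chunked_tied_logits(hidden, embed, chunk_size=1024)
reference = hidden @ embed.T     # unchunked result for comparison
print(logits.shape, torch.allclose(logits, reference, atol=1e-4))
```

Chunking along the vocabulary axis leaves the result unchanged; it only bounds how much of the weight table has to exist in full precision at once.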

## Limitations

- Validation was narrow and engineering-driven, not a full benchmark suite.
- This release is specifically tuned around `Qwen/Qwen2.5-3B-Instruct`.
- It should be treated as a compact experimental runtime artifact, not as scientific evidence for broader architectural claims.

## License Note

This compact release is derived from the base model `Qwen/Qwen2.5-3B-Instruct`, but it should follow the licensing and distribution terms chosen for this Nano release repository.  
For that reason the model card metadata uses `license: other` instead of asserting Apache coverage for the full release package.

## Provenance

- base model: `Qwen/Qwen2.5-3B-Instruct`
- winner policy name: `qkvfp16`
- published repo: `RthItalia/nano_compact_3b_qkvfp16`