HuggingFaceFW/fineweb-edu
Viewer β’ Updated β’ 3.5B β’ 440k β’ 1.15k
An ultra-compact 9-million parameter LLM trained from scratch. Based on the LLAMA architecture and trained on 6.5 Billion tokens of high-quality web data (Fineweb-edu and Fineweb-HQ) and synthetic textbooks and stories (Cosmopedia). michel-nano-v2 has a context length of 1024 tokens.
| Dataset | Weight |
|---|---|
HuggingFaceFW/fineweb-edu |
50% |
epfml/FineWeb-HQ |
30% |
HuggingFaceTB/cosmopedia (stories split) |
20% |
michel-nano-v2 uses a custom bpe tokenizer trained on 100_000 samples from the training data mixture with a vocab size of 10k + chatml special tokens.
All benchmarks are zero-shot and use normalized accuracy.
| Maker | Model | Hellaswag | ARC (easy) | PIQA | BLiMP | Average |
|---|---|---|---|---|---|---|
| finnianx | Michel-Nano-v2 | 27.40% | 35.90% | 56.75% | 72.52% | 48.14% |
| Axiomic Labs | GPT-S-5M | 27.39% | 33.16% | 57.13% | 72.21% | 47.47% |
| EleutherAI | pythia-31m | 27.14% | 33.88% | 56.26% | 67.78% | 46.27% |
| EleutherAI | pythia-14m | 26.20% | 32.28% | 55.88% | 66.75% | 45.28% |
| SupraLabs | Supra-Mini-v5-8M | 26.38% | 33.33% | 54.03% | 63.83% | 44.39% |
| LH-Tech-AI | Spark-5M-Base-v4 | 27.03% | 33.21% | 53.43% | 62.17% | 43.96% |
| SupraLabs | Supra-Mini-v4-2M | 25.52% | 30.98% | 51.90% | 60.57% | 42.24% |
michel-nano-v2 is intended to be a base for finetuning. it is a base model with zero post-training, making it very versatile.
Any tasks requiring more than basic logic
Instruction following (this is a base model)
Multilingual usage (without finetuning)