	---
	license: apache-2.0
	language:
	- en
	- code
	library_name: transformers
	pipeline_tag: text-generation
	tags:
	- smallcoder
	- code-llm
	- code-generation
	- sft
	- pretraining
	- tpu
	- 303m
	- trc
	datasets:
	- HuggingFaceFW/fineweb-edu
	- nvidia/Nemotron-Pretraining-SFT-v1
	- bigcode/starcoderdata
	- nvidia/Nemotron-Pretraining-Code-v1
	- HuggingFaceFW/finewiki
	- open-web-math/open-web-math
	- nvidia/Nemotron-CC-Math-v1
	- nvidia/OpenCodeInstruct
	- nvidia/OpenMathInstruct-2
	---

	# 🧠 SmallCoder (303M)

	SmallCoder is a 303M parameter LLaMA-style language model trained from scratch for code generation and algorithmic reasoning.

	This checkpoint represents a 6B-token Supervised Fine-Tuning (SFT) run that fixed a critical End-of-Sequence (EOS) token bug from earlier versions.

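	Despite its compact size, SmallCoder scores 27.4 % on HumanEval and 31.0 % on MBPP (pass@1). The author reports this as state-of-the-art (SOTA) coding performance for models under 500M parameters, in the same range as several 1B–7B parameter LLMs.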

	> Trained with support from Google’s TPU Research Cloud (TRC) program.

	---

	## 🚀 Key Results

	| Model | Size | HumanEval (pass@1) | MBPP (pass@1) |
	|:------|:----:|:------------------:|:--------------:|
	| SmallCoder (Stage 4.1) | 303M | 27.4 % | 31.0 % |
	| TinyLlama-1.1B | 1.1B | ~26.4 % | ~27.6 % |
	| MPT-1B-Instruct | 1.0B | ~22.0 % | ~25.0 % |
	| Zephyr-1.3B-SFT | 1.3B | 31.0 % | 34.0 % |
	| Mistral-7B-Base | 7B | 30.5 % | 47.5 % |

	> ⚖️ SmallCoder scores within about 3 points of Mistral-7B-Base on HumanEval (27.4 % vs 30.5 %) while being roughly 23× smaller. On MBPP the gap is larger (31.0 % vs 47.5 %).

	---

	## 🧬 Model Architecture

	A LLaMA-type causal decoder with standard Multi-Head Attention (MHA).

	```python
	LlamaConfig(
	    vocab_size=49152,  # StarCoder tokenizer
	    hidden_size=768,
	    num_hidden_layers=24,
	    num_attention_heads=8,
	    num_key_value_heads=8,  # equal to num_attention_heads, i.e. plain MHA
	    intermediate_size=3072,
	    max_position_embeddings=1024,
	)
	```

	| Parameter | Value |
	| ----------------- | ------------------------------ |
	| Total parameters | ≈ 303 M |
	| Context length | 1 024 tokens |
	| Tokenizer | `bigcode/starcoder` |
	| Architecture type | LLaMA (MHA, non-GQA) |
	| Precision | bfloat16 |
	| Optimizer | AdamW XLA |
	| Hardware | TPU v4-32 (TRC) |
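
As a rough sanity check on the ≈ 303 M figure, the parameter count can be estimated from the config values listed in this section. This is a sketch, not the author's own accounting: `llama_param_count` is a hypothetical helper, and it assumes untied input/output embeddings, bias-free linear layers, and one RMSNorm weight vector per norm, none of which the card states explicitly.

```python
def llama_param_count(vocab_size, hidden_size, num_layers, intermediate_size,
                      tie_embeddings=False):
    """Estimate parameters of a LLaMA-style decoder with full multi-head attention."""
    # Attention: q, k, v, and o projections, each hidden_size x hidden_size (MHA, no GQA).
    attention = 4 * hidden_size * hidden_size
    # Gated MLP: gate, up, and down projections.
    mlp = 3 * hidden_size * intermediate_size
    # Two RMSNorm weight vectors per layer.
    norms = 2 * hidden_size
    per_layer = attention + mlp + norms

    embeddings = vocab_size * hidden_size
    # Assumption: the output head is a separate matrix unless tie_embeddings is set.
    lm_head = 0 if tie_embeddings else vocab_size * hidden_size
    final_norm = hidden_size
    return num_layers * per_layer + embeddings + lm_head + final_norm


total = llama_param_count(
    vocab_size=49152, hidden_size=768, num_layers=24, intermediate_size=3072
)
print(f"{total / 1e6:.1f}M parameters")  # 302.0M under these assumptions
```

Under these assumptions the estimate lands near 302 M, close to the reported ≈ 303 M. With tied embeddings it would drop to about 264 M, which suggests (but does not confirm) that the released checkpoint uses untied embeddings.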

	---

	## 📚 Training Curriculum (4 Stages, 29.8B tokens)

	| Stage | Tokens (B) | Dataset | Objective | Loss ↓ |
	| :------------------------- | :--------: | :--------------------------------------------------- | :------------------------------- | :----------: |
	| 1. Linguistic Base | 6.3 | FineWeb-Edu | General English grounding | 10.87 → 2.58 |
	| 2. Code Specialization | 7.5 | 60 % Nemotron Synthetic Code / 40 % StarCoderData | Code syntax & reasoning | 5.00 → 1.25 |
	| 3. Math & Knowledge | 10.0 | Nemotron CC-Math / FineWiki / OpenWebMath | Mathematical reasoning | 2.77 → 1.55 |
	| 4.1 SFT (EOS Fixed) | 6.0 | Nemotron SFT / OpenCodeInstruct / OpenMathInstruct-2 | Instruction-tuned code alignment | 1.73 → ~0.70 |

	> 🧩 Total ≈ 29.8 B tokens of curated curriculum learning.
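
The stage budgets can be tallied in a few lines to confirm the total and to see how the curriculum is weighted. The token counts below are copied from the curriculum table. The variable names are illustrative, and the tokens-per-parameter ratio is a derived figure that the card itself does not report.

```python
# Token counts per stage, in billions, copied from the curriculum table.
stage_tokens_b = {
    "1. Linguistic Base": 6.3,
    "2. Code Specialization": 7.5,
    "3. Math & Knowledge": 10.0,
    "4.1 SFT (EOS Fixed)": 6.0,
}

total_b = sum(stage_tokens_b.values())
params_b = 0.303  # 303M parameters, expressed in billions

# Print each stage's budget and its share of the whole curriculum.
for stage, tokens in stage_tokens_b.items():
    print(f"{stage:<24} {tokens:>5.1f}B  {tokens / total_b:6.1%}")

print(f"Total: {total_b:.1f}B tokens")
print(f"Tokens per parameter: {total_b / params_b:.0f}")
```

The stages sum to 29.8 B tokens, or roughly 98 training tokens per parameter across the whole curriculum.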

	---

	## 📊 Detailed Benchmarks (Stage 4.1 SFT)

	| Domain | Benchmark | Metric | Score |
	| :-------------- | :------------------- | :----------- | :-----------: |
	| Code | HumanEval (0-shot) | pass@1 | 27.4 % |
	| Code | MBPP (3-shot) | pass@1 | 31.0 % |
	| Math | GSM8k (0-shot) | exact match | 4.55 % |
	| Knowledge | Wikitext-2 | perplexity ↓ | 167.6 |
	| Reasoning | ARC (Easy/Challenge) | acc norm | 34.6 / 22.8 % |
	| Commonsense | HellaSwag | acc norm | 28.3 % |

	> The `humaneval` and `mbpp` scores come from a manual evaluation (`max_new_tokens=512`, `temperature=0.2`) rather than from `lm-eval`, which ran into truncation issues with the SFT prompt format.
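
For readers reproducing the pass@1 numbers, the metric can be sketched as follows. The card does not say how many samples were drawn per problem, so this uses the commonly used unbiased pass@k estimator introduced with HumanEval. The function names and the example results are illustrative only, not the author's evaluation script.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c pass the unit tests."""
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k=1):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical results for four problems, one sample each: two solved, two not.
example = [(1, 1), (1, 0), (1, 1), (1, 0)]
print(benchmark_pass_at_k(example, k=1))  # 0.5
```

With a single sample per problem, pass@1 reduces to the fraction of problems whose generated solution passes all tests.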

	---

	## ⚠️ Known Limitations

	1. Code-Specialized Model
	Tuned for Python and algorithmic reasoning. It performs poorly on general text, math, and commonsense tasks (see the benchmark table above).

	2. Short Context
	Trained on 1 024-token sequences only, so performance degrades on longer inputs.

	3. Tokenizer Bias
	Uses the `bigcode/starcoder` BPE vocabulary, which is optimized for code rather than prose.
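
One way to work within the short context is to trim the prompt so that prompt plus generated tokens stay inside the 1 024-token window. The helper below is a minimal sketch on plain lists of token ids. `fit_prompt_to_context` is a hypothetical name, and left-truncation is an assumption about what is acceptable for a given application, not something the card prescribes.

```python
def fit_prompt_to_context(token_ids, max_new_tokens, context_length=1024):
    """Left-truncate a prompt so prompt + generated tokens fit in the context window."""
    budget = context_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    # Keep the most recent tokens; earlier ones are dropped.
    return token_ids[-budget:]


long_prompt = list(range(2000))  # stand-in for real token ids
trimmed = fit_prompt_to_context(long_prompt, max_new_tokens=512)
print(len(trimmed))  # 512
```

Left-truncation keeps the end of the prompt, which is usually where the question sits, but it can cut off the leading `User:` marker. Shortening the input text before tokenizing avoids that.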

	---

	## 💻 Usage Example

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForCausalLM

	model_id = "Beebey/smallcoder-303m"
	device = "cuda" if torch.cuda.is_available() else "cpu"

	# Load the tokenizer and the bfloat16 weights, then move the model to the device.
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

	# The model expects the "User:" / "Assistant:" dialogue format.
	prompt = "User: Write a Python function to compute Fibonacci numbers.\nAssistant:"
	inputs = tokenizer(prompt, return_tensors="pt").to(device)

	# Generation stops at the EOS token or after 512 new tokens.
	with torch.no_grad():
	    outputs = model.generate(
	        **inputs,
	        max_new_tokens=512,
	        eos_token_id=tokenizer.eos_token_id,
	        pad_token_id=tokenizer.eos_token_id,
	    )

	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```

	💡 The model was trained using the `User:` / `Assistant:` dialogue format, so prompts should follow the same layout.
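
The dialogue format can be wrapped in two small helpers. This is a sketch based only on the single-turn layout shown in the usage example. `build_prompt` and `extract_reply` are hypothetical names, and the card does not document a multi-turn format, so none is assumed here.

```python
def build_prompt(instruction: str) -> str:
    """Wrap a single instruction in the User:/Assistant: format."""
    return f"User: {instruction.strip()}\nAssistant:"


def extract_reply(decoded: str) -> str:
    """Return the text after the first 'Assistant:' marker (empty if absent)."""
    _, _, reply = decoded.partition("Assistant:")
    return reply.strip()


prompt = build_prompt("Write a Python function to reverse a list.")
print(prompt)
```

`extract_reply` is meant for the decoded output of `model.generate`, which includes the prompt text ahead of the model's answer.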

	---

	## 🧾 Citation

	If you use SmallCoder (303M) in your research, please cite:

	```
	@misc{smallcoder303m,
	  title = {SmallCoder: A 303M-parameter Code LLM trained from scratch},
	  author = {Da Silva, Ilan},
	  year = {2025},
	  url = {https://huggingface.co/Beebey/smallcoder-303m},
	  note = {Trained with Google TPU Research Cloud (TRC) support}
	}
	```

	---

	## 🙏 Acknowledgements

	This model was trained with support from the Google TPU Research Cloud (TRC) program.
	Special thanks to the open datasets that enabled this work:
	FineWeb, StarCoderData, Nemotron, and OpenWebMath.

	---

	## 🧩 Summary

	| Category | Description |
	| ------------------- | --------------------------- |
	| Type | Code LLM (LLaMA-style) |
	| Parameters | 303 M |
	| Training tokens | ~29.8 B |
	| Specialty | Code generation & reasoning |
	| Context window | 1 024 tokens |
	| Tokenizer | `bigcode/starcoder` |
	| License | Apache 2.0 |
	| Hardware | TPU v4 (TRC Program) |

	---

	> 🔬 SmallCoder (303M) suggests that a carefully designed <500M model can approach the HumanEval scores of 1B-class models, which supports the case for efficient, compact, open models.
