---
language:
- ary
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- nanochat
- darija
- moroccan-arabic
- causal-lm
- custom-code
- preview
- test-run
---

# nanochat-darija-73m-base

Base NanoChat causal language model for Moroccan Darija.

This repo is exported in Hugging Face Transformers format with custom model code. Load it with `trust_remote_code=True`.

## Preview Checkpoint Notice

This is a **pilot/test checkpoint**, not the final full-data model. It was trained to validate the Darija data pipeline, tokenizer, NanoChat architecture export, and SFT workflow before a larger billion-plus-token training run.

The cleaned base corpus contains **5M Darija rows**, or approximately **2B tokens** with the included tokenizer. Those figures describe the available cleaned corpus; this checkpoint was intentionally trained on a much shorter schedule.

## Model Details

- Parameters: **73.5M** (73,531,538)
- Context length: `2048`
- Vocab size: `32768`
- Layers: `6`
- Hidden size: `384`
- Attention heads: `3`
- Checkpoint tag: `d6_target12`
- Checkpoint step: `1062`
- Export dtype: `bfloat16`
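
A few derived quantities follow from the numbers above. The snippet below is back-of-the-envelope arithmetic only. It assumes the hidden size is split evenly across attention heads and that bfloat16 takes 2 bytes per parameter, and it ignores activations, KV cache, and any non-parameter buffers.

```python
# Derived quantities from the figures listed above (illustrative arithmetic only).
N_PARAMS = 73_531_538      # parameter count from this card
HIDDEN_SIZE = 384
N_HEADS = 3
BYTES_PER_BF16 = 2         # bfloat16 stores each value in 16 bits

# Assumption: the hidden size is split evenly across attention heads.
head_dim = HIDDEN_SIZE // N_HEADS

# Approximate size of the weights alone in bfloat16 (no activations, no KV cache).
weights_mb = N_PARAMS * BYTES_PER_BF16 / 1e6

print(f"head_dim = {head_dim}")
print(f"bf16 weights ~ {weights_mb:.1f} MB")
```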

## Training

Pretrained on a cleaned, custom-built corpus of FineWeb-Edu translated into Moroccan Darija.

The instruction-tuned variant is small and experimental. It is useful for lightweight Darija chat tests, but it is not reliable for math, factuality, code debugging, translation fidelity, or safety-critical decisions.

## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Lyte/nanochat-darija-73m-base"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

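# Darija prompt, roughly "The Middle Ages are"; a base model simply continues the text.
inputs = tokenizer("العصور الوسطى هي", return_tensors="pt").to(model.device)

# Sample a continuation; the low temperature keeps the output conservative.
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    temperature=0.3,
    top_k=100,
    top_p=0.95,
    repetition_penalty=1.1,
    do_sample=True,
)

# Special tokens are kept in the decoded text so they stay visible.
print(tokenizer.decode(outputs[0], skip_special_tokens=False))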
```
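
To give a feel for what `temperature`, `top_k`, and `top_p` do in the call above, here is a simplified pure-Python sketch of the filtering applied to a single step's scores. It is an illustration under assumptions, not the Transformers implementation: `filter_probs` and `toy_logits` are hypothetical names, the scores are made up, and `repetition_penalty` is left out.

```python
import math

def filter_probs(logits, temperature=0.3, top_k=100, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature: dividing logits by a value < 1 sharpens the distribution.
    scaled = [x / temperature for x in logits]

    # Top-k: keep only the k highest-scoring candidates, best first.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # Softmax over the surviving candidates (shifted by the max for stability).
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for idx, p in zip(order, probs):
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalise so the kept probabilities sum to 1.
    mass = sum(p for _, p in kept)
    return {idx: p / mass for idx, p in kept}

toy_logits = [2.0, 1.5, 0.5, -1.0]   # made-up scores for a 4-token vocabulary
print(filter_probs(toy_logits))
```

With these toy scores, the default settings leave only the two highest-scoring tokens as candidates, while `temperature=1.0` leaves three. Lower temperatures concentrate probability on the top tokens, so the `top_p` cutoff is reached sooner.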

## Files

- `model.safetensors`: model weights
- `config.json`: NanoChat architecture config
- `generation_config.json`: default sampling config
- `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`: tokenizer files
- `configuration_nanochat.py`, `modeling_nanochat.py`: custom Transformers code
- `nanochat_export.json`: source checkpoint metadata
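
If you work from a local copy of the repo, a quick completeness check against the list above can catch a partial download. This is a minimal sketch: `missing_files` is a hypothetical helper and the directory path is a placeholder, not something the repo provides.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = [
    "model.safetensors",
    "config.json",
    "generation_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "configuration_nanochat.py",
    "modeling_nanochat.py",
    "nanochat_export.json",
]

def missing_files(repo_dir):
    """Return the expected file names that are absent from repo_dir."""
    root = Path(repo_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]

# Placeholder path; replace with wherever the repo was downloaded.
print(missing_files("./nanochat-darija-73m-base"))
```

An empty list means every file named in this section is present.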

## Limitations

This is a tiny model. Expect fluent-looking but wrong answers, repetition on some prompts, and brittle instruction following. Use it as a research artifact or local baseline, not as a production assistant.