---
language:
- ary
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- nanochat
- darija
- moroccan-arabic
- causal-lm
- custom-code
- preview
- test-run
---

# nanochat-darija-73m-base

Base NanoChat causal language model for Moroccan Darija.

This repo is exported in Hugging Face Transformers format with custom model code. Load it with `trust_remote_code=True`.

## Preview Checkpoint Notice

This is a **pilot/test checkpoint**, not the final full-data model. It was trained to validate the Darija data pipeline, tokenizer, NanoChat architecture export, and SFT workflow before a larger billion-plus-token training run.

The cleaned base corpus contains **5M Darija rows** and approximately **2B tokens** with the included tokenizer. That number describes the available cleaned corpus; this checkpoint was intentionally trained on a much smaller, shorter schedule.

## Model Details

- Parameters: **73.5M** (73,531,538)
- Context length: `2048`
- Vocab size: `32768`
- Layers: `6`
- Hidden size: `384`
- Attention heads: `3`
- Checkpoint tag: `d6_target12`
- Checkpoint step: `1062`
- Export dtype: `bfloat16`

## Training

Pretrained on the cleaned, custom-made Moroccan Darija FineWeb-Edu translation corpus.

The instruction-tuned variant is small and experimental. It is useful for lightweight Darija chat tests, but it is not reliable for math, factuality, code debugging, translation fidelity, or safety-critical decisions.

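As a rough sanity check on the figures listed under Model Details and in the Preview Checkpoint Notice, the sketch below derives a few quantities from them. It assumes the hidden size is split evenly across attention heads and that every parameter is stored in `bfloat16`; neither is confirmed by the card, and the variable names are illustrative only.

```python
# Figures copied from the Model Details list.
N_PARAMS = 73_531_538
HIDDEN_SIZE = 384
N_HEADS = 3
BYTES_PER_BF16 = 2  # bfloat16 stores each value in 16 bits

# Corpus figures from the Preview Checkpoint Notice (both approximate).
CORPUS_ROWS = 5_000_000
CORPUS_TOKENS = 2_000_000_000

# Assumption: hidden size is divided evenly across attention heads.
head_dim = HIDDEN_SIZE // N_HEADS

# Assumption: all weights are exported in bfloat16.
weight_bytes = N_PARAMS * BYTES_PER_BF16
weight_mib = weight_bytes / (1024 ** 2)

# Average row length implied by the approximate corpus totals.
avg_tokens_per_row = CORPUS_TOKENS / CORPUS_ROWS

print(f"head dim: {head_dim}")
print(f"bf16 weights: {weight_mib:.1f} MiB")
print(f"avg tokens per row: {avg_tokens_per_row:.0f}")
```

Under these assumptions the weights take roughly 140 MiB, so the checkpoint should fit comfortably on CPU or a small GPU.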
## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Lyte/nanochat-darija-73m-base"

# trust_remote_code=True is required because the repo ships custom model code.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # matches the export dtype
    device_map="auto",
)

# Base model: give it a text prefix to continue, not a chat-style instruction.
inputs = tokenizer("العصور الوسطى هي", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    temperature=0.3,
    top_k=100,
    top_p=0.95,
    repetition_penalty=1.1,
    do_sample=True,
)

# Special tokens are kept in the decoded output so they stay visible.
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```

## Files

- `model.safetensors`: model weights
- `config.json`: NanoChat architecture config
- `generation_config.json`: default sampling config
- `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`: tokenizer files
- `configuration_nanochat.py`, `modeling_nanochat.py`: custom Transformers code
- `nanochat_export.json`: source checkpoint metadata

## Limitations

This is a tiny model. Expect fluent-looking but wrong answers, repetition on some prompts, and brittle instruction following. Use it as a research artifact or local baseline, not as a production assistant.
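
One way to spot the repetition mentioned above is to measure how many n-grams in a generated string duplicate an earlier one. The helper below is a minimal sketch, not part of this repo: `repeated_ngram_fraction` is a hypothetical name, and it splits on whitespace rather than using the model's tokenizer, so treat the score as a rough diagnostic only.

```python
from collections import Counter


def repeated_ngram_fraction(text: str, n: int = 3) -> float:
    """Return the fraction of whitespace-token n-grams that repeat an earlier one."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)


# Toy strings standing in for decoded model output.
looping = "a b c a b c a b c"
varied = "a b c d e f g h i"

print(repeated_ngram_fraction(looping))  # high score: text loops
print(repeated_ngram_fraction(varied))   # 0.0: no repeated trigrams
```

A high score on decoded output suggests the prompt triggered a loop; raising `repetition_penalty` or `temperature` in `generate` may help, though any threshold would need tuning per use case.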