# Training and fine-tuning EDEN

This guide covers retraining EDEN from scratch, fine-tuning it on your own data, and converting a checkpoint for publishing.

## Install

```bash
pip install -r requirements.txt
```

## Where files live

All training artifacts are written under a workspace folder named `eden_system`, created next to where you run the commands. You can move the workspace by setting the `EDEN_HOME` environment variable:

```bash
export EDEN_HOME=/path/to/workspace
```

The layout is:

```
eden_system/
  data/                 prepared pairs, tokenizer, training config
  checkpoints/          default checkpoint folder
  training_sessions/    numbered training runs, each with its own checkpoints
  run/                  live metrics, logs, and run state
  exports/              exported artifacts
```

## Prepare the dataset

```bash
python -m eden.cli prepare
```

This downloads and combines the source corpora, generates synthetic noise pairs, trains the byte-level BPE tokenizer, and writes everything into `eden_system/data`.

## Train from scratch

```bash
python -m eden.cli train
```

Recipes control model size and memory use:

```bash
python -m eden.cli train --recipe survivor   # smallest, always runs
python -m eden.cli train --recipe m5-smart   # balanced default
python -m eden.cli train --recipe m5-large   # largest, matches this release
```

Start with `m5-smart`. Move to `m5-large` only after a smaller recipe trains without memory stops.

To resume:

```bash
python -m eden.cli train --resume eden_system/checkpoints/latest.pt
```

## Fine-tune on your own examples

Create a JSONL file of input and target pairs:

```jsonl
{"input": "bad rough text here", "target": "Polished text here."}
{"input": "another messy sentance", "target": "Another polished sentence."}
```

CSV and TSV files with `input` and `target` columns also work. Then run:

```bash
python -m eden.cli finetune --data my_pairs.jsonl --mix-base
```

`--mix-base` blends in the base dataset so the model learns your style without forgetting general spelling and grammar ability.
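
Before fine-tuning, it can help to confirm that every line of the JSONL file has the `input` and `target` fields described above. The following is a minimal sketch using only the Python standard library. `check_pairs` is a hypothetical helper, not part of EDEN, and EDEN's own loader may be stricter or more lenient than this check (for example, about blank lines).

```python
import json
from pathlib import Path


def check_pairs(path):
    """Return a list of (line_number, problem) tuples for a JSONL pairs file.

    Hypothetical helper for illustration; it is not part of the EDEN CLI.
    """
    problems = []
    with Path(path).open(encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # assumption: blank lines are ignored in this sketch
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((number, "not a JSON object"))
                continue
            # Both fields must be present and contain non-whitespace text.
            for key in ("input", "target"):
                value = record.get(key)
                if not isinstance(value, str) or not value.strip():
                    problems.append((number, f"missing or empty '{key}'"))
    return problems
```

Calling `check_pairs("my_pairs.jsonl")` returns an empty list when every non-blank line parses and carries both fields.
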
Use a low learning rate for fine-tuning, for example `--lr 0.00008`.

## Evaluate

```bash
python -m eden.cli eval --checkpoint eden_system/checkpoints/best.pt
```

## Convert a checkpoint for Hugging Face

Once you have a checkpoint you like, convert it into safetensors plus the configuration and tokenizer files:

```bash
python scripts/convert_checkpoint_to_hf.py \
  --checkpoint eden_system/checkpoints/best.pt \
  --tokenizer eden_system/data/tokenizer.json \
  --out .
```

Then upload:

```bash
python scripts/push_to_hub.py --repo-id Rybib/EDEN
```

## Memory safety

EDEN keeps PyTorch MPS inside a bounded memory budget and stops with a resumable checkpoint if memory use gets too high. A saved checkpoint is much better than a frozen machine. The cutoff is configurable through the training config and the recipe.

## The web dashboard

```bash
python -m eden.cli ui
# open http://127.0.0.1:7860
```

The dashboard can start, pause, resume, and monitor training, and run a finished checkpoint in the browser. It launches training as a separate process using `python -m eden.cli`, so make sure the `eden` package is importable from the folder you launch it in.
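
The importability requirement above can be checked ahead of time. The following is a minimal sketch using only the standard library. `eden_is_importable` is a hypothetical helper name, not something EDEN ships, and it only checks that the top-level package can be located, not that `eden.cli` runs correctly.

```python
import importlib.util


def eden_is_importable(package="eden"):
    """Return True if the top-level `package` can be located on sys.path.

    Hypothetical helper for illustration; it does not import the package.
    """
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    if eden_is_importable():
        print("eden is importable from this environment")
    else:
        print("eden was not found; launch from the project folder or install the package")
```

For the result to reflect what the dashboard will see, run the check from a Python session started in the folder where you plan to launch the dashboard, since the current directory is on the import path in that case.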