	---
	license: apache-2.0
	language:
	- en
	library_name: transformers
	tags:
	- text-enhancement
	- grammar-correction
	- text-rewriting
	- encoder-decoder
	- transformer
	- pytorch
	---

	# EDEN: Encoder Decoder Enhancement Network

	EDEN is a from-scratch PyTorch encoder-decoder Transformer that rewrites rough
	text into clean, polished text. It fixes spelling, grammar, punctuation, and
	phrasing while keeping the original meaning. The model was built and trained
	from the ground up (architecture, tokenizer, and training loop) and runs
	comfortably on a single machine, including Apple Silicon.

	This repository contains everything needed to use the model, retrain it, and
	extend it:

	* The trained model weights in safetensors format.
	* A Hugging Face Transformers integration (`AutoModel` with `trust_remote_code`).
	* The full training, fine-tuning, and evaluation engine.
	* A local web dashboard for training and trying the model in a browser.

	## Model summary

	| Property | Value |
	| --- | --- |
	| Architecture | Encoder-decoder Transformer with tied embeddings |
	| Parameters | About 107 million |
	| Encoder layers | 8 |
	| Decoder layers | 8 |
	| Hidden size | 640 |
	| Attention heads | 10 |
	| Feed-forward size | 2560 |
	| Vocabulary | 24,000 byte-level BPE tokens |
	| Max sequence length | 512 tokens |
	| Held-out validation loss | 0.123 (cross entropy) |
	| Precision | float32 |
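
As a rough sanity check on the table above, the parameter count can be estimated from the listed sizes. This sketch assumes a standard Transformer layout (four projection matrices per attention block, a two-matrix feed-forward block, cross-attention in every decoder layer) and ignores biases, layer norms, and any positional parameters. None of those details are confirmed by the table, so treat the result as an estimate rather than the exact count. The perplexity line additionally assumes the reported loss is a mean per-token value in nats.

```python
import math

# Values copied from the model summary table.
hidden = 640
ffn = 2560
vocab = 24_000
enc_layers = 8
dec_layers = 8

# Assumption: tied embeddings mean one shared vocab x hidden matrix.
embedding = vocab * hidden

# Assumption: each attention block has Q, K, V, and output projections.
attention = 4 * hidden * hidden
# Assumption: the feed-forward block is two dense matrices (up and down).
feed_forward = 2 * hidden * ffn

encoder = enc_layers * (attention + feed_forward)
# Assumption: decoder layers carry self-attention plus cross-attention.
decoder = dec_layers * (2 * attention + feed_forward)

total = embedding + encoder + decoder
print(f"estimated parameters: {total:,}")

# float32 stores each parameter in 4 bytes.
print(f"approximate weight size: {total * 4 / 1e6:.0f} MB")

# Assumption: 0.123 is a mean per-token loss in nats, so perplexity = exp(loss).
print(f"perplexity: {math.exp(0.123):.3f}")
```

Under these assumptions the estimate comes to 107,110,400 parameters, which is consistent with the "about 107 million" figure in the table.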

	## Quick start

	First install the two dependencies (one time):

	```bash
	pip3 install torch transformers
	```

	### Option 1: chat with EDEN in the terminal (recommended)

	This opens a simple interactive interface, similar to Ollama. Type or paste
	rough text, press Enter, and get the cleaned-up version. Type `/bye` or press
	Ctrl+D to quit.

	```bash
	python3 examples/try_eden.py
	```

	macOS users can also double-click `Try EDEN.command` to open the same interface
	in a terminal window.

	Example session:

	```text
	>>> their are alot of reasons why this dont work proper
	There are a lot of reasons why this do not work proper.

	>>> /bye
	Goodbye.
	```

	### Option 2: one terminal command

	Paste this whole line into your terminal to clean a single sentence:

	```bash
	python3 -c "from transformers import AutoModel, AutoTokenizer; t=AutoTokenizer.from_pretrained('Rybib/EDEN', trust_remote_code=True); m=AutoModel.from_pretrained('Rybib/EDEN', trust_remote_code=True).eval(); print(m.enhance(t, 'i relly wnt this to sound beter'))"
	```

	### Option 3: a Python script

	The lines below are Python, not terminal commands. Save them as a file such as
	`run.py`, then run `python3 run.py`. Do not paste them straight into the
	terminal.

	```python
	from transformers import AutoModel, AutoTokenizer

	model_id = "Rybib/EDEN"

	tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
	model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()

	rough = "i relly wnt this sentance to sound more profesional"
	print(model.enhance(tokenizer, rough))
	# I really want this sentence to sound more professional.
	```

	The `enhance` method handles long inputs by splitting them into sentence-aware
	chunks, rewriting each chunk, and joining the results.
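
The chunk-and-join behaviour described above can be sketched as follows. This is an illustration only, not the code inside `enhance`: the names `chunk_text` and `enhance_long` are hypothetical, the sentence splitter is a simple regular expression, and length is measured in whitespace-separated words as a stand-in for tokenizer tokens.

```python
import re
from typing import Callable, List


def chunk_text(
    text: str,
    max_units: int = 512,
    count: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Group whole sentences into chunks of at most `max_units` each.

    `count` is a stand-in length function (whitespace words by default);
    a real pipeline would count tokenizer tokens instead. A single sentence
    longer than `max_units` is kept whole in its own chunk.
    """
    # Split after ., !, or ? followed by whitespace, keeping the punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for sentence in sentences:
        size = count(sentence)
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and used + size > max_units:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += size
    if current:
        chunks.append(" ".join(current))
    return chunks


def enhance_long(text: str, rewrite: Callable[[str], str], max_units: int = 512) -> str:
    """Rewrite each chunk independently and join the results with spaces."""
    return " ".join(rewrite(chunk) for chunk in chunk_text(text, max_units))
```

In this sketch, `rewrite` could be something like `lambda chunk: model.enhance(tokenizer, chunk)`. Splitting only at sentence boundaries keeps each chunk self-contained, so the rewriter never sees half a sentence.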

	### Decoding options

	```python
	model.enhance(
	    tokenizer,
	    "their are alot of reasons why this dont work proper",
	    strategy="beam",  # "beam", "greedy", or "sample"
	    beam_size=4,
	    repetition_penalty=1.08,
	    length_penalty=0.7,
	)
	```
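
For intuition about the two penalties, here is a minimal sketch of how such options are commonly applied in sequence decoders. It follows widely used conventions (a CTRL-style repetition penalty and length-normalized beam scores); whether EDEN applies them in exactly this way is an assumption, and the function names are hypothetical.

```python
from typing import Dict, Iterable


def apply_repetition_penalty(
    logits: Dict[int, float], generated: Iterable[int], penalty: float = 1.08
) -> Dict[int, float]:
    """Make already-generated token ids less likely (CTRL-style rule)."""
    adjusted = dict(logits)
    for token_id in set(generated):
        if token_id in adjusted:
            value = adjusted[token_id]
            # Positive logits shrink; negative logits move further from zero.
            adjusted[token_id] = value / penalty if value > 0 else value * penalty
    return adjusted


def beam_score(sum_log_probs: float, length: int, length_penalty: float = 0.7) -> float:
    """Normalize a beam's total log-probability by length ** length_penalty."""
    return sum_log_probs / (length ** length_penalty)
```

Under this convention, a `length_penalty` of 0 would compare raw log-probability sums (favouring short outputs) and 1.0 would compare per-token averages, so 0.7 sits between the two. A `repetition_penalty` slightly above 1.0 nudges the decoder away from repeating tokens without forbidding repeats outright.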

	## What the model is good at

	EDEN was trained on rough-to-polished text pairs covering several editing skills:

	* Spelling and typo correction, including dyslexia-style letter swaps.
	* Grammar correction.
	* Punctuation and capitalization.
	* Clearer, more fluent rewriting and light paraphrasing.
	* Preserving the original meaning rather than inventing new content.

	It is an editing model, not a chatbot or a general text generator. Give it a
	sentence or paragraph to clean up, not a question or an instruction.

	## Training data

	The dataset is built from publicly available text-editing corpora plus generated
	noise, combined into pairs of rough input text and clean target text:

	| Source | Role |
	| --- | --- |
	| JFLEG | Grammar correction examples |
	| Grammarly CoEdIT | Correction and rewrite tasks |
	| W&I / LOCNESS | Learner-English correction |
	| ASSET | Sentence simplification |
	| WikiSplit | Sentence and paragraph flow |
	| MRPC | Meaning-preserving paraphrase pairs |
	| Synthetic noise | Generated typos, swaps, punctuation, and capitalization fixes |

	You can rebuild the dataset locally with the training engine described below.
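
To illustrate the synthetic-noise row, the sketch below corrupts a clean sentence to produce a rough-to-clean pair. It is a simplified stand-in rather than the generator used by the training engine: `add_noise`, `make_pair`, and the `swap_prob` default are hypothetical, and only three kinds of noise are shown (adjacent-letter swaps, lowercasing, and dropped final punctuation).

```python
import random
from typing import Tuple


def add_noise(clean: str, rng: random.Random, swap_prob: float = 0.15) -> str:
    """Return a rough version of `clean` with adjacent-letter swaps,
    lowercased text, and the final punctuation mark removed."""
    words = []
    for word in clean.split():
        # Swap two neighboring interior letters in some longer words.
        if len(word) > 3 and rng.random() < swap_prob:
            i = rng.randrange(1, len(word) - 2)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        words.append(word)
    return " ".join(words).lower().rstrip(".!?")


def make_pair(clean: str, rng: random.Random) -> Tuple[str, str]:
    """Build one (rough input, clean target) training pair."""
    return add_noise(clean, rng), clean


rng = random.Random(0)  # fixed seed so the noise is reproducible
rough, target = make_pair("I really want this sentence to sound professional.", rng)
print(rough)
print(target)
```

Because the clean sentence is kept unchanged as the target, any amount of noise can be generated from existing clean text without extra labelling.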

	## Retrain or extend the model

	This repository ships the complete training engine as an importable `eden`
	package and a command-line tool.

	```bash
	pip install -r requirements.txt

	# Build the dataset and tokenizer, then train from scratch.
	python -m eden.cli prepare
	python -m eden.cli train

	# Continue training on your own examples.
	python -m eden.cli finetune --data my_pairs.jsonl --mix-base

	# Enhance text from the command line.
	python -m eden.cli enhance "i relly wnt this to sound beter"
	```

	Your own fine-tuning data is a JSONL file of input and target pairs:

	```jsonl
	{"input": "bad rough text here", "target": "Polished text here."}
	{"input": "another messy sentance", "target": "Another polished sentence."}
	```

	Keeping `--mix-base` on is recommended so the model learns your style without
	forgetting general spelling and grammar ability.
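
If your pairs live in Python rather than in a file, a small helper can write and check the JSONL before fine-tuning. This is a sketch using only the standard library; `write_pairs` and `check_pairs` are hypothetical helpers, not part of the `eden` package, and the only format facts assumed are the `input` and `target` keys shown above.

```python
import json
from pathlib import Path
from typing import Iterable, List, Tuple


def write_pairs(path: Path, pairs: Iterable[Tuple[str, str]]) -> None:
    """Write (rough, polished) pairs as one JSON object per line."""
    with path.open("w", encoding="utf-8") as handle:
        for rough, polished in pairs:
            record = {"input": rough, "target": polished}
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def check_pairs(path: Path) -> List[str]:
    """Return a list of problems; an empty list means the file looks fine."""
    problems = []
    with path.open(encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as error:
                problems.append(f"line {number}: invalid JSON ({error.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {number}: expected a JSON object")
                continue
            for key in ("input", "target"):
                value = record.get(key)
                if not isinstance(value, str) or not value.strip():
                    problems.append(f"line {number}: missing or empty '{key}'")
    return problems
```

A typical use would be `write_pairs(Path("my_pairs.jsonl"), pairs)` followed by `check_pairs(Path("my_pairs.jsonl"))`, then passing the file to `finetune --data` once the list of problems is empty.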

	See [docs/TRAINING.md](docs/TRAINING.md) for the full workflow and
	[docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for how the model is built.

	## Local web dashboard

	```bash
	python -m eden.cli ui
	# then open http://127.0.0.1:7860
	```

	The dashboard starts, pauses, and resumes training; shows live loss and
	validation metrics; monitors memory use; and runs a finished checkpoint
	directly in the browser.

	## Fine-tuning with the Transformers Trainer

	The model also supports standard supervised training. `forward` accepts `labels`
	and returns a loss, so it works with the Hugging Face `Trainer` for users who
	prefer that workflow. Tokens that should be ignored in the loss use the index
	`-100`, and `decoder_input_ids` are shifted from `labels` automatically.
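
The relationship between `labels` and `decoder_input_ids` can be sketched with plain lists. This mirrors the usual encoder-decoder convention rather than quoting `modeling_eden.py`: the helper names are hypothetical, and the `start_id` and `pad_id` values in the usage note are placeholders, since the real special-token ids come from the model configuration.

```python
from typing import List

IGNORE_INDEX = -100  # label value skipped by the loss, as described above


def shift_right(labels: List[int], start_id: int, pad_id: int) -> List[int]:
    """Build decoder inputs by prepending `start_id` and dropping the last label.

    Ignored positions (-100) are not valid token ids, so they are replaced
    with `pad_id` before being fed to the decoder.
    """
    shifted = [start_id] + labels[:-1]
    return [pad_id if token == IGNORE_INDEX else token for token in shifted]


def mask_padding(token_ids: List[int], pad_id: int) -> List[int]:
    """Turn padding positions into -100 so they do not contribute to the loss."""
    return [IGNORE_INDEX if token == pad_id else token for token in token_ids]
```

For example, with placeholder ids `pad_id=0` and `start_id=1`, the padded target `[11, 12, 2, 0, 0]` becomes the labels `[11, 12, 2, -100, -100]`, and shifting those labels gives the decoder inputs `[1, 11, 12, 2, 0]`. At each position the decoder sees the previous target token and is trained to predict the current one.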

	## Files in this repository

	| File | Purpose |
	| --- | --- |
	| `model.safetensors` | Trained model weights |
	| `config.json` | Model configuration |
	| `configuration_eden.py` | Configuration class for Transformers |
	| `modeling_eden.py` | Model class for Transformers |
	| `tokenizer.json` | Byte-level BPE tokenizer |
	| `eden/` | Training, fine-tuning, and inference engine |
	| `scripts/` | Checkpoint conversion and upload helpers |
	| `examples/` | Runnable usage examples |
	| `docs/` | Architecture and training guides |

	## Limitations

	* English only.
	* Best on sentence-length and paragraph-length inputs, up to 512 tokens per chunk.
	* It can occasionally change wording more than intended. Beam search with the
	default penalties gives the most conservative edits.
	* It is not designed to answer questions, follow instructions, or generate new
	content from scratch.

	## License

	Released under the Apache License 2.0. See [LICENSE](LICENSE).

	## Citation

	```bibtex
	@software{eden_text_enhancement,
	  title = {EDEN: Encoder Decoder Enhancement Network},
	  author = {Dunn, Ryan},
	  year = {2026},
	  url = {https://huggingface.co/Rybib/EDEN}
	}
	```