# EDEN architecture

EDEN is a standard encoder-decoder Transformer trained from scratch for text enhancement. This document describes how the model is built.

## Overview

The model reads a rough source sentence and generates a polished target sentence. It uses a shared byte-level BPE vocabulary for both the input and the output, and the input embedding matrix is tied to the output projection.

```
rough text
    |
    v
[byte-level BPE tokenizer]
    |
    v
[embedding + sinusoidal positional encoding]
    |
    v
[Transformer encoder, 8 layers] -> memory
    |
    v
[Transformer decoder, 8 layers]  (attends to memory, causal self-attention)
    |
    v
[tied linear language-model head]
    |
    v
polished text
```

## Configuration

| Field | Value | Meaning |
| --- | --- | --- |
| `vocab_size` | 24000 | Byte-level BPE vocabulary size |
| `d_model` | 640 | Hidden size |
| `n_heads` | 10 | Attention heads per block |
| `n_layers` | 8 | Encoder layers, and decoder layers |
| `dim_feedforward` | 2560 | Feed-forward inner size |
| `dropout` | 0.1 | Dropout probability |
| `max_len` | 512 | Maximum positions |

## Key design choices

* **Tied embeddings.** The language-model head shares its weight matrix with the input embedding. This reduces parameters and tends to improve quality on vocabulary-heavy tasks.
* **Pre-norm blocks.** The encoder and decoder use `norm_first=True`, which makes deep Transformers more stable to train.
* **GELU activations** in the feed-forward blocks.
* **Sinusoidal positional encoding** stored as a buffer. In the Transformers integration this buffer is persistent so it is saved and restored correctly through safetensors and meta-device loading.
* **Padding-aware attention.** Padding tokens are masked in both the encoder and the decoder, and the decoder uses a causal mask for self-attention.

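
As a rough illustration of the padding-aware and causal masking described above, the sketch below builds both masks in plain Python. It assumes the PyTorch boolean-mask convention (`True` means "do not attend") and uses the `[PAD]` id of 1 from the special-token table. The helper names `padding_mask`, `causal_mask`, and `decoder_self_attention_mask` are hypothetical and are not taken from the EDEN code.

```python
PAD_ID = 1  # `[PAD]` id from the special-token table


def padding_mask(token_ids: list[int], pad_id: int = PAD_ID) -> list[bool]:
    """True at positions that hold padding and should be ignored by attention."""
    return [tok == pad_id for tok in token_ids]


def causal_mask(length: int) -> list[list[bool]]:
    """True at (query i, key j) when j > i, i.e. the key lies in the future."""
    return [[j > i for j in range(length)] for i in range(length)]


def decoder_self_attention_mask(
    token_ids: list[int], pad_id: int = PAD_ID
) -> list[list[bool]]:
    """Block a key if it is in the future OR if it is a padding token."""
    pad = padding_mask(token_ids, pad_id)
    causal = causal_mask(len(token_ids))
    n = len(token_ids)
    return [[causal[i][j] or pad[j] for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # [BOS] a b [EOS] [PAD] [PAD]; the ids 17 and 42 are made up for illustration.
    ids = [2, 17, 42, 3, 1, 1]
    for row in decoder_self_attention_mask(ids):
        print("".join("x" if blocked else "." for blocked in row))
```

In a PyTorch model these two masks are usually passed separately (a per-sequence key padding mask plus a square causal mask) rather than merged; the combined form here is only meant to make the effective attention pattern visible.
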
## Special tokens

| Token | Id | Role |
| --- | --- | --- |
| `[UNK]` | 0 | Unknown token |
| `[PAD]` | 1 | Padding |
| `[BOS]` | 2 | Beginning of sequence and decoder start |
| `[EOS]` | 3 | End of sequence |

## Generation

For inference the model supports three strategies:

* **Beam search** (default), with a length penalty and a repetition penalty. This gives the most conservative, reliable edits.
* **Greedy** decoding.
* **Sampling** with temperature, top-k, and top-p filtering.

Long inputs are split into sentence-aware chunks that each fit inside the 512 token window, rewritten independently, and joined back together.

## Two code paths, one architecture

The exact same layer structure is defined in two places:

* `eden/model.py` is the reference model used by the training engine.
* `modeling_eden.py` is the Hugging Face Transformers wrapper.

Because the module names and shapes match, a checkpoint trained with the engine loads into the Transformers model without any key remapping. The conversion script in `scripts/convert_checkpoint_to_hf.py` performs this step and writes the safetensors weights.
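
One way to sanity-check the "no key remapping" property is to compare parameter names and shapes from the two models before converting. The sketch below is a hypothetical helper, not part of the EDEN repository: `compare_checkpoints` and the example key names are assumptions made for illustration. It works on plain name-to-shape mappings so that it runs without any deep-learning library.

```python
from typing import Mapping

Shape = tuple[int, ...]


def compare_checkpoints(
    engine: Mapping[str, Shape], hf: Mapping[str, Shape]
) -> dict[str, list[str]]:
    """Report every difference that would prevent a remap-free load."""
    # Keys the Transformers model expects but the engine checkpoint lacks.
    missing = sorted(k for k in hf if k not in engine)
    # Keys the engine checkpoint has but the Transformers model does not know.
    unexpected = sorted(k for k in engine if k not in hf)
    # Keys present on both sides whose tensor shapes disagree.
    mismatched = sorted(
        k for k in engine if k in hf and tuple(engine[k]) != tuple(hf[k])
    )
    return {
        "missing": missing,
        "unexpected": unexpected,
        "shape_mismatch": mismatched,
    }


if __name__ == "__main__":
    # Hypothetical key names; the shapes follow vocab_size=24000, d_model=640.
    # With PyTorch, such a mapping could be built from a state dict as
    # {name: tuple(tensor.shape) for name, tensor in model.state_dict().items()}.
    engine_shapes = {"embedding.weight": (24000, 640), "lm_head.weight": (24000, 640)}
    hf_shapes = {"embedding.weight": (24000, 640), "lm_head.weight": (24000, 640)}
    print(compare_checkpoints(engine_shapes, hf_shapes))
```

If all three lists come back empty, the two sides agree on names and shapes, which is the condition the paragraph above relies on.
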