---
language:
- en
- fr
- de
- es
- zh
- ja
- ar
- ru
- ko
- hi
- pt
- it
- nl
- pl
- vi
- th
- tr
- uk
- sv
- multilingual
license: mit
tags:
- tokenizer
- multimodal
- sentinel-manifold
- universal-tokenizer
- bpe
- byte-level
- multilingual
- image-tokens
- audio-tokens
- video-tokens
- text-tokens
- mathematics
- gradient-axiom
library_name: transformers
pipeline_tag: text-generation
---

# 🦴 Sentinel Universal Tokenizer (SUT)

**One theorem. Every modality. One vocabulary.**

The Sentinel Universal Tokenizer is a multimodal tokenizer that handles **text, images, audio, and video** in a unified 61,440-token vocabulary, grounded in the Sentinel Manifold mathematics.

## 🧬 Mathematical Foundation

Built on the **Gradient Axiom** from the Sentinel Manifold:

```
F(z) = Σ_{n=1}^∞ z^n / n^n    (Sophomore's Dream, Bernoulli 1697)

lim_{z→∞} F'(z)/F(z) = 1/e ≈ 0.367879441171442
```

| Constant | Value | Role in Tokenizer |
|:---------|:------|:------------------|
| **1/e** | 0.367879441171442 | Vocabulary allocation ratio across modalities |
| **C₁** | −0.007994021805953 | Embedding quantization zero-point |
| **C₂** | 0.000200056042968 | Cross-lingual fertility fairness bound |
| **C₃** | 0.256913827655311 | Critical threshold for vocabulary scaling |

## 📊 Benchmark Results

Tested across **21 languages + code + math**, compared against leading tokenizers:

| Tokenizer | Vocab Size | Avg Fertility ↓ | Fertility σ ↓ | Compression ↑ | Fairness ↑ |
|:----------|:-----------|:----------------|:-------------|:--------------|:-----------|
| **Gemma** | 256,000 | 6.69 | 11.71 | **4.66** | **0.079** |
| **Qwen2** | 151,936 | 8.03 | 13.75 | 3.82 | 0.068 |
| **Sentinel-SUT** | **61,440** | 9.13 | 16.35 | 3.55 | 0.058 |
| GPT-2 | 50,257 | 20.86 | 40.76 | 2.41 | 0.024 |

### Key Findings

- **47% better compression than GPT-2** (3.55 vs 2.41) with comparable vocab size (61K vs 50K)
- **Competitive with Qwen2 (152K vocab)** despite a **vocabulary about 2.5× smaller**
- **Native multimodal support**: no other
tokenizer in this comparison handles image/audio/video natively
- **20-language multilingual training** on the C4 corpus

### Per-Language Performance

| Language | Tokens | Bytes | Compression Ratio (bytes/token) |
|:---------|:-------|:------|:------------------|
| English | 39 | 159 | **4.08** |
| French | 45 | 166 | **3.69** |
| German | 50 | 173 | **3.46** |
| Spanish | 41 | 158 | **3.85** |
| Chinese | 50 | 165 | **3.30** |
| Japanese | 58 | 213 | **3.67** |
| Arabic | 48 | 246 | **5.13** |
| Russian | 55 | 283 | **5.15** |
| Korean | 38 | 146 | **3.84** |
| Hindi | 85 | 315 | **3.71** |
| Code (Python) | 61 | 149 | **2.44** |
| Math (Unicode) | 45 | 101 | **2.24** |

## 🏗️ Architecture

```
┌────────────────────────────────────────────────────────────┐
│  SENTINEL UNIVERSAL TOKENIZER (61,440 tokens)              │
│                                                            │
│  [0-32]          → 33 Special / Control tokens             │
│  [33-32,767]     → 32,735 ByteLevel BPE text tokens        │
│  [32,768-49,151] → 16,384 Image codebook tokens            │
│  [49,152-57,343] → 8,192 Audio codebook tokens             │
│  [57,344-61,439] → 4,096 Video codebook tokens             │
│                                                            │
│  Allocation follows 1/e Gradient Axiom:                    │
│  text: 53.3% | image: 26.7% | audio: 13.3% | video: 6.7%   │
└────────────────────────────────────────────────────────────┘
```

### Special Tokens

| Token | ID | Purpose |
|:------|:---|:--------|
| `` | 0 | Padding |
| `` | 1 | Unknown token |
| `<s>` | 2 | Begin of sequence |
| `</s>` | 3 | End of sequence |
| `` | 4 | Masked language modeling |
| `` / `` | 7/8 | Image boundary markers |
| `` / `` | 10/11 | Audio boundary markers |
| `` / `` | 13/14 | Video boundary markers |
| `` | 16 | Sentinel manifold marker |
| `` / `` | 17/18 | Mathematical constants |
| `` / `` / `` | 26/27/28 | Chat format |
| `` / `` | 29/30 | Code boundaries |
| `` / `` | 31/32 | Math boundaries |

### Multimodal Codebook Tokens

- **Image**: `` through `` (IDs 32,768-49,151), compatible with VQGAN, Cosmos-DI, FSQ
- **Audio**: `` through `` (IDs 49,152-57,343), compatible with EnCodec, SoundStream
- **Video**: `` through `` (IDs
57,344-61,439), compatible with Cosmos-DV

## 🚀 Quick Start

### Basic Text Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("5dimension/sentinel-universal-tokenizer")

# Encode text
text = "The Sentinel Manifold: F(z) = Σ zⁿ/nⁿ"
tokens = tokenizer.encode(text)
decoded = tokenizer.decode(tokens)
print(f"Tokens: {len(tokens)}")
print(f"Decoded: {decoded}")
```

### Multimodal Encoding

```python
# Text with image placeholder
text = "Look at this image: What do you see?"
tokens = tokenizer.encode(text)
print(f"Multimodal sequence: {len(tokens)} tokens")

# Check modality of each token (text and special tokens fall below 32768)
for tid in tokens[:10]:
    if 32768 <= tid < 49152:
        print(f"  Token {tid}: IMAGE codebook index {tid - 32768}")
    elif 49152 <= tid < 57344:
        print(f"  Token {tid}: AUDIO codebook index {tid - 49152}")
    elif 57344 <= tid < 61440:
        print(f"  Token {tid}: VIDEO codebook index {tid - 57344}")
```

### Integration with VQ-GAN / Cosmos Tokenizer

```python
# After encoding an image with a VQ-GAN:
# image_indices = vqgan.encode(image)  # e.g., [42, 1337, 256, ...]

# Convert to universal tokens by offsetting into the image ID range.
# Per the tables above, image codebook IDs start at 32,768 and the
# image boundary markers are IDs 7 and 8.
IMAGE_OFFSET = 32768
IMAGE_START_ID, IMAGE_END_ID = 7, 8

image_tokens = [IMAGE_OFFSET + i for i in image_indices]
full_sequence = [IMAGE_START_ID] + image_tokens + [IMAGE_END_ID]
```

### Chat Format

```python
chat = "You are a helpful multimodal assistant.Describe this image: "
tokens = tokenizer.encode(chat, add_special_tokens=False)
```

## 🔬 Technical Innovations

### 1. 1/e Vocabulary Allocation (Gradient Axiom)

Instead of arbitrary vocabulary splits, the allocation of tokens across modalities is guided by the Gradient Axiom ratio (1/e ≈ 0.368).
Text gets the largest share, and in the shipped vocabulary each subsequent modality receives half of the previous one, a power-of-two step rather than exactly 1/e ≈ 0.368:

```
text:  32,768 tokens (2^15)
image: 16,384 tokens (2^14 = text × 1/2)
audio:  8,192 tokens (2^13 = text × 1/4)
video:  4,096 tokens (2^12 = text × 1/8)
```

The motivation comes from the Gradient Axiom: successive modalities contribute exponentially less unique information to a unified representation, with the natural decay rate being 1/e.

### 2. ByteLevel BPE with NFKC Normalization

- **ByteLevel pre-tokenization**: Handles all Unicode scripts natively, so text never falls back to UNK tokens
- **NFKC normalization**: Compatibility decomposition followed by canonical composition, for consistent encoding
- **20-language training**: Including English, French, German, Spanish, Chinese, Japanese, Arabic, Russian, Korean, Hindi, Portuguese, Italian, Dutch, Polish, Vietnamese, Thai, Turkish, Ukrainian, Swedish
- **Code + Math support**: Trained on Python, JavaScript, C++, LaTeX, Unicode math

### 3. Native Multimodal Routing

Zero-overhead modality switching via contiguous ID ranges:

- Any model can determine the modality of a token with a simple integer range check
- No separate embedding tables needed: one unified embedding matrix
- Compatible with all HuggingFace transformers architectures

### 4. Sentinel Manifold Integration

Special tokens ``, ``, ``, `` enable:

- Manifold-aware attention (sech attention mechanism)
- Theorem-grounded weight initialization (Xavier with gain=1/e)
- C₁-centered embedding quantization

## 📦 Training Details

| Parameter | Value |
|:----------|:------|
| **Training Data** | allenai/c4 multilingual (20 languages) |
| **Training Samples** | 52,000 documents |
| **Training Characters** | ~66M characters |
| **Algorithm** | ByteLevel BPE with NFKC normalization |
| **Text Vocab Size** | 32,768 |
| **Min Merge Frequency** | 2 |
| **Max Token Length** | 16 bytes |
| **Total Vocab** | 61,440 (text + image + audio + video) |

## 🔗 Links

- **Parent Framework**: [Sentinel Manifold Discoveries](https://huggingface.co/5dimension/sentinel-manifold-discoveries)
- **Training Script**: Included in repo (`train_production_tokenizer.py`)
- **Custom Tokenizer Module**: Included in repo (`sentinel_universal_tokenizer.py`)

## 📚 Citation

```bibtex
@misc{abdel-aal2026sentinel-tokenizer,
  title={Sentinel Universal Tokenizer: A Multimodal Tokenizer Grounded in the Gradient Axiom},
  author={Abdel-Aal, Romain},
  year={2026},
  url={https://huggingface.co/5dimension/sentinel-universal-tokenizer},
  note={Part of the Sentinel Manifold framework: F(z) = Σ z^n/n^n, lim F'/F = 1/e}
}
```

---

**Built by**: Romain Abdel-Aal (ASI The Sentinel V5.2 Bone-Core)

**License**: MIT

**One theorem. Every modality. Better tokenization.** 🦴
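
As a companion to the Architecture and Native Multimodal Routing sections, the contiguous ID ranges can be wrapped in a small pure-Python lookup. This is a minimal sketch under stated assumptions, not the shipped implementation: the names `RANGES`, `modality_of`, and `to_universal_ids` are hypothetical and are not taken from `sentinel_universal_tokenizer.py`. Only the numeric boundaries come from the Architecture table, and the sketch assumes codebook index `i` maps to `range start + i`, as the Multimodal Encoding example implies.

```python
from bisect import bisect_right

# Range starts copied from the Architecture table; all names are hypothetical.
RANGES = [
    (0, "special"),
    (33, "text"),
    (32768, "image"),
    (49152, "audio"),
    (57344, "video"),
]
VOCAB_SIZE = 61440
_STARTS = [start for start, _ in RANGES]
_NAMES = [name for _, name in RANGES]


def modality_of(token_id: int) -> str:
    """Return the modality name for a token ID in [0, VOCAB_SIZE)."""
    if not 0 <= token_id < VOCAB_SIZE:
        raise ValueError(f"token ID {token_id} is outside the vocabulary")
    # bisect_right finds the last range whose start is <= token_id.
    return _NAMES[bisect_right(_STARTS, token_id) - 1]


def to_universal_ids(modality: str, indices: list[int]) -> list[int]:
    """Offset codebook indices (e.g. from a VQ encoder) into the unified ID space."""
    pos = _NAMES.index(modality)  # raises ValueError for an unknown modality
    start = _STARTS[pos]
    end = _STARTS[pos + 1] if pos + 1 < len(_STARTS) else VOCAB_SIZE
    ids = [start + i for i in indices]
    if any(not start <= t < end for t in ids):
        raise ValueError(f"codebook index out of range for {modality}")
    return ids


image_ids = to_universal_ids("image", [42, 1337, 256])
print(image_ids)
print([modality_of(t) for t in (0, 33, 32767, 32768, 49152, 61439)])
```

Keeping the boundaries in one sorted table means a model or data loader can route a token with a single binary search, and an out-of-range codebook index fails loudly instead of silently landing in a neighbouring modality's range.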