# DAT Building: One-Time vs Every-Time - Detailed Explanation

## Overview

**DAT (Double-Array Trie) Building** is the process of converting a text-based vocabulary (JSON/list) into an optimized binary format that enables ultra-fast tokenization.

---

## The Building Process

### What Happens During DAT Building?

1. **Trie Construction** (Step 1)
   - Converts each vocabulary token into a tree structure
   - Each character/byte becomes a node in the tree
   - Common prefixes share the same path (e.g., "apple" and "apply" share "appl")

2. **Array Packing** (Step 2 - The Expensive Part)
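   - Uses a "First-Fit" algorithm: for each trie node, it takes the first base offset at which all of the node's child slots are still free in the integer arrays
   - Compresses the tree into 3 parallel arrays: `base`, `check`, `values`
   - **This is computationally expensive**: the trie holds up to n×m nodes (n=vocab_size, m=avg_token_length), and each node may probe many candidate offsets before one fits, so build time grows faster than linearly with vocabulary size (see the table below)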
   
3. **Binary Serialization** (Step 3)
   - Writes the arrays to a `.dat` binary file
   - Format: `[MAGIC|VERSION|SIZE|BASE_ARRAY|CHECK_ARRAY|VALUES_ARRAY]`
   - Enables memory-mapping for instant zero-copy loading
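
Steps 1 and 2 can be sketched in pure Python. This is a minimal illustration of first-fit double-array packing, not the code in `DATBuilder`. The function names (`build_trie`, `pack_first_fit`, `lookup`) and the conventions that free slots hold `-1` in `check` and that `values` holds a token id or `-1` are assumptions made for the example.

```python
from collections import deque

FREE = -1  # assumed marker for an unused slot


def build_trie(tokens):
    """Step 1: nested-dict trie over UTF-8 bytes; the token id is stored under key None."""
    root = {}
    for token_id, token in enumerate(tokens):
        node = root
        for byte in token.encode("utf-8"):
            node = node.setdefault(byte, {})
        node[None] = token_id
    return root


def pack_first_fit(root):
    """Step 2: pack the trie into base/check/values with first-fit placement."""
    # Slot 0 is the root state. Bases start at 1, so no child ever lands on slot 0.
    base, check, values = [0], [FREE], [root.get(None, -1)]

    def ensure(size):
        while len(check) < size:
            base.append(0)
            check.append(FREE)
            values.append(-1)

    queue = deque([(root, 0)])
    while queue:
        node, state = queue.popleft()
        labels = sorted(k for k in node if k is not None)
        if not labels:
            continue  # leaf: nothing to place
        b = 1
        while True:  # first fit: smallest offset where every child slot is free
            ensure(b + labels[-1] + 1)
            if all(check[b + c] == FREE for c in labels):
                break
            b += 1
        base[state] = b
        for c in labels:
            t = b + c
            check[t] = state  # slot t is owned by its parent state
            values[t] = node[c].get(None, -1)
            queue.append((node[c], t))
    return base, check, values


def lookup(arrays, token):
    """Follow transitions t = base[s] + byte, valid only when check[t] == s."""
    base, check, values = arrays
    state = 0
    for byte in token.encode("utf-8"):
        t = base[state] + byte
        if t >= len(check) or check[t] != state:
            return -1
        state = t
    return values[state]
```

The inner `while True` loop is where the cost concentrates: every node may scan past many occupied offsets before it finds one that fits, which is why packing dominates build time.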

### Performance Cost

| Vocabulary Size | Build Time | DAT File Size |
|-----------------|------------|---------------|
| 367 tokens | ~38ms | 5 KB |
| 5,000 tokens | ~26s | 143 KB |
| 50,000 tokens | ~5-10min | ~1.5 MB |

---

## One-Time vs Every-Time

### ✅ CORRECT APPROACH: One-Time Build + Cache

**Build Once:**
- Run `compile_profiles.py` during:
  - Package development
  - First-time user setup
  - CI/CD pipeline

**Cache Forever:**
- Save `.dat` files to: `~/.cache/xerv/crayon/profiles/`
- OR distribute pre-built `.dat` files with the package
- Users never rebuild unless vocabulary changes
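
One way to make "rebuild only when the vocabulary changes" automatic is to key the cached file name on a hash of the vocabulary contents. The sketch below is an assumption about how this could be done, not a description of the current code: `vocab_fingerprint`, `cached_dat_path`, `get_or_build`, and the `build_fn(tokens, path)` callback are hypothetical names.

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "xerv" / "crayon" / "profiles"


def vocab_fingerprint(tokens):
    """Short, stable hash of the vocabulary contents (order-sensitive)."""
    payload = json.dumps(list(tokens), ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def cached_dat_path(profile, tokens, cache_dir=CACHE_DIR):
    """File name changes whenever the vocabulary changes."""
    return Path(cache_dir) / f"{profile}-{vocab_fingerprint(tokens)}.dat"


def get_or_build(profile, tokens, build_fn, cache_dir=CACHE_DIR):
    """Return the cached .dat path; call build_fn(tokens, path) only on a cache miss."""
    path = cached_dat_path(profile, tokens, cache_dir)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        build_fn(tokens, path)
    return path
```

With this scheme a changed vocabulary simply misses the cache and triggers one rebuild, while an unchanged one never does.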

**Runtime:**
```python
# This should be INSTANT (just mmap)
vocab = CrayonVocab.load_profile("code")  # <1ms to load .dat
tokens = vocab.tokenize(text)              # 10M+ tokens/sec
```
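
To show why loading can be near-instant, here is a minimal sketch of writing and memory-mapping a file in the `[MAGIC|VERSION|SIZE|BASE_ARRAY|CHECK_ARRAY|VALUES_ARRAY]` layout. The field widths are not specified above, so the 4-byte magic `b"DAT0"`, the 32-bit little-endian integers, and the helper names `write_dat` and `open_dat` are all assumptions for illustration.

```python
import mmap
import struct

MAGIC = b"DAT0"                  # hypothetical magic bytes
VERSION = 1                      # hypothetical format version
HEADER = struct.Struct("<4sII")  # magic, version, array length (12 bytes)


def write_dat(path, base, check, values):
    """Write the header followed by three equal-length int32 arrays."""
    n = len(base)
    if len(check) != n or len(values) != n:
        raise ValueError("base, check and values must have the same length")
    with open(path, "wb") as f:
        f.write(HEADER.pack(MAGIC, VERSION, n))
        for arr in (base, check, values):
            f.write(struct.pack(f"<{n}i", *arr))


def open_dat(path):
    """Map the file and return (mmap, base, check, values) as zero-copy views."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    magic, version, n = HEADER.unpack_from(mm, 0)
    if magic != MAGIC or version != VERSION:
        mm.close()
        raise ValueError("not a compatible .dat file")
    # cast("i") reads native-endian int32, so this assumes a little-endian host.
    body = memoryview(mm)[HEADER.size:].cast("i")
    return mm, body[:n], body[n:2 * n], body[2 * n:3 * n]
```

Nothing is parsed or copied in `open_dat`: the three views point straight into the mapped pages, so load time does not depend on vocabulary size in the way build time does.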

### ❌ INCORRECT APPROACH: Build Every Time

```python
# BAD: Rebuilding the DAT from JSON on every import
builder = DATBuilder()
builder.build(vocab)  # `vocab` is the token list parsed from JSON; takes ~26 seconds for a 5k vocab!
```

At roughly 26 seconds per import for a 5,000-token vocabulary, this would make the library unusable.

---

## Current Implementation Status

### What Works ✅

1. **DATBuilder** (`src/crayon/c_ext/dat_builder.py`)
   - ✅ Compiles vocab to DAT format
   - ✅ Saves binary files

2. **CrayonVocab.load_profile()** (`src/crayon/core/vocabulary.py`)
   - ✅ Checks for cached `.dat` file first
   - ✅ Falls back to `.json` if `.dat` not found
   - ✅ Calls `build_and_cache_profile()` if neither exists

3. **C++ Engine** (`src/crayon/c_ext/engine.cpp`)
   - ✅ Memory-maps `.dat` files via Python buffer protocol
   - ✅ Zero-copy instant loading (<1ms)
   - ✅ AVX2 SIMD tokenization (10M+ tok/sec)
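
The lookup order described for `CrayonVocab.load_profile()` can be sketched as a small resolver. This is an illustration of the three-step fallback only. The function name `resolve_profile`, the `<name>.dat` / `<name>.json` file naming, and the `build_and_cache` callback signature are assumptions, not the actual implementation.

```python
from pathlib import Path


def resolve_profile(name, cache_dir, build_and_cache):
    """Return (source, path) where source is "dat", "json" or "built"."""
    cache_dir = Path(cache_dir)
    dat_path = cache_dir / f"{name}.dat"
    json_path = cache_dir / f"{name}.json"
    if dat_path.exists():
        return "dat", dat_path    # fast path: ready to memory-map
    if json_path.exists():
        return "json", json_path  # slow path: vocabulary still has to be compiled
    # Nothing cached yet: build once, then later calls hit the fast path.
    return "built", Path(build_and_cache(name))
```

Only the first branch gives the sub-millisecond load. The other two pay the build cost, which is why shipping or caching the `.dat` file matters.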

### What's Missing ⚠️

1. **Pre-built .dat files not distributed**
   - Currently, `.dat` files must be built manually via `compile_profiles.py`
   - Should be included in package or built during `pip install`

2. **Vocabulary files not in cache**
   - `trained_vocab_*.json` files exist in project root
   - Not automatically copied to `~/.cache/xerv/crayon/profiles/`
   - `build_and_cache_profile()` should handle this

3. **`decode()` method missing**
   - README examples show `vocab.decode(tokens)`
   - Method doesn't exist in `CrayonVocab` class
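
As a starting point, a `decode()` could be as small as the sketch below. It assumes that token ids are indices into the vocabulary list and that concatenating the token strings reproduces the text; neither assumption is confirmed by the current code, and the standalone function and its `id_to_token` parameter are hypothetical.

```python
def decode(token_ids, id_to_token):
    """Map token ids back to their strings and concatenate them."""
    pieces = []
    for token_id in token_ids:
        if not 0 <= token_id < len(id_to_token):
            raise ValueError(f"unknown token id: {token_id}")
        pieces.append(id_to_token[token_id])
    return "".join(pieces)
```

If the vocabulary turns out to be byte-level rather than string-level, the pieces would need to be joined as bytes and decoded once at the end.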

---

## Recommended Workflow

### For Package Developers:

```bash
# 1. Train vocabularies (already done - trained_vocab_*.json exist)
python train_vocab.py

# 2. Compile to DAT format
python compile_profiles.py

# 3. Distribute .dat files with package
# - Include in MANIFEST.in
# - Copy to package installation directory
```

### For End Users:

```python
# Should just work (instant load from cached .dat)
from crayon import CrayonVocab
vocab = CrayonVocab.load_profile("code")  # <1ms
```

---

## Summary

| Aspect | Answer |
|--------|--------|
| **One-time or Every-time?** | **ONE-TIME** per vocabulary version |
| **Who builds?** | Developer OR first-time user setup |
| **Build frequency?** | Only when vocabulary changes |
| **Runtime cost?** | **<1ms** (just mmap, no rebuild) |
| **User experience?** | Instant, zero compilation delay |

**The DAT file is like a compiled binary**: you compile your source code once, then distribute or cache the binary for instant execution.