indietyp committed on
Commit 0991c68 · verified · 1 Parent(s): 15a0d5e

initial upload

Files changed (9)
  1. README.md +132 -0
  2. history.train.jsonl +35 -0
  3. inference.py +528 -0
  4. manifest.onnx.json +35 -0
  5. manifest.train.json +45 -0
  6. model.json +0 -0
  7. model.onnx +3 -0
  8. model.pt +3 -0
  9. tokenizer.json +0 -0
README.md ADDED
@@ -0,0 +1,132 @@
1
+ ---
2
+ license: mit
3
+ language:
4
+ - en
5
+ tags:
6
+ - slug-generation
7
+ - onnx
8
+ - embedding-to-text
9
+ - url-slug
10
+ - beam-search
11
+ library_name: onnxruntime
12
+ pipeline_tag: text2text-generation
13
+ ---
14
+
15
+ # vec2slug-v1-large
16
+
17
+ Generate URL slugs directly from text embeddings, without re-feeding
18
+ source text through a language model.
19
+
20
+ | | |
21
+ |---|---|
22
+ | **Parameters** | 24.8M |
23
+ | **Architecture** | Transformer decoder, 6L, d=512 |
24
+ | **Input** | OpenAI `text-embedding-3-small` (1536d) |
25
+ | **Vocab** | BPE, 5000 subwords |
26
+ | **Token F1** | 0.306 |
27
+ | **ONNX size** | 95.1 MiB |
28
+ | **Inference (CPU)** | ~66ms (M-series), ~258ms (budget VPS) |
29
+
30
+ This is the **larger** of two variants. It achieves the best Token F1 but at 2.2x the inference cost of the smaller model.
31
+
32
+ See also: [Vec2Slug V1-Small](https://huggingface.co/hashintel/vec2slug-v1-small)
33
+
34
+ ## Quickstart
35
+
36
+ ```bash
37
+ # install dependencies
38
+ pip install onnxruntime numpy
39
+
40
+ # or run directly with uv
41
+ uv run inference.py . --input embeddings.npy
42
+ ```
43
+
44
+ ```python
45
+ from inference import OnnxPredictor
46
+ import numpy as np
47
+
48
+ predictor = OnnxPredictor.from_dir(".")
49
+
50
+ embeddings = np.load("embeddings.npy").astype(np.float32)  # [N, 1536], from OpenAI text-embedding-3-small
51
+ slugs = predictor.predict(embeddings)
52
+ # ["how-neural-networks-learn", "climate-change-solutions", ...]
53
+ ```
54
+
55
+ PyTorch inference (requires `torch`):
56
+
57
+ ```python
58
+ from inference import PyTorchPredictor
59
+
60
+ predictor = PyTorchPredictor.from_dir(".")
61
+ slugs = predictor.predict(embeddings)
62
+ ```
63
+
64
+ ## How it works
65
+
66
+ The model is a prefix-conditioned transformer decoder. A precomputed text
67
+ embedding is linearly projected into the decoder's hidden space and placed
68
+ at position 0 as a prefix token. The decoder then autoregressively generates
69
+ BPE subword tokens that form a kebab-case URL slug.
70
+
71
+ Beam search uses bounded additive length reward with score-based optimal
72
+ stopping ([Huang et al. 2017](https://arxiv.org/abs/1702.02429)). All
73
+ decoding parameters are stored in `model.json`.
74
+
75
+ ## Files
76
+
77
+ | File | Description |
78
+ |---|---|
79
+ | `model.onnx` | ONNX model (forward pass only) |
80
+ | `model.json` | Sidecar: vocabulary, beam search config, stopwords |
81
+ | `model.pt` | PyTorch weights (`state_dict`) |
82
+ | `tokenizer.json` | BPE tokenizer (HuggingFace `tokenizers` format) |
83
+ | `inference.py` | Standalone inference script (`uv run` compatible) |
84
+ | `manifest.train.json` | Training configuration and results |
85
+ | `manifest.onnx.json` | Export verification (tolerance, argmax agreement) |
86
+ | `history.train.jsonl` | Training loss/metric curves |
87
+
88
+ ## Training
89
+
90
+ Trained on 2.3M documents from FineWeb-Edu with slugs extracted
91
+ from source URLs. The extraction pipeline filters on language, slug format,
92
+ Gopher repetition, and token count.
93
+
94
+ BPE vocabulary (5,000 subwords) with `-` as a special token. Trained for 36 epochs with label smoothing (0.1) and position-aware EOS loss weighting. Best checkpoint at step 64,000 (the `best_step` recorded in `manifest.train.json`).
95
+
96
+ ## Evaluation
97
+
98
+ Evaluated on 5,000 held-out test samples using the full beam search
99
+ decoding pipeline.
100
+
101
+ | Metric | Value |
102
+ |---|---|
103
+ | Token F1 (macro) | 0.306 |
104
+ | Exact match | 2.1% |
105
+ | Validity | 100% |
106
+ | Vocab diversity | 97.8% |
107
+
108
+ ## Limitations
109
+
110
+ - Requires precomputed embeddings from OpenAI `text-embedding-3-small`.
111
+ Other embedding models will produce poor results.
112
+ - Trained on English web content. Non-English or domain-specific text
113
+ may produce generic or inaccurate slugs.
114
+ - Slugs reflect patterns in the training URLs, which include SEO-influenced
115
+ and editorially inconsistent sources.
116
+
117
+ ## Links
118
+
119
+ - [Blog post](https://hash.dev/blog/vec2slug)
120
+ - [Training code](https://github.com/hashintel/labs)
121
+ - [Vec2Slug V1-Small](https://huggingface.co/hashintel/vec2slug-v1-small)
122
+
123
+ ## Citation
124
+
125
+ ```bibtex
126
+ @misc{vec2slug2025,
127
+ title={vec2slug: URL Slug Generation from Text Embeddings},
128
+ author={Mahmoud, Bilal},
129
+ year={2025},
130
+ url={https://github.com/hashintel/labs}
131
+ }
132
+ ```
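The README reports Token F1 (macro) without defining it. A minimal sketch of one common reading follows: split each slug on `-`, take the multiset overlap between predicted and reference words, and average the per-sample F1. The function names `token_f1` and `macro_token_f1` are hypothetical, and the word-level multiset definition is an assumption; the training code may compute the metric differently (for example on BPE subwords).

```python
from collections import Counter


def token_f1(predicted: str, reference: str) -> float:
    """Word-level F1 between two kebab-case slugs (assumed definition)."""
    pred = [w for w in predicted.split("-") if w]
    ref = [w for w in reference.split("-") if w]
    if not pred or not ref:
        return 0.0
    # Multiset intersection: each shared word counts at most as often
    # as it appears in both slugs.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def macro_token_f1(pairs: list[tuple[str, str]]) -> float:
    """Unweighted mean of per-sample F1 over (predicted, reference) pairs."""
    if not pairs:
        return 0.0
    return sum(token_f1(p, r) for p, r in pairs) / len(pairs)


if __name__ == "__main__":
    # Three of four words match in each direction, so P = R = F1 = 0.75.
    print(token_f1("how-neural-networks-learn", "how-neural-nets-learn"))
```

Under this reading, a score near 0.3 alongside a 2.1% exact-match rate would mean that predictions usually share some words with the source URL slug but rarely reproduce it in full.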
history.train.jsonl ADDED
@@ -0,0 +1,35 @@
1
+ {"step": 2000, "epoch": 2, "train_loss": 4.106095605669401, "val_loss": 3.5962798564910887, "tok_f1": 0.1444776942012236, "mean_words": 4.9355, "lr": 0.0003, "wall_time": 1779222902.593431}
2
+ {"step": 4000, "epoch": 3, "train_loss": 3.413264048563655, "val_loss": 3.324702338409424, "tok_f1": 0.2016711557233616, "mean_words": 4.7035, "lr": 0.0003, "wall_time": 1779223948.1597261}
3
+ {"step": 6000, "epoch": 4, "train_loss": 3.2278357425950444, "val_loss": 3.215394763946533, "tok_f1": 0.22583124481727423, "mean_words": 4.8295, "lr": 0.0003, "wall_time": 1779224990.457532}
4
+ {"step": 8000, "epoch": 5, "train_loss": 3.1305528082016782, "val_loss": 3.154838008880615, "tok_f1": 0.24623787544155193, "mean_words": 4.788, "lr": 0.0003, "wall_time": 1779226040.401222}
5
+ {"step": 10000, "epoch": 6, "train_loss": 3.0648372187956725, "val_loss": 3.1069913497924806, "tok_f1": 0.24793856168341463, "mean_words": 4.969, "lr": 0.0003, "wall_time": 1779227077.782281}
6
+ {"step": 12000, "epoch": 7, "train_loss": 3.0171820744484097, "val_loss": 3.0755839088439942, "tok_f1": 0.2596720575176457, "mean_words": 4.941, "lr": 0.0003, "wall_time": 1779228114.887825}
7
+ {"step": 14000, "epoch": 8, "train_loss": 2.9800491321919766, "val_loss": 3.052339796447754, "tok_f1": 0.2559486954222248, "mean_words": 5.0645, "lr": 0.0003, "wall_time": 1779229152.6426811}
8
+ {"step": 16000, "epoch": 9, "train_loss": 2.9487837862643955, "val_loss": 3.0353144744873046, "tok_f1": 0.26408666970284617, "mean_words": 4.755, "lr": 0.0003, "wall_time": 1779230163.828542}
9
+ {"step": 18000, "epoch": 11, "train_loss": 2.918691721206019, "val_loss": 3.02962755279541, "tok_f1": 0.2688920691236868, "mean_words": 5.12, "lr": 0.0003, "wall_time": 1779231166.284846}
10
+ {"step": 20000, "epoch": 12, "train_loss": 2.896327673036307, "val_loss": 3.018091846084595, "tok_f1": 0.2724128310415075, "mean_words": 4.878, "lr": 0.0003, "wall_time": 1779232199.0462751}
11
+ {"step": 22000, "epoch": 13, "train_loss": 2.87821229420086, "val_loss": 3.0077461536407473, "tok_f1": 0.27315177722604195, "mean_words": 5.1035, "lr": 0.0003, "wall_time": 1779233254.125132}
12
+ {"step": 24000, "epoch": 14, "train_loss": 2.8617876689077253, "val_loss": 2.998493883895874, "tok_f1": 0.2770256465756466, "mean_words": 4.8905, "lr": 0.0003, "wall_time": 1779234303.187704}
13
+ {"step": 26000, "epoch": 15, "train_loss": 2.846088374496488, "val_loss": 2.9906312114715576, "tok_f1": 0.27703381985661396, "mean_words": 4.896, "lr": 0.0003, "wall_time": 1779235355.874115}
14
+ {"step": 28000, "epoch": 16, "train_loss": 2.8328490578439105, "val_loss": 2.983960963058472, "tok_f1": 0.2795972222222222, "mean_words": 4.9435, "lr": 0.0003, "wall_time": 1779236406.483165}
15
+ {"step": 30000, "epoch": 17, "train_loss": 2.820020103981039, "val_loss": 2.97227031211853, "tok_f1": 0.28214252634620285, "mean_words": 5.0595, "lr": 0.0003, "wall_time": 1779237476.096491}
16
+ {"step": 32000, "epoch": 18, "train_loss": 2.8092726084687767, "val_loss": 2.968260679626465, "tok_f1": 0.28473659257409256, "mean_words": 4.924, "lr": 0.0003, "wall_time": 1779238523.0281012}
17
+ {"step": 34000, "epoch": 20, "train_loss": 2.79349008795453, "val_loss": 2.977187242126465, "tok_f1": 0.2865114801864802, "mean_words": 4.9075, "lr": 0.0003, "wall_time": 1779239577.0179908}
18
+ {"step": 36000, "epoch": 21, "train_loss": 2.783505980300933, "val_loss": 2.9694487785339354, "tok_f1": 0.288755238062591, "mean_words": 4.858, "lr": 0.0003, "wall_time": 1779240639.721827}
19
+ {"step": 38000, "epoch": 22, "train_loss": 2.774734211295068, "val_loss": 2.965319557952881, "tok_f1": 0.2830145099181864, "mean_words": 4.9315, "lr": 0.0003, "wall_time": 1779241690.4812958}
20
+ {"step": 40000, "epoch": 23, "train_loss": 2.7663396469081585, "val_loss": 2.960056104660034, "tok_f1": 0.29040886058386056, "mean_words": 4.988, "lr": 0.0003, "wall_time": 1779242737.81421}
21
+ {"step": 42000, "epoch": 24, "train_loss": 2.75957179015756, "val_loss": 2.957438604736328, "tok_f1": 0.2905343975468975, "mean_words": 4.9165, "lr": 0.0003, "wall_time": 1779243786.238262}
22
+ {"step": 44000, "epoch": 25, "train_loss": 2.7523164791037815, "val_loss": 2.9523234798431397, "tok_f1": 0.29058897613824086, "mean_words": 4.9375, "lr": 0.0003, "wall_time": 1779244830.177305}
23
+ {"step": 46000, "epoch": 26, "train_loss": 2.7447811277795235, "val_loss": 2.9494457813262938, "tok_f1": 0.28798811188811185, "mean_words": 5.0245, "lr": 0.0003, "wall_time": 1779245868.350689}
24
+ {"step": 48000, "epoch": 27, "train_loss": 2.7385771292894536, "val_loss": 2.946452843475342, "tok_f1": 0.28848719752469754, "mean_words": 4.876, "lr": 0.0003, "wall_time": 1779246903.871413}
25
+ {"step": 50000, "epoch": 29, "train_loss": 2.728870005215236, "val_loss": 2.957064482879639, "tok_f1": 0.290686912515589, "mean_words": 4.911, "lr": 0.0003, "wall_time": 1779247946.015985}
26
+ {"step": 52000, "epoch": 30, "train_loss": 2.7219258368258132, "val_loss": 2.9526238201141357, "tok_f1": 0.2944186653216065, "mean_words": 4.7625, "lr": 0.0003, "wall_time": 1779248976.7653491}
27
+ {"step": 54000, "epoch": 31, "train_loss": 2.7171959208950165, "val_loss": 2.9489395374298097, "tok_f1": 0.28971268453768456, "mean_words": 4.812, "lr": 0.0003, "wall_time": 1779250006.8798962}
28
+ {"step": 56000, "epoch": 32, "train_loss": 2.711857982278668, "val_loss": 2.949110791015625, "tok_f1": 0.29125145589704415, "mean_words": 4.9335, "lr": 0.0003, "wall_time": 1779251042.63035}
29
+ {"step": 58000, "epoch": 33, "train_loss": 2.7074541541301547, "val_loss": 2.9462409435272217, "tok_f1": 0.2962148821766469, "mean_words": 4.908, "lr": 0.0003, "wall_time": 1779252071.914974}
30
+ {"step": 60000, "epoch": 34, "train_loss": 2.70361461964871, "val_loss": 2.944313480758667, "tok_f1": 0.29103940960999786, "mean_words": 4.9475, "lr": 0.0003, "wall_time": 1779253094.764807}
31
+ {"step": 62000, "epoch": 35, "train_loss": 2.698599462362122, "val_loss": 2.942076708984375, "tok_f1": 0.29306238744915214, "mean_words": 4.841, "lr": 0.0003, "wall_time": 1779254107.271397}
32
+ {"step": 64000, "epoch": 36, "train_loss": 2.6947960017598676, "val_loss": 2.937381767654419, "tok_f1": 0.295903315556992, "mean_words": 4.934, "lr": 0.0003, "wall_time": 1779255123.1438122}
33
+ {"step": 66000, "epoch": 38, "train_loss": 2.687774037942866, "val_loss": 2.948435255050659, "tok_f1": 0.2897239565989566, "mean_words": 4.964, "lr": 0.0003, "wall_time": 1779256134.4341109}
34
+ {"step": 68000, "epoch": 39, "train_loss": 2.6818021759542097, "val_loss": 2.9472034103393554, "tok_f1": 0.2949354034854035, "mean_words": 4.946, "lr": 0.0003, "wall_time": 1779257196.270357}
35
+ {"step": 70000, "epoch": 40, "train_loss": 2.678613240182306, "val_loss": 2.9431504138946534, "tok_f1": 0.29160234944793767, "mean_words": 5.032, "lr": 0.0003, "wall_time": 1779258283.2707899}
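Each line of `history.train.jsonl` is one evaluation record, so the curves can be inspected with the standard library alone. The sketch below parses four of the records above (trimmed to four keys for brevity) and picks the best step by a chosen metric. `load_history` and `best_record` are hypothetical helper names, not part of the repository.

```python
import json

# Four records copied from the diff above, trimmed to four fields each.
SAMPLE = """\
{"step": 58000, "epoch": 33, "val_loss": 2.9462409435272217, "tok_f1": 0.2962148821766469}
{"step": 62000, "epoch": 35, "val_loss": 2.942076708984375, "tok_f1": 0.29306238744915214}
{"step": 64000, "epoch": 36, "val_loss": 2.937381767654419, "tok_f1": 0.295903315556992}
{"step": 66000, "epoch": 38, "val_loss": 2.948435255050659, "tok_f1": 0.2897239565989566}
"""


def load_history(text: str) -> list[dict]:
    """Parse JSON Lines text into a list of records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def best_record(records: list[dict], key: str, lowest: bool = True) -> dict:
    """Return the record with the lowest (or highest) value for `key`."""
    chooser = min if lowest else max
    return chooser(records, key=lambda record: record[key])


if __name__ == "__main__":
    records = load_history(SAMPLE)
    print(best_record(records, "val_loss", lowest=True)["step"])
    print(best_record(records, "tok_f1", lowest=False)["step"])
```

On the full file, the lowest `val_loss` falls at step 64000, which agrees with `best_step` in `manifest.train.json`. The highest `tok_f1` falls at a different step (58000), so the two selection criteria do not pick the same checkpoint.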
inference.py ADDED
@@ -0,0 +1,528 @@
1
+ # /// script
2
+ # requires-python = ">=3.12"
3
+ # dependencies = [
4
+ # "numpy>=1.24",
5
+ # "onnxruntime>=1.16",
6
+ # ]
7
+ # ///
8
+ """vec2slug: generate URL slugs from text embeddings.
9
+
10
+ Standalone inference script for vec2slug models. Loads an ONNX (or
11
+ PyTorch) model and its JSON sidecar, runs beam search decoding, and
12
+ returns kebab-case slugs.
13
+
14
+ Usage as a library:
15
+
16
+ from inference import OnnxPredictor
17
+ predictor = OnnxPredictor.from_dir(".")
18
+ slugs = predictor.predict(embeddings) # [N, input_dim] float32
19
+
20
+ Usage from the command line:
21
+
22
+ uv run inference.py . # random demo
23
+ uv run inference.py . --input embeddings.npy # real embeddings
24
+
25
+ PyTorch backend (requires torch):
26
+
27
+ from inference import PyTorchPredictor
28
+ predictor = PyTorchPredictor.from_dir(".")
29
+ """
30
+
31
+ from __future__ import annotations
32
+
33
+ import argparse
34
+ import json
35
+ import sys
36
+ from abc import ABC, abstractmethod
37
+ from pathlib import Path
38
+ from typing import TypedDict
39
+
40
+ import numpy as np
41
+
42
+
43
+ class ModelConfig(TypedDict):
44
+ input_dim: int
45
+ embed_dim: int
46
+ num_heads: int
47
+ num_layers: int
48
+ max_slug_tokens: int
49
+ vocab_size: int
50
+
51
+
52
+ class TokenConfig(TypedDict):
53
+ pad: int
54
+ bos: int
55
+ eos: int
56
+ unk: int
57
+ hyphen: int
58
+
59
+
60
+ class BeamSearchConfig(TypedDict):
61
+ beam_width: int
62
+ length_reward: float
63
+ reward_cap: int
64
+ min_decode_tokens: int
65
+ min_slug_words: int
66
+
67
+
68
+ class Sidecar(TypedDict):
69
+ model: ModelConfig
70
+ tokens: TokenConfig
71
+ vocab: dict[str, str] # token_id (str) -> token
72
+ beam_search: BeamSearchConfig
73
+ stopwords: list[str]
74
+
75
+
76
+ def _log_softmax(x: np.ndarray) -> np.ndarray:
77
+ """Numerically stable log-softmax over a 1-D array."""
78
+ x_max = x.max()
79
+ shifted = x - x_max
80
+ return shifted - np.log(np.exp(shifted).sum())
81
+
82
+
83
+ class SlugPredictor(ABC):
84
+ """Beam search slug predictor. Subclasses provide the forward pass."""
85
+
86
+ def __init__(self, sidecar: Sidecar):
87
+ tokens = sidecar["tokens"]
88
+ self.pad_idx = tokens["pad"]
89
+ self.bos_idx = tokens["bos"]
90
+ self.eos_idx = tokens["eos"]
91
+ self.unk_idx = tokens["unk"]
92
+ self.hyphen_idx = tokens["hyphen"]
93
+
94
+ self.id_to_token: dict[int, str] = {
95
+ int(k): v for k, v in sidecar["vocab"].items()
96
+ }
97
+
98
+ beam = sidecar["beam_search"]
99
+ self.beam_width: int = beam["beam_width"]
100
+ self.length_reward: float = beam["length_reward"]
101
+ self.reward_cap: int = beam["reward_cap"]
102
+ self.min_decode_tokens: int = beam["min_decode_tokens"]
103
+ self.min_slug_words: int = beam["min_slug_words"]
104
+ self.max_length: int = sidecar["model"]["max_slug_tokens"]
105
+ self.max_content_tokens: int = max(self.max_length - 1, 0)
106
+
107
+ self.stopwords: frozenset[str] = frozenset(sidecar["stopwords"])
108
+
109
+ def predict(self, embeddings: np.ndarray) -> list[str]:
110
+ """Predict slugs for a batch of embeddings.
111
+
112
+ Args:
113
+ embeddings: float32 array of shape [N, input_dim].
114
+
115
+ Returns:
116
+ List of kebab-case slug strings, one per embedding.
117
+ """
118
+ slugs = []
119
+ for i in range(len(embeddings)):
120
+ candidates = self._beam_search(embeddings[i : i + 1])
121
+ slugs.append(candidates[0][0] if candidates else "")
122
+ return slugs
123
+
124
+ def predict_topk(
125
+ self, embeddings: np.ndarray, k: int = 5
126
+ ) -> list[list[tuple[str, float]]]:
127
+ """Return top-k slug candidates with scores for each embedding."""
128
+ results = []
129
+ for i in range(len(embeddings)):
130
+ candidates = self._beam_search(embeddings[i : i + 1])
131
+ results.append(candidates[:k])
132
+ return results
133
+
134
+ @abstractmethod
135
+ def _forward(self, embeddings: np.ndarray, token_ids: np.ndarray) -> np.ndarray:
136
+ """Run the model: (embeddings, token_ids) -> logits.
137
+
138
+ Args:
139
+ embeddings: [batch, input_dim] float32
140
+ token_ids: [batch, seq_len] int64
141
+
142
+ Returns:
143
+ logits: [batch, seq_len, vocab_size] float32
144
+ """
145
+ raise NotImplementedError
146
+
147
+ def _decode_tokens(self, indices: list[int]) -> str:
148
+ """Decode token indices to a slug string, stopping at EOS."""
149
+ parts: list[str] = []
150
+ for idx in indices:
151
+ if idx == self.eos_idx:
152
+ break
153
+ if idx in (self.pad_idx, self.bos_idx):
154
+ continue
155
+ if idx == self.hyphen_idx:
156
+ parts.append("-")
157
+ else:
158
+ token = self.id_to_token.get(idx)
159
+ if token is not None:
160
+ parts.append(token)
161
+ return "".join(parts)
162
+
163
+ def _score(self, log_prob: float, tokens: list[int]) -> float:
164
+ """Score a completed beam using bounded additive length reward.
165
+
166
+ score = log_prob + r * min(word_count, B) + penalties
167
+ """
168
+ slug = self._decode_tokens(tokens).strip("-")
169
+ words = slug.split("-") if slug else []
170
+ word_count = len([w for w in words if w])
171
+
172
+ score = log_prob + self.length_reward * min(word_count, self.reward_cap)
173
+
174
+ # Trailing stopword penalty
175
+ if words and words[-1] in self.stopwords:
176
+ score -= 1.0
177
+
178
+ # Repetition penalty
179
+ content = [w for w in words if w and w not in self.stopwords]
180
+ if len(content) != len(set(content)):
181
+ score -= 2.0
182
+
183
+ return score
184
+
185
+ def _partial_score(self, log_prob: float, tokens: list[int]) -> float:
186
+ """Optimistic partial score for active beam ranking."""
187
+ slug = self._decode_tokens(tokens).strip("-")
188
+ words = [w for w in slug.split("-") if w] if slug else []
189
+ return log_prob + self.length_reward * min(len(words), self.reward_cap)
190
+
191
+ def _beam_search(self, embedding: np.ndarray) -> list[tuple[str, float]]:
192
+ """Beam search with score-based optimal stopping.
193
+
194
+ Uses bounded additive length reward with the Huang et al. (2017)
195
+ stopping criterion: stop when the best completed beam provably
196
+ dominates every active beam's upper bound.
197
+ """
198
+ bos = self.bos_idx
199
+ eos = self.eos_idx
200
+ pad = self.pad_idx
201
+ unk = self.unk_idx
202
+ k = self.beam_width
203
+ r = self.length_reward
204
+ B = self.reward_cap
205
+
206
+ active: list[tuple[float, list[int]]] = [(0.0, [bos])]
207
+ best_finished_score = -float("inf")
208
+ completed: list[tuple[float, list[int]]] = []
209
+ stopped_by_bound = False
210
+
211
+ for _step in range(self.max_length):
212
+ if not active:
213
+ break
214
+
215
+ candidates: list[tuple[float, list[int]]] = []
216
+
217
+ # Batch all active beams into a single forward pass
218
+ max_len = max(len(t) for _, t in active)
219
+ padded = [t + [pad] * (max_len - len(t)) for _, t in active]
220
+ input_ids = np.array(padded, dtype=np.int64)
221
+ embedding_batch = np.tile(embedding, (len(active), 1))
222
+
223
+ all_logits = self._forward(embedding_batch, input_ids)
224
+
225
+ for beam_idx, (log_prob, tokens) in enumerate(active):
226
+ next_logits = all_logits[beam_idx, len(tokens) - 1, :].copy()
227
+ content_length = len(tokens) - 1 # exclude BOS
228
+ force_eos = content_length >= self.max_content_tokens
229
+
230
+ # Suppress PAD and UNK always
231
+ next_logits[pad] = -np.inf
232
+ if unk is not None:
233
+ next_logits[unk] = -np.inf
234
+
235
+ if force_eos:
236
+ # Force EOS, but charge its model probability
237
+ log_probs = _log_softmax(next_logits)
238
+ top_indices = np.array([eos])
239
+ else:
240
+ if content_length < self.min_decode_tokens:
241
+ next_logits[eos] = -np.inf
242
+
243
+ slug_so_far = self._decode_tokens(tokens[1:]).strip("-")
244
+ words = slug_so_far.split("-") if slug_so_far else []
245
+ if len(words) < self.min_slug_words:
246
+ next_logits[eos] = -np.inf
247
+
248
+ if words and words[-1] in self.stopwords:
249
+ next_logits[eos] = -np.inf
250
+
251
+ log_probs = _log_softmax(next_logits)
252
+ top_count = min(k, len(log_probs))
253
+ top_indices = np.argpartition(log_probs, -top_count)[-top_count:]
254
+ top_indices = top_indices[np.argsort(log_probs[top_indices])[::-1]]
255
+
256
+ for j in range(len(top_indices)):
257
+ token_id = int(top_indices[j])
258
+ token_lp = float(log_probs[token_id])
259
+ if not np.isfinite(token_lp):
260
+ continue
261
+ new_log_prob = log_prob + token_lp
262
+ new_tokens = tokens + [token_id]
263
+
264
+ if token_id == eos:
265
+ score = self._score(new_log_prob, new_tokens)
266
+ completed.append((new_log_prob, new_tokens))
267
+ best_finished_score = max(best_finished_score, score)
268
+ else:
269
+ candidates.append((new_log_prob, new_tokens))
270
+
271
+ # Rank by partial objective for consistent pruning
272
+ candidates.sort(
273
+ key=lambda x: self._partial_score(x[0], x[1]), reverse=True
274
+ )
275
+ active = candidates[:k]
276
+
277
+ # Optimal stopping: best completed dominates all active upper bounds
278
+ if active and best_finished_score > -float("inf"):
279
+ max_active_lp = max(lp for lp, _ in active)
280
+ upper_bound = max_active_lp + r * B
281
+ if best_finished_score >= upper_bound:
282
+ stopped_by_bound = True
283
+ break
284
+
285
+ # Force-finish active beams by charging EOS probability
286
+ if active and not stopped_by_bound:
287
+ max_len = max(len(t) for _, t in active)
288
+ padded = [t + [pad] * (max_len - len(t)) for _, t in active]
289
+ input_ids = np.array(padded, dtype=np.int64)
290
+ embedding_batch = np.tile(embedding, (len(active), 1))
291
+ finish_logits = self._forward(embedding_batch, input_ids)
292
+
293
+ for bi, (log_prob, tokens) in enumerate(active):
294
+ nl = finish_logits[bi, len(tokens) - 1, :].copy()
295
+ nl[pad] = -np.inf
296
+ if unk is not None:
297
+ nl[unk] = -np.inf
298
+ lp = _log_softmax(nl)
299
+ eos_lp = float(lp[eos])
300
+ if np.isfinite(eos_lp):
301
+ completed.append((log_prob + eos_lp, tokens + [eos]))
302
+ else:
303
+ completed.append((log_prob - 5.0, tokens + [eos]))
304
+
305
+ # Deduplicate and rank
306
+ scored = [
307
+ (self._score(log_prob, tokens), tokens)
308
+ for log_prob, tokens in completed
309
+ ]
310
+ scored.sort(key=lambda x: -x[0])
311
+
312
+ seen: set[str] = set()
313
+ results: list[tuple[str, float]] = []
314
+ for score, tokens in scored:
315
+ slug = self._decode_tokens(tokens).strip("-")
316
+ if not slug or slug in seen:
317
+ continue
318
+ seen.add(slug)
319
+ results.append((slug, score))
320
+
321
+ return results
322
+
323
+
324
+ class OnnxPredictor(SlugPredictor):
325
+ """ONNX Runtime inference. No torch dependency."""
326
+
327
+ def __init__(self, session, sidecar: Sidecar):
328
+ super().__init__(sidecar)
329
+ self.session = session
330
+
331
+ @classmethod
332
+ def from_dir(cls, model_dir: str | Path) -> OnnxPredictor:
333
+ """Load from a directory containing model.onnx and model.json."""
334
+ import onnxruntime as ort
335
+
336
+ model_dir = Path(model_dir)
337
+ session = ort.InferenceSession(str(model_dir / "model.onnx"))
338
+ sidecar = json.loads((model_dir / "model.json").read_text())
339
+ return cls(session, sidecar)
340
+
341
+ def _forward(self, embeddings: np.ndarray, token_ids: np.ndarray) -> np.ndarray:
342
+ return self.session.run(
343
+ None,
344
+ {"src_embedding": embeddings, "token_ids": token_ids},
345
+ )[0]
346
+
347
+
348
+ def _load_pytorch_model(model_dir: Path, model_config: ModelConfig):
349
+ """Build and load the SlugDecoder. Requires torch.
350
+
351
+ The model is a prefix-conditioned transformer decoder: the source
352
+ embedding is projected into decoder space and placed at position 0,
353
+ followed by BOS and autoregressive token embeddings.
354
+ """
355
+ import torch
356
+ from torch import Tensor, nn
357
+
358
+ class DecoderBlock(nn.Module):
359
+ def __init__(self, embed_dim: int, num_heads: int, dropout: float):
360
+ super().__init__()
361
+ self.ln1 = nn.LayerNorm(embed_dim)
362
+ self.attn = nn.MultiheadAttention(
363
+ embed_dim, num_heads, dropout=dropout, batch_first=True
364
+ )
365
+ self.ln2 = nn.LayerNorm(embed_dim)
366
+ self.ffn = nn.Sequential(
367
+ nn.Linear(embed_dim, embed_dim * 4),
368
+ nn.GELU(),
369
+ nn.Dropout(dropout),
370
+ nn.Linear(embed_dim * 4, embed_dim),
371
+ nn.Dropout(dropout),
372
+ )
373
+
374
+ def forward(self, x: Tensor, attn_mask: Tensor) -> Tensor:
375
+ normed = self.ln1(x)
376
+ x = (
377
+ x
378
+ + self.attn(
379
+ normed, normed, normed, attn_mask=attn_mask, is_causal=True
380
+ )[0]
381
+ )
382
+ x = x + self.ffn(self.ln2(x))
383
+ return x
384
+
385
+ class SlugDecoder(nn.Module):
386
+ def __init__(
387
+ self,
388
+ vocab_size: int,
389
+ embed_dim: int,
390
+ num_heads: int,
391
+ num_layers: int,
392
+ input_dim: int,
393
+ max_length: int,
394
+ dropout: float = 0.1,
395
+ ):
396
+ super().__init__()
397
+ self.embed_dim = embed_dim
398
+ self.max_length = max_length
399
+ self.embedding_projection = nn.Linear(input_dim, embed_dim)
400
+ self.token_embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
401
+ self.position_embedding = nn.Embedding(max_length + 1, embed_dim)
402
+ self.dropout = nn.Dropout(dropout)
403
+ self.blocks = nn.ModuleList(
404
+ [DecoderBlock(embed_dim, num_heads, dropout) for _ in range(num_layers)]
405
+ )
406
+ self.ln_final = nn.LayerNorm(embed_dim)
407
+ self.output_projection = nn.Linear(embed_dim, vocab_size)
408
+
409
+ def forward(self, embeddings: Tensor, target_ids: Tensor) -> Tensor:
410
+ prefix = self.embedding_projection(embeddings).unsqueeze(1)
411
+ token_emb = self.token_embedding(target_ids)
412
+ seq = torch.cat([prefix, token_emb], dim=1)
413
+ positions = torch.arange(seq.size(1), device=seq.device)
414
+ seq = seq + self.position_embedding(positions)
415
+ seq = self.dropout(seq)
416
+ attn_mask = nn.Transformer.generate_square_subsequent_mask(
417
+ seq.size(1), device=seq.device
418
+ )
419
+ for block in self.blocks:
420
+ seq = block(seq, attn_mask)
421
+ seq = self.ln_final(seq)
422
+ return self.output_projection(seq[:, 1:, :])
423
+
424
+ model = SlugDecoder(
425
+ vocab_size=model_config["vocab_size"],
426
+ embed_dim=model_config["embed_dim"],
427
+ num_heads=model_config["num_heads"],
428
+ num_layers=model_config["num_layers"],
429
+ input_dim=model_config["input_dim"],
430
+ max_length=model_config["max_slug_tokens"],
431
+ )
432
+ model.load_state_dict(
433
+ torch.load(model_dir / "model.pt", map_location="cpu", weights_only=True)
434
+ )
435
+ model.eval()
436
+ return model
437
+
438
+
439
+ class PyTorchPredictor(SlugPredictor):
440
+ """PyTorch inference. Requires: pip install torch"""
441
+
442
+ def __init__(self, model, sidecar: Sidecar):
443
+ super().__init__(sidecar)
444
+ self.model = model
445
+
446
+ @classmethod
447
+ def from_dir(cls, model_dir: str | Path) -> PyTorchPredictor:
448
+ """Load from a directory containing model.pt and model.json."""
449
+ model_dir = Path(model_dir)
450
+ sidecar = json.loads((model_dir / "model.json").read_text())
451
+ model = _load_pytorch_model(model_dir, sidecar["model"])
452
+ return cls(model, sidecar)
453
+
454
+ def _forward(self, embeddings: np.ndarray, token_ids: np.ndarray) -> np.ndarray:
455
+ import torch
456
+
457
+ with torch.no_grad():
458
+ logits = self.model(
459
+ torch.from_numpy(embeddings),
460
+ torch.from_numpy(token_ids),
461
+ )
462
+ return logits.numpy()
463
+
464
+
465
+ def main():
466
+ parser = argparse.ArgumentParser(
467
+ description="Generate URL slugs from text embeddings",
468
+ )
469
+ parser.add_argument(
470
+ "model_dir",
471
+ type=Path,
472
+ help="Directory containing model.onnx and model.json",
473
+ )
474
+ parser.add_argument(
475
+ "--input",
476
+ type=Path,
477
+ default=None,
478
+ help="Path to .npy file with embeddings (shape [N, input_dim])",
479
+ )
480
+ parser.add_argument(
481
+ "--backend",
482
+ choices=["onnx", "pytorch"],
483
+ default="onnx",
484
+ help="Inference backend (default: onnx)",
485
+ )
486
+ parser.add_argument(
487
+ "--topk",
488
+ type=int,
489
+ default=1,
490
+ help="Number of candidates per embedding (default: 1)",
491
+ )
492
+ args = parser.parse_args()
493
+
494
+ # Load model
495
+ if args.backend == "onnx":
496
+ predictor = OnnxPredictor.from_dir(args.model_dir)
497
+ else:
498
+ predictor = PyTorchPredictor.from_dir(args.model_dir)
499
+
500
+ # Load or generate embeddings
501
+ sidecar = json.loads((args.model_dir / "model.json").read_text())
502
+ input_dim = sidecar["model"]["input_dim"]
503
+
504
+ if args.input is not None:
505
+ embeddings = np.load(args.input).astype(np.float32)
506
+ print(f"Loaded {len(embeddings)} embeddings from {args.input}", file=sys.stderr)
507
+ else:
508
+ embeddings = np.random.randn(3, input_dim).astype(np.float32)
509
+ print(
510
+ "No --input provided, using random embeddings (results will be nonsensical)",
511
+ file=sys.stderr,
512
+ )
513
+
514
+ # Predict
515
+ if args.topk > 1:
516
+ results = predictor.predict_topk(embeddings, k=args.topk)
517
+ for i, candidates in enumerate(results):
518
+ print(f"[{i}]")
519
+ for slug, score in candidates:
520
+ print(f" {score:+.2f} {slug}")
521
+ else:
522
+ slugs = predictor.predict(embeddings)
523
+ for slug in slugs:
524
+ print(slug)
525
+
526
+
527
+ if __name__ == "__main__":
528
+ main()
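The stopping rule in `_beam_search` is easier to follow in isolation. The sketch below restates the bounded length reward from the `_score` docstring (penalties omitted) and the bound check that ends the search early. `bounded_score` and `can_stop` are hypothetical names, and the values of `R` and `CAP` are illustrative assumptions; the real `length_reward` and `reward_cap` live in `model.json`, which the diff does not render.

```python
from __future__ import annotations


def bounded_score(log_prob: float, word_count: int, r: float, cap: int) -> float:
    """log_prob + r * min(word_count, cap); the reward stops growing at `cap` words."""
    return log_prob + r * min(word_count, cap)


def can_stop(best_finished: float, active_log_probs: list[float], r: float, cap: int) -> bool:
    """True once no active beam can still beat the best finished hypothesis.

    Appending tokens can only lower a beam's log-probability, and the
    penalties only subtract, so an active beam's final score is at most
    its current log-probability plus the full reward r * cap.
    """
    if not active_log_probs:
        return True
    upper_bound = max(active_log_probs) + r * cap
    return best_finished >= upper_bound


if __name__ == "__main__":
    R, CAP = 0.5, 4  # assumed values, for illustration only
    best = bounded_score(-3.0, word_count=5, r=R, cap=CAP)  # -3.0 + 0.5 * 4
    print(best)
    print(can_stop(best, [-3.5, -4.2], R, CAP))  # upper bound is -1.5
    print(can_stop(best, [-2.0, -4.2], R, CAP))  # upper bound is 0.0
```

Capping the reward is what makes the bound finite: with an uncapped additive reward, a longer hypothesis could always overtake a finished one in principle, and the search could not stop before `max_slug_tokens`.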
manifest.onnx.json ADDED
@@ -0,0 +1,35 @@
1
+ {
2
+ "exported_at": "2026-05-23T17:36:12.280199+00:00",
3
+ "torch_version": "2.12.0",
4
+ "artifacts": [
5
+ "model.onnx"
6
+ ],
7
+ "sidecar": "model.json",
8
+ "onnx_size_bytes": 99694368,
9
+ "sidecar_size_bytes": 105072,
10
+ "verification": {
11
+ "onnxruntime_version": "1.26.0",
12
+ "random_inputs": {
13
+ "batch_1_max_diff": 1.9073486328125e-05,
14
+ "batch_4_max_diff": 2.6226043701171875e-05
15
+ },
16
+ "real_embeddings": {
17
+ "prediction_set": "seq2seq_bpe_d512_l6_t24_eos_seq2seq_test.parquet",
18
+ "n_samples": 5000,
19
+ "tolerance": {
20
+ "atol": 0.0001,
21
+ "rtol": 1e-05
22
+ },
23
+ "max_absolute_diff": 2.8908252716064453e-05,
24
+ "mean_absolute_diff": 2.8676997771981405e-06,
25
+ "p95_absolute_diff": 2.0503997802734375e-05,
26
+ "p99_absolute_diff": 2.342522202525288e-05,
27
+ "argmax_agreement": 5000,
28
+ "argmax_agreement_rate": 1.0,
29
+ "wilson_ci_95": [
30
+ 0.9992322698624194,
31
+ 1.0
32
+ ]
33
+ }
34
+ }
35
+ }
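The `wilson_ci_95` entry can be checked by hand. The sketch below computes a Wilson score interval for 5000 agreements out of 5000 samples. The manifest does not state the critical value; `z = 1.96` is an assumption, chosen because it reproduces the recorded lower bound. `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - margin), min(1.0, center + margin)


if __name__ == "__main__":
    low, high = wilson_interval(5000, 5000)
    print(low, high)
```

With every sample agreeing, the lower bound simplifies to `n / (n + z**2)`, about 0.99923. Unlike the normal approximation, the interval does not collapse to a single point, so the manifest still reports a meaningful lower bound on the ONNX and PyTorch argmax agreement rate.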
manifest.train.json ADDED
@@ -0,0 +1,45 @@
1
+ {
2
+ "schema_version": 1,
3
+ "variant": "seq2seq",
4
+ "encoder": "openai",
5
+ "seed": 42,
6
+ "compression": null,
7
+ "tokenizer": "bpe",
8
+ "model": {
9
+ "input_dim": 1536,
10
+ "vocab_size": 5000,
11
+ "embed_dim": 512,
12
+ "num_heads": 8,
13
+ "num_layers": 6,
14
+ "dropout": 0.1,
15
+ "max_slug_tokens": 24
16
+ },
17
+ "training": {
18
+ "lr": 0.0003,
19
+ "weight_decay": 0.0001,
20
+ "batch_size": 1024,
21
+ "patience": 10,
22
+ "epochs": 50,
23
+ "eval_every": 2000,
24
+ "val_max_samples": 5000,
25
+ "checkpoint_every": 5000,
26
+ "keep_last_checkpoints": 5,
27
+ "f1_n_samples": 2000
28
+ },
29
+ "results": {
30
+ "best_val_loss": 2.937381767654419,
31
+ "best_step": 64000,
32
+ "total_steps": 64000,
33
+ "n_params": 24840072
34
+ },
35
+ "artifacts": [
36
+ "best.pt",
37
+ "tokenizer.json",
38
+ "history.jsonl",
39
+ "step_040000.pt",
40
+ "step_045000.pt",
41
+ "step_050000.pt",
42
+ "step_055000.pt",
43
+ "step_060000.pt"
44
+ ]
45
+ }
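The `n_params` value can be cross-checked against the `model` block of this manifest and the layer shapes in the `SlugDecoder` class from `inference.py`. The sketch below counts weights and biases per layer in plain Python. The helper names are hypothetical, and the count assumes PyTorch defaults (biases enabled, a fused `3 * embed_dim` attention in-projection).

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus bias of a dense layer."""
    return in_features * out_features + out_features


def layer_norm_params(dim: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * dim


def count_parameters(
    input_dim: int,
    vocab_size: int,
    embed_dim: int,
    num_layers: int,
    max_slug_tokens: int,
) -> int:
    block = (
        layer_norm_params(embed_dim)                 # ln1
        + linear_params(embed_dim, 3 * embed_dim)    # attention q/k/v in-projection
        + linear_params(embed_dim, embed_dim)        # attention out-projection
        + layer_norm_params(embed_dim)               # ln2
        + linear_params(embed_dim, 4 * embed_dim)    # FFN up-projection
        + linear_params(4 * embed_dim, embed_dim)    # FFN down-projection
    )
    return (
        linear_params(input_dim, embed_dim)          # embedding_projection
        + vocab_size * embed_dim                     # token_embedding
        + (max_slug_tokens + 1) * embed_dim          # position_embedding (prefix + tokens)
        + num_layers * block
        + layer_norm_params(embed_dim)               # ln_final
        + linear_params(embed_dim, vocab_size)       # output_projection
    )


if __name__ == "__main__":
    total = count_parameters(
        input_dim=1536, vocab_size=5000, embed_dim=512, num_layers=6, max_slug_tokens=24
    )
    print(total)  # 24840072, matching results.n_params
```

`num_heads` does not appear in the count because splitting attention into heads reshapes the same projection matrices without adding weights. About three quarters of the parameters (18,914,304) sit in the six decoder blocks.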
model.json ADDED
The diff for this file is too large to render. See raw diff
 
model.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1cc982c9e13af132fa31fdabf8cd3b3be04660f12f3bc72706273cb57bbc8f9f
3
+ size 99694368
model.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c22ce1b1b571d2eec1498b09f24afca25b7b1a4848587bab2cfa26f39c81e33e
3
+ size 99382065
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff