---
language: en
library_name: mlx
pipeline_tag: text-generation
base_model:
- lewtun/talkie-1930-13b-it-hf
license: apache-2.0
tags:
- talkie
- vintage
- historical
- conversational
- mlx
---

# Talkie 1930 13B Instruct (MLX)

MLX port of [`lewtun/talkie-1930-13b-it-hf`](https://huggingface.co/lewtun/talkie-1930-13b-it-hf) for Apple Silicon. Refer to the upstream model card for training-data, evaluation, and provenance details; this card covers only the MLX conversion.

Talkie is a 13B instruction-tuned decoder-only transformer whose outputs are styled as pre-1930s English prose. It uses a custom architecture (custom RoPE convention, weightless RMSNorm, per-head and per-layer scalar gains, embedding-skip residuals, scaled `lm_head` weights) that is not currently in `transformers/models/`. Native Talkie support was added to [`mlx-lm`](https://github.com/ml-explore/mlx-lm) in [PR #1231](https://github.com/ml-explore/mlx-lm/pull/1231).

## Variants

| Repo | Quantization | bpw | Approx. size |
|------|--------------|-----|--------------|
| [`warshanks/talkie-1930-13b-it-mlx-bf16`](https://huggingface.co/warshanks/talkie-1930-13b-it-mlx-bf16) | none (bf16) | 16 | 25 GB |
| [`warshanks/talkie-1930-13b-it-mlx-8bit`](https://huggingface.co/warshanks/talkie-1930-13b-it-mlx-8bit) | affine 8-bit, group 64 | 8.5 | 13 GB |
| [`warshanks/talkie-1930-13b-it-mlx-6bit`](https://huggingface.co/warshanks/talkie-1930-13b-it-mlx-6bit) | affine 6-bit, group 64 | 6.5 | 10 GB |
| [`warshanks/talkie-1930-13b-it-mlx-4bit`](https://huggingface.co/warshanks/talkie-1930-13b-it-mlx-4bit) | mixed 4-bit (`lm_head=q8`, `embed=bf16`, blocks 14/37/38=q8, rest q4) | 5.18 | 8 GB |
| [`warshanks/talkie-1930-13b-it-mlx-4bit-DWQ`](https://huggingface.co/warshanks/talkie-1930-13b-it-mlx-4bit-DWQ) | DWQ-calibrated 4-bit | 4.5 | 7 GB |

For 4-bit, prefer the **DWQ** build.
Bare q4 of this model degrades into repetition on long generations; DWQ calibration recovers clean output (validation loss 0.037 vs ≈0.25 for bare q4 in our run).

## Installation

```bash
pip install -U mlx-lm
```

Talkie support requires an `mlx-lm` release that includes [PR #1231](https://github.com/ml-explore/mlx-lm/pull/1231). Until such a release is available, install from source:

```bash
pip install -U git+https://github.com/ml-explore/mlx-lm
```

## Basic generation

```python
from mlx_lm import load, generate

model, tokenizer = load("warshanks/talkie-1930-13b-it-mlx-4bit-DWQ")

messages = [
    {"role": "user", "content": "Write an essay predicting what life will be like in the year 1960."}
]
# Wrap the conversation in Talkie's chat template and open the assistant turn.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

text = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
```

CLI:

```bash
mlx_lm.generate \
  --model warshanks/talkie-1930-13b-it-mlx-4bit-DWQ \
  --prompt "<|user|>What were the causes of the French Revolution?<|end|><|assistant|>" \
  --max-tokens 512 --temp 0.7
```

## Multi-turn chat

```python
from mlx_lm import load, generate

model, tokenizer = load("warshanks/talkie-1930-13b-it-mlx-4bit-DWQ")

# First turn.
messages = [{"role": "user", "content": "What were the causes of the French Revolution?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
reply = generate(model, tokenizer, prompt=prompt, max_tokens=512)

# Append the reply and the follow-up question, then re-apply the template.
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "Which of those causes was the most significant?"})
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(generate(model, tokenizer, prompt=prompt, max_tokens=512))
```

## Chat template

```
<|system|>{system_message}<|end|><|user|>{user_message}<|end|><|assistant|>{assistant_message}<|end|>
```

Applied automatically by `tokenizer.apply_chat_template()`.
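
To make the template above concrete, here is a minimal sketch that builds the same prompt string by hand. The helper name `format_talkie_prompt` is hypothetical and not part of `mlx-lm`; it assumes the template is exactly the concatenation shown, with no extra whitespace or BOS token, which the tokenizer's real template may not guarantee. Prefer `tokenizer.apply_chat_template()` in practice.

```python
def format_talkie_prompt(messages, add_generation_prompt=True):
    """Hypothetical helper: render messages using the template shown above.

    Assumes each message is a dict with "role" in {"system", "user", "assistant"}
    and a string "content", and that turns are concatenated with no separators.
    """
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ("system", "user", "assistant"):
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"<|{role}|>{message['content']}<|end|>")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|assistant|>")
    return "".join(parts)


example = format_talkie_prompt(
    [{"role": "user", "content": "What were the causes of the French Revolution?"}]
)
print(example)
```

For a single user turn this yields the same string that is passed to `--prompt` in the CLI example.
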
## Architecture (as observed in the source checkpoint and modeling code)

| Component | Value |
|-----------|-------|
| Parameters | 13B |
| Layers | 40 |
| Attention heads | 40 (MHA, no GQA) |
| Hidden size | 5120 |
| Head dimension | 128 |
| Intermediate size (MLP) | 13696 |
| Position encoding | RoPE (θ = 1,000,000), inverse-rotation convention |
| Activation | SwiGLU |
| Normalization | weightless RMSNorm (pre-norm) |
| Context length | 2048 |
| Vocabulary | 65,540 |
| Precision | bfloat16 |

**Architectural quirks the MLX port reproduces:**

- **Custom RoPE**: formula `y1 = x1*cos + x2*sin`, `y2 = -x1*sin + x2*cos` (rotation by **−θ**, the inverse of the HF/Llama convention). `mx.fast.rope` is not directly usable; the port ships a small `TalkieRoPE` class.
- **Weightless RMSNorm**: applied at the embedding output, before each attention block, before each MLP block, on the post-RoPE Q and K tensors, and before the final `lm_head`. No learned scale; reduction in fp32 then cast back.
- **Per-head Q gain**: learnable scalar per attention head applied to queries after RoPE + Q-norm.
- **Per-layer scalar gains**: `attn_gain` and `mlp_gain` (initialized to `(2L)^-0.5`) scale the residual contributions; `embed_skip` (initialized to `0.0`) scales an extra residual from the post-first-norm embedding into every block.
- **lm_head with weight gain**: stored as a raw `(vocab, hidden)` parameter plus a scalar `lm_head_gain`. Folded into a regular `nn.Linear` weight in `sanitize()` so quantization treats it normally.

## Conversion details

These weights were produced by running `mlx_lm.convert` on `lewtun/talkie-1930-13b-it-hf` after adding the new `talkie` model module to `mlx-lm`. The conversion was generated and validated with the [`transformers-to-mlx` skill](https://github.com/anthropics/skills).
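
One detail the conversion has to reproduce exactly is the inverse-rotation RoPE convention listed among the architectural quirks above. The following is a minimal pure-Python sketch of that formula on a single `(x1, x2)` pair, not the port's `TalkieRoPE` class. The function names are hypothetical, the frequency schedule in `rope_angle` is assumed to be the standard RoPE one, and how dimensions are grouped into pairs is not specified on this card, so it is left out.

```python
import math

ROPE_BASE = 1_000_000.0  # θ from the architecture table
HEAD_DIM = 128           # head dimension from the architecture table


def rope_angle(position, pair_index, head_dim=HEAD_DIM, base=ROPE_BASE):
    """Angle for one pair, ASSUMING the standard RoPE frequency schedule."""
    return position * base ** (-2.0 * pair_index / head_dim)


def talkie_rotate(x1, x2, angle):
    """Talkie convention from the card: y1 = x1*cos + x2*sin, y2 = -x1*sin + x2*cos."""
    c, s = math.cos(angle), math.sin(angle)
    return x1 * c + x2 * s, -x1 * s + x2 * c


def llama_rotate(x1, x2, angle):
    """HF/Llama convention: rotation of the pair by +angle."""
    c, s = math.cos(angle), math.sin(angle)
    return x1 * c - x2 * s, x1 * s + x2 * c


angle = rope_angle(position=3, pair_index=7)
talkie_out = talkie_rotate(0.3, -1.2, angle)
llama_out = llama_rotate(0.3, -1.2, -angle)  # same pair, negated angle

# The two conventions agree once the angle is negated.
print(talkie_out, llama_out)
```

Under these assumptions the Talkie rotation equals the Llama-style rotation with the angle negated, which is why a kernel hard-wired to the positive-angle convention cannot be used unchanged.
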

Numerical agreement vs the upstream `transformers` model on a 94-token paragraph prompt (CPU, bf16 both sides):

```
Logits diff: max=2.0000 mean=0.0785 median=0.0625
Top-10 overlap: 10/10 (last position)
Top-1 agreement: 98.9% (across all 94 positions)
```

These differences are within the range typically seen between bf16 `transformers` and MLX outputs.

The 4-bit variants required architecture-aware tuning. Bare `q4` produced repetition on long greedy decoding, so two recovery paths are shipped:

- **`-mlx-4bit`**: mixed-precision recipe via a custom `quant_predicate`. A per-block sensitivity scan (in-memory `mx.quantize` → `mx.dequantize`, then logit MSE vs bf16) flagged blocks 14, 37, and 38 as outliers. Final config: `lm_head=q8`, `embed=bf16`, blocks {14, 37, 38} at q8, all other Linear layers at q4.
- **`-mlx-4bit-DWQ`**: `mlx_lm.dwq` distillation calibration with the default learning rate (1e-6), 512 samples, 512-token sequences, batch size 1, and gradient checkpointing. Ran for 512 iterations with a final validation loss of 0.037. Beats the mixed-q4 build on long-form generation.

`mlx_lm.awq` is not yet supported for `talkie`: the AWQ scaling step requires absorbing an input scale into the upstream norm's weight, but Talkie's RMSNorms have no learned weight.

## License

[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0), same as upstream.