---
language: en
library_name: mlx
pipeline_tag: text-generation
base_model:
  - lewtun/talkie-1930-13b-it-hf
license: apache-2.0
tags:
  - talkie
  - vintage
  - historical
  - conversational
  - mlx
---

# Talkie 1930 13B Instruct — MLX

MLX port of [`lewtun/talkie-1930-13b-it-hf`](https://huggingface.co/lewtun/talkie-1930-13b-it-hf) for Apple Silicon. Refer to the upstream model card for training-data, evaluation, and provenance details — this card covers only the MLX conversion.

Talkie is a 13B instruction-tuned decoder-only transformer whose outputs are styled as pre-1930s English prose. It uses a custom architecture (custom RoPE convention, weightless RMSNorm, per-head and per-layer scalar gains, embedding-skip residuals, scaled `lm_head` weights) that is not currently in `transformers/models/`.

Native Talkie support was added to [`mlx-lm`](https://github.com/ml-explore/mlx-lm) in [PR #1231](https://github.com/ml-explore/mlx-lm/pull/1231).

## Variants

| Repo | Quantization | bpw | Approx. size |
|------|--------------|-----|--------------|
| [`warshanks/talkie-1930-13b-it-mlx-bf16`](https://huggingface.co/warshanks/talkie-1930-13b-it-mlx-bf16) | none (bf16) | 16 | 25 GB |
| [`warshanks/talkie-1930-13b-it-mlx-8bit`](https://huggingface.co/warshanks/talkie-1930-13b-it-mlx-8bit) | affine 8-bit, group 64 | 8.5 | 13 GB |
| [`warshanks/talkie-1930-13b-it-mlx-6bit`](https://huggingface.co/warshanks/talkie-1930-13b-it-mlx-6bit) | affine 6-bit, group 64 | 6.5 | 10 GB |
| [`warshanks/talkie-1930-13b-it-mlx-4bit`](https://huggingface.co/warshanks/talkie-1930-13b-it-mlx-4bit) | mixed 4-bit (`lm_head=q8`, `embed=bf16`, blocks 14/37/38=q8, rest q4) | 5.18 | 8 GB |
| [`warshanks/talkie-1930-13b-it-mlx-4bit-DWQ`](https://huggingface.co/warshanks/talkie-1930-13b-it-mlx-4bit-DWQ) | DWQ-calibrated 4-bit | 4.5 | 7 GB |

For 4-bit, prefer the **DWQ** build. Bare q4 of this model degrades into repetition on long generations; DWQ calibration recovers clean output (validation loss 0.037 vs ≈0.25 for bare q4 in our run).
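
The bpw figures for the uniform affine builds in the table above can be reproduced with a little arithmetic. The sketch below assumes that each group of 64 weights stores one 16-bit scale and one 16-bit bias next to the packed weights; `affine_bpw` is a hypothetical helper written for this card, not an `mlx-lm` function.

```python
def affine_bpw(bits: int, group_size: int = 64,
               scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Effective bits per weight for group-wise affine quantization.

    Assumption: one scale and one bias per group, each stored in 16 bits.
    """
    return bits + (scale_bits + bias_bits) / group_size


for bits in (8, 6, 4):
    print(f"{bits}-bit, group 64 -> {affine_bpw(bits):.1f} bpw")
```

Under that assumption the 8-, 6-, and 4-bit rows come out at 8.5, 6.5, and 4.5 bpw. The mixed `-mlx-4bit` build sits higher (5.18) because some of its layers are kept at 8-bit or bf16.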

## Installation

```bash
pip install -U mlx-lm
```

Talkie support requires an `mlx-lm` release that includes [PR #1231](https://github.com/ml-explore/mlx-lm/pull/1231). If your installed release predates it, install from source:

```bash
pip install -U git+https://github.com/ml-explore/mlx-lm
```

## Basic generation

```python
from mlx_lm import load, generate

model, tokenizer = load("warshanks/talkie-1930-13b-it-mlx-4bit-DWQ")

messages = [{"role": "user", "content": "Write an essay predicting what life will be like in the year 1960."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

text = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
```

CLI:

```bash
mlx_lm.generate \
  --model warshanks/talkie-1930-13b-it-mlx-4bit-DWQ \
  --prompt "What were the causes of the French Revolution?" \
  --max-tokens 512 --temp 0.7
```

## Multi-turn chat

```python
from mlx_lm import load, generate

model, tokenizer = load("warshanks/talkie-1930-13b-it-mlx-4bit-DWQ")

# First turn: render the chat template and generate a reply.
messages = [{"role": "user", "content": "What were the causes of the French Revolution?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
reply = generate(model, tokenizer, prompt=prompt, max_tokens=512)

# Second turn: append the reply and the follow-up so the model sees the full history.
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "Which of those causes was the most significant?"})
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(generate(model, tokenizer, prompt=prompt, max_tokens=512))
```
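
For longer conversations, the append-and-regenerate pattern above can be wrapped in a small helper. This is a minimal sketch: `chat_turn` and `make_mlx_reply_fn` are hypothetical names, and the generation call simply mirrors the snippet above.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def chat_turn(messages: List[Message], user_text: str,
              reply_fn: Callable[[List[Message]], str]) -> str:
    """Append a user turn, obtain a reply, and record it in the history."""
    messages.append({"role": "user", "content": user_text})
    reply = reply_fn(messages)
    messages.append({"role": "assistant", "content": reply})
    return reply


def make_mlx_reply_fn(model, tokenizer, max_tokens: int = 512):
    """Build a reply_fn from an already-loaded mlx-lm model and tokenizer."""
    # Imported lazily so chat_turn itself has no hard dependency on mlx-lm.
    from mlx_lm import generate

    def reply_fn(messages: List[Message]) -> str:
        prompt = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)

    return reply_fn
```

With a loaded model this would be used as `reply_fn = make_mlx_reply_fn(model, tokenizer)` followed by repeated `chat_turn(history, text, reply_fn)` calls. The helper does not trim old turns, so the caller still has to keep the history within the model's 2048-token context.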

## Chat template

```
<|system|>{system_message}<|end|><|user|>{user_message}<|end|><|assistant|>{assistant_message}<|end|>
```

Applied automatically by `tokenizer.apply_chat_template()`.
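
To inspect or debug prompts without loading the tokenizer, the layout above can be approximated in a few lines. This is a sketch based only on the template string shown in this section: `render_talkie_prompt` is a hypothetical helper, and the Jinja template bundled with the tokenizer remains authoritative (it may treat whitespace or special tokens differently).

```python
from typing import Dict, List


def render_talkie_prompt(messages: List[Dict[str, str]],
                         add_generation_prompt: bool = True) -> str:
    """Render messages as <|role|>content<|end|> segments, in order."""
    parts = [f"<|{m['role']}|>{m['content']}<|end|>" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append("<|assistant|>")
    return "".join(parts)


print(render_talkie_prompt(
    [{"role": "user", "content": "What were the causes of the French Revolution?"}]
))
# prints: <|user|>What were the causes of the French Revolution?<|end|><|assistant|>
```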

## Architecture (as observed in the source checkpoint and modeling code)

| Component | Value |
|-----------|-------|
| Parameters | 13B |
| Layers | 40 |
| Attention heads | 40 (MHA, no GQA) |
| Hidden size | 5120 |
| Head dimension | 128 |
| Intermediate size (MLP) | 13696 |
| Position encoding | RoPE (θ = 1,000,000), inverse-rotation convention |
| Activation | SwiGLU |
| Normalization | weightless RMSNorm (pre-norm) |
| Context length | 2048 |
| Vocabulary | 65,540 |
| Precision | bfloat16 |

**Architectural quirks the MLX port reproduces:**

- **Custom RoPE** — formula `y1 = x1*cos + x2*sin`, `y2 = -x1*sin + x2*cos` (rotation by **−θ**, the inverse of the HF/Llama convention). `mx.fast.rope` is not directly usable; the port ships a small `TalkieRoPE` class.
- **Weightless RMSNorm** — applied at the embedding output, before each attention block, before each MLP block, on the post-RoPE Q and K tensors, and before the final `lm_head`. No learned scale; reduction in fp32 then cast back.
- **Per-head Q gain** — learnable scalar per attention head applied to queries after RoPE + Q-norm.
- **Per-layer scalar gains** — `attn_gain` and `mlp_gain` (initialized to `(2L)^-0.5`) scale the residual contributions; `embed_skip` (initialized to `0.0`) scales an extra residual from the post-first-norm embedding into every block.
- **lm_head with weight gain** — stored as a raw `(vocab, hidden)` parameter plus a scalar `lm_head_gain`. Folded into a regular `nn.Linear` weight in `sanitize()` so quantization treats it normally.
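
The first two quirks can be illustrated with plain Python floats. This is a sketch rather than the port's `TalkieRoPE` code: the function names are hypothetical, the `eps` value and the frequency schedule (`position * base^(-2i/d)`) are assumptions, and how the port pairs up dimensions within a head is not shown here.

```python
import math
from typing import List, Tuple


def rmsnorm_weightless(x: List[float], eps: float = 1e-6) -> List[float]:
    """RMSNorm with no learned scale: x / sqrt(mean(x^2) + eps). eps is assumed."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms for v in x]


def talkie_rotate(x1: float, x2: float, angle: float) -> Tuple[float, float]:
    """Convention from this card: rotate the pair (x1, x2) by -angle."""
    c, s = math.cos(angle), math.sin(angle)
    return x1 * c + x2 * s, -x1 * s + x2 * c


def llama_rotate(x1: float, x2: float, angle: float) -> Tuple[float, float]:
    """HF/Llama convention: rotate the pair (x1, x2) by +angle."""
    c, s = math.cos(angle), math.sin(angle)
    return x1 * c - x2 * s, x1 * s + x2 * c


def rope_angle(position: int, pair_index: int,
               head_dim: int = 128, base: float = 1_000_000.0) -> float:
    """Assumed standard RoPE schedule for the given position and pair index."""
    return position * base ** (-2.0 * pair_index / head_dim)


angle = rope_angle(position=5, pair_index=3)
rotated = talkie_rotate(0.3, -1.2, angle)
restored = llama_rotate(*rotated, angle)  # +angle undoes -angle
print(rotated, restored)
print(rmsnorm_weightless([3.0, 4.0]))
```

Because the two conventions are exact inverses, applying `llama_rotate` to the output of `talkie_rotate` with the same angle recovers the input, which is why a stock rotation kernel cannot simply be dropped in.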

## Conversion details

These weights were produced by running `mlx_lm.convert` on `lewtun/talkie-1930-13b-it-hf` after adding the new `talkie` model module to `mlx-lm`. The conversion was generated and validated with the [`transformers-to-mlx` skill](https://github.com/anthropics/skills).

Numerical agreement vs the upstream `transformers` model on a 94-token paragraph prompt (CPU, bf16 both sides):

```
Logits diff:    max=2.0000   mean=0.0785   median=0.0625
Top-10 overlap: 10/10  (last position)
Top-1 agreement: 98.9% (across all 94 positions)
```

These differences are within the range typically seen between bf16 `transformers` and MLX implementations of the same model.
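
For readers who want to run a similar check, the metrics above can be computed as follows. This is a sketch under assumptions: it takes logits as nested Python lists of shape `(positions, vocab)`, and the metric definitions are a plausible reading of the report rather than the exact script used. `compare_logits` and `top_k_ids` are hypothetical names.

```python
from statistics import mean, median
from typing import Dict, List, Sequence


def top_k_ids(row: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest logits, highest first."""
    return sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]


def compare_logits(ref: List[List[float]], test: List[List[float]],
                   k: int = 10) -> Dict[str, float]:
    """Compare two logit grids of identical shape (positions x vocab)."""
    # Element-wise absolute differences over every position and token.
    diffs = [abs(a - b) for r, t in zip(ref, test) for a, b in zip(r, t)]
    # Does the argmax token match at each position?
    top1 = [top_k_ids(r, 1) == top_k_ids(t, 1) for r, t in zip(ref, test)]
    # Size of the top-k intersection at the last position only.
    overlap = len(set(top_k_ids(ref[-1], k)) & set(top_k_ids(test[-1], k)))
    return {
        "max": max(diffs),
        "mean": mean(diffs),
        "median": median(diffs),
        "top_k_overlap_last": overlap,
        "top1_agreement": sum(top1) / len(top1),
    }


ref = [[1.0, 2.0, 3.0], [0.5, 0.1, 0.2]]
test = [[1.0, 2.5, 3.0], [0.1, 0.5, 0.2]]
print(compare_logits(ref, test, k=2))
```

As a sanity check on the reported figure, 98.9% agreement over 94 positions is consistent with 93 of the 94 positions matching.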

The 4-bit variants required architecture-aware tuning. Bare `q4` fell into repetition during long greedy decoding, so two recovery paths are shipped:

- **`-mlx-4bit`**: mixed-precision recipe via a custom `quant_predicate`. A per-block sensitivity scan (in-memory `mx.quantize` → `mx.dequantize`, then logit MSE vs bf16) flagged blocks 14, 37, and 38 as outliers. Final config: `lm_head=q8`, `embed=bf16`, blocks {14, 37, 38} at q8, all other Linear layers at q4.
- **`-mlx-4bit-DWQ`**: `mlx_lm.dwq` distillation calibration with the default learning rate (1e-6), 512 samples, 512-token sequences, batch size 1, and gradient checkpointing. Run for 512 iterations; final validation loss 0.037. Beats the mixed-q4 build on long-form generation.
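
The mixed-precision recipe can be expressed as a predicate along the following lines. This is a sketch, not the exact function used for the shipped weights: the module path patterns (`embed`, `lm_head`, `layers.N.` or `blocks.N.`), the zero-based block indexing, and the group size of 64 are all assumptions, and the exact arguments `mlx-lm` passes to a `quant_predicate` may differ between versions.

```python
import re
from typing import Any, Dict, Union

SENSITIVE_BLOCKS = {14, 37, 38}  # outliers flagged by the sensitivity scan
_BLOCK_RE = re.compile(r"(?:layers|blocks)\.(\d+)\.")  # assumed path naming


def talkie_mixed_q4(path: str, module: Any = None,
                    *_: Any) -> Union[bool, Dict[str, int]]:
    """Return False to skip quantization, or per-layer quantization params."""
    # Skip modules that cannot be quantized at all.
    if module is not None and not hasattr(module, "to_quantized"):
        return False
    if "embed" in path:
        return False  # keep embeddings in bf16
    if "lm_head" in path:
        return {"bits": 8, "group_size": 64}
    match = _BLOCK_RE.search(path)
    if match and int(match.group(1)) in SENSITIVE_BLOCKS:
        return {"bits": 8, "group_size": 64}
    return {"bits": 4, "group_size": 64}


for p in ("model.embed_tokens", "lm_head",
          "model.layers.14.mlp.up_proj", "model.layers.3.mlp.up_proj"):
    print(p, "->", talkie_mixed_q4(p))
```

A function of this shape would be handed to the conversion step as its `quant_predicate`; the extra positional arguments are accepted and ignored so the sketch tolerates either a two- or three-argument calling convention.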

`mlx_lm.awq` is not yet supported for `talkie` — the AWQ scaling step requires absorbing an input-scale into the upstream norm's weight, but Talkie's RMSNorms have no learned weight.
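
A toy example makes the constraint concrete. It assumes the usual AWQ-style trick of multiplying each weight column by a per-channel scale `s` and dividing the incoming activation by the same `s`; the helper and the numbers are illustrative only.

```python
from typing import List


def linear(w: List[List[float]], x: List[float]) -> List[float]:
    """Plain matrix-vector product y = W x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


x = [0.5, -2.0, 4.0]                       # output of the preceding norm
w = [[0.1, 0.2, 0.3], [-0.4, 0.5, 0.6]]    # linear layer weights
s = [1.0, 2.0, 4.0]                        # per-input-channel scales

# Scale the weight columns up by s ...
w_scaled = [[wij * sj for wij, sj in zip(row, s)] for row in w]
# ... which is only equivalent if the input is scaled down by s.
x_scaled = [xj / sj for xj, sj in zip(x, s)]

y_ref = linear(w, x)
y_scaled = linear(w_scaled, x_scaled)
print(y_ref, y_scaled)
```

The division by `s` has to happen somewhere upstream of the linear layer. With a learned norm weight `g` it is free, since `g` can be replaced by `g / s`. A weightless RMSNorm has no such parameter, so the scale would need to be absorbed elsewhere or applied as an extra runtime multiply.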

## License

[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) — same as upstream.