---
license: apache-2.0
language: en
tags:
- vintage
- historical
- conversational
- custom_code
base_model: lewtun/talkie-1930-13b-it-hf
---

# talkie-1930-13b-it-hf (modeling fork)

This is a fork of [lewtun/talkie-1930-13b-it-hf](https://huggingface.co/lewtun/talkie-1930-13b-it-hf), itself a Transformers-format port of [talkie-lm/talkie-1930-13b-it](https://huggingface.co/talkie-lm/talkie-1930-13b-it). The weights are bit-identical to the upstream port; only `modeling_talkie.py` is modified.

What it adds:

1. KV cache support, so each decode step in `model.generate` or a custom decode loop costs O(N) attention work instead of the O(N²) needed to re-encode the whole sequence.
2. `output_hidden_states` and `output_attentions` populate the standard transformers fields, so probes, steering tools, and other interpretability work that pulls per-layer activations or attention maps get real tensors instead of `None`.
3. `attention_mask` and `position_ids` are honored properly, so left-padded batched generation works end to end.

## Why KV cache

The upstream port's `prepare_inputs_for_generation` returns the full accumulated `input_ids` on every step, and `forward` ignores `past_key_values` entirely. That is correct, but every step re-encodes the whole sequence, so per-step cost is quadratic in sequence length. It also breaks tools that manage their own cache state: when they pass only the new token forward, the model sees a single token at position 0 with no context and produces gibberish.

This fork plumbs `past_key_values` through `TalkieSelfAttention` → `TalkieDecoderLayer` → `TalkieModel` → `TalkieForCausalLM` using the standard `transformers.cache_utils.DynamicCache` API. RoPE is sliced (or gathered, when explicit `position_ids` are passed) for the new positions only, the cache is updated post-RoPE and post-RMSNorm so the stored representations are time-translation invariant, and the updated cache is returned on `CausalLMOutputWithPast.past_key_values` like any normal HF model.
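For tools that drive decoding themselves, a cached greedy loop might look roughly like the sketch below. It assumes the fork follows the usual Hugging Face calling convention (`input_ids`, `past_key_values`, and `use_cache` keyword arguments, an output object with `.logits` and `.past_key_values`, and a fresh cache created when `past_key_values=None`). `greedy_decode` is a hypothetical helper name, not part of this repo, and the sketch covers a single unpadded sequence, so no `attention_mask` or `position_ids` are passed.

```python
def greedy_decode(model, input_ids, max_new_tokens):
    """Greedy decoding that feeds only the newest token after the prefill.

    Hypothetical helper: `model` is any callable following the Hugging Face
    causal-LM convention; `input_ids` is a [B, S] tensor of prompt tokens.
    Returns a list of [B, 1] tensors, one per generated step.
    """
    new_tokens = []
    past = None             # no cache yet: the first call is the prefill
    step_input = input_ids
    for _ in range(max_new_tokens):
        out = model(input_ids=step_input, past_key_values=past, use_cache=True)
        past = out.past_key_values  # cache now covers every token seen so far
        # Logits at the last position predict the next token.
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        new_tokens.append(next_token)
        step_input = next_token     # later calls pass one new token per sequence
    return new_tokens
```

With a real model, run this under `torch.inference_mode()` and join the result with `torch.cat(new_tokens, dim=1)`. There is no EOS check, so the loop always runs for `max_new_tokens` steps.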

## Numerical equivalence

Prefill logits are bit-identical to the no-cache path. Per-token decode logits drift by up to 0.5 in bf16, because concatenated K/V and full-sequence K/V hit slightly different SDPA kernels and accumulate floating-point operations in a different order. The greedy argmax is preserved at every tested token, so greedy generation produces identical token sequences.
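One way to run a similar check yourself is to compare the two sets of logits directly. The sketch below is a hypothetical helper that works on plain nested lists (for example `logits[0].float().tolist()` from each path); it is not the script that produced the numbers above.

```python
def compare_logits(cached, reference):
    """Compare two [S, V] logit tables given as nested lists of floats.

    Returns (max_abs_diff, argmax_matches): the largest element-wise gap and
    whether the greedy choice agrees at every position.
    """
    max_abs_diff = 0.0
    argmax_matches = True
    for row_c, row_r in zip(cached, reference):
        for c, r in zip(row_c, row_r):
            max_abs_diff = max(max_abs_diff, abs(c - r))
        # The index of the largest logit in a row is the greedy token.
        best_c = max(range(len(row_c)), key=row_c.__getitem__)
        best_r = max(range(len(row_r)), key=row_r.__getitem__)
        argmax_matches = argmax_matches and best_c == best_r
    return max_abs_diff, argmax_matches
```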

## Hidden states and attentions

Pass `output_hidden_states=True` and you get a length-`num_layers + 1` tuple of `[B, S, hidden_size]` tensors: the post-embedding-norm input to layer 0, followed by the output of each block in order, with the last block's output reported after the final norm. Pass `output_attentions=True` and you get a length-`num_layers` tuple of `[B, num_heads, q_len, kv_len]` softmax-weight tensors. When weights are requested, the attention path falls back from the fast SDPA kernel to a manual `softmax(q @ k.T / sqrt(d))` matmul, with the softmax computed in fp32 to keep row sums clean.
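A quick way to sanity-check these outputs is to summarize their shapes. The helper below is a hypothetical sketch; it relies only on the standard `hidden_states` and `attentions` fields and on each tensor exposing `.shape`.

```python
def summarize_outputs(out):
    """Return tuple lengths and distinct shapes for a causal-LM output.

    Hypothetical helper: `out` is the object returned by a forward call made
    with output_hidden_states=True and output_attentions=True.
    """
    hidden = out.hidden_states  # expected: num_layers + 1 entries of [B, S, H]
    attn = out.attentions       # expected: num_layers entries of [B, heads, q, kv]
    return {
        "num_hidden_states": len(hidden),
        "hidden_shapes": sorted({tuple(h.shape) for h in hidden}),
        "num_attentions": len(attn),
        "attention_shapes": sorted({tuple(a.shape) for a in attn}),
    }
```

For an uncached prompt of `S` tokens, `summarize_outputs(model(**inputs, output_hidden_states=True, output_attentions=True))` should report a single hidden shape `(B, S, hidden_size)` and a single attention shape `(B, num_heads, S, S)`.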

## Attention mask and position ids

`attention_mask` is accepted in the standard `[B, kv_len]` shape, or as `[B, q_len]` covering only the new tokens, in which case it is front-padded with ones for the cached positions. It composes with the causal mask appropriate to the regime: full triangular on prefill, none on single-token decode, lower-right offset on multi-token re-feed. When a mask is provided, the SDPA call gets an explicit 4D bool mask instead of the `is_causal` flag.
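The composition described above can be sketched with plain Python lists (a real implementation would use boolean tensors). The function name and the convention that `True` means "may attend" are assumptions made for illustration. The causal part uses lower-right alignment, which reduces to the full triangle on prefill and to no masking on single-token decode.

```python
def build_4d_mask(attention_mask, q_len, kv_len):
    """Combine a padding mask with a causal mask into [B, 1, q_len, kv_len].

    `attention_mask` holds 1 for real tokens and 0 for padding, shaped either
    [B, kv_len] or [B, q_len]. True in the result means "may attend".
    """
    offset = kv_len - q_len  # number of cached positions before the new tokens
    full = []
    for row in attention_mask:
        if len(row) == q_len and q_len != kv_len:
            # Mask covers only the new tokens: treat cached positions as real.
            row = [1] * offset + list(row)
        if len(row) != kv_len:
            raise ValueError("attention_mask must be [B, kv_len] or [B, q_len]")
        # Query i sits at absolute position i + offset and sees keys up to it.
        grid = [[bool(row[j]) and j <= i + offset for j in range(kv_len)]
                for i in range(q_len)]
        full.append([grid])  # singleton head dimension broadcasts over heads
    return full
```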

`position_ids` is accepted in `[B, q_len]` form. When provided, RoPE gathers per-token instead of slicing, which is the path you want for left-padded batched generation: positions for unmasked tokens are correct, positions for padded tokens get a placeholder (they're masked out of attention anyway). When omitted, `prepare_inputs_for_generation` derives positions from the cumulative attention mask, matching the llama/qwen pattern.
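Deriving positions from the cumulative attention mask can be sketched as follows, using plain lists so the logic is easy to follow. The helper name and the placeholder value `1` for padded slots are assumptions; the source only says that padded tokens get a placeholder.

```python
def position_ids_from_mask(attention_mask, pad_position=1):
    """Derive [B, S] position ids from a [B, S] 0/1 attention mask.

    Real tokens get 0, 1, 2, ... in order; padded slots get `pad_position`
    (an arbitrary placeholder, since those slots are masked out of attention).
    """
    position_ids = []
    for row in attention_mask:
        seen = 0  # running count of real tokens, i.e. the cumsum of the mask
        ids = []
        for m in row:
            seen += m
            ids.append(seen - 1 if m else pad_position)
        position_ids.append(ids)
    return position_ids


# A left-padded row next to a full-length row in the same batch.
batch = position_ids_from_mask([[0, 0, 1, 1, 1], [1, 1, 1, 1, 1]])
```

The tensor form commonly seen in Llama-style `prepare_inputs_for_generation` code is `attention_mask.long().cumsum(-1) - 1` followed by `masked_fill_(attention_mask == 0, 1)`, with only the last `q_len` columns kept during cached decoding. Whether this fork uses the same placeholder value is not confirmed here.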

## Causal mask regimes

`TalkieSelfAttention.forward` distinguishes three regimes. They apply both to the SDPA fast path and to the explicit-mask path used when an `attention_mask` is supplied or `output_attentions=True`:

1. Prefill, `q_len == k_len`, no past: standard upper-left triangular mask. The SDPA fast path uses `is_causal=True`.
2. Single-token decode, `q_len == 1` with past: the new query attends to every past key and to itself, so no mask is needed. The SDPA fast path uses `is_causal=False`. Setting `is_causal=True` here would wrongly mask everything but `k[0]`, because SDPA aligns its causal triangle to the upper left.
3. Multi-token re-feed, `q_len > 1` with past: explicit lower-right causal mask aligned to the end of `k`, since the q positions correspond to the last `q_len` positions of the full sequence.
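The three regimes are all instances of one lower-right-aligned mask, which the following sketch builds with plain lists. `causal_mask` and its `align` parameter are hypothetical names used for illustration; the `upper_left` option mimics the alignment that `is_causal=True` applies to non-square shapes.

```python
def causal_mask(q_len, kv_len, align="lower_right"):
    """Boolean [q_len][kv_len] mask; True means the query may see the key.

    "lower_right" aligns the triangle to the end of the keys (query i is the
    token at absolute position i + kv_len - q_len). "upper_left" aligns it to
    the start of the keys instead.
    """
    offset = kv_len - q_len if align == "lower_right" else 0
    return [[j <= i + offset for j in range(kv_len)] for i in range(q_len)]


prefill = causal_mask(3, 3)  # full lower triangle
decode = causal_mask(1, 4)   # one query row that sees all four keys
refeed = causal_mask(2, 4)   # last two positions of a 4-token sequence
wrong = causal_mask(1, 4, align="upper_left")  # only k[0] stays visible
```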

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "a9lim/talkie-1930-13b-it-hf-cached"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# `dtype` is the keyword in recent transformers releases; older ones use `torch_dtype`.
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, dtype="bfloat16"
).to("cuda")

messages = [{"role": "user", "content": "Tell me about life in 1925."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# `model.generate` will now use the KV cache automatically.
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

## Model details

See the [upstream port's model card](https://huggingface.co/lewtun/talkie-1930-13b-it-hf) for architecture details, training data provenance, and chat template. Everything described there applies; only the inference path differs.

## License

Apache 2.0, matching the upstream model.