talkie-1930-13b-it-hf (modeling fork)

This is a fork of lewtun/talkie-1930-13b-it-hf, itself a Transformers-format port of talkie-lm/talkie-1930-13b-it. The weights are bit-identical to the upstream port; only modeling_talkie.py is modified.

What it adds:

KV cache support, so each decode step in model.generate or a custom decode loop costs O(N) attention work instead of O(N²), where N is the current sequence length.
output_hidden_states and output_attentions populate the standard transformers fields, so probes, steering tools, and any other interpretability work that pulls per-layer activations or attention maps gets real tensors instead of None.
attention_mask and position_ids are honored properly, so left-padded batched generation works end to end.

Why KV cache

The upstream port's prepare_inputs_for_generation returns the full accumulated input_ids on every step, and forward ignores past_key_values entirely. That is correct, but every step pays attention cost quadratic in the sequence length. It also breaks tools that manage their own cache state and pass only the new token forward: they produce gibberish because the model sees a single token at position 0 with no context.

This fork plumbs past_key_values through TalkieSelfAttention → TalkieDecoderLayer → TalkieModel → TalkieForCausalLM using the standard transformers.cache_utils.DynamicCache API. RoPE is sliced (or gathered, when explicit position_ids are passed) for the new positions only, the cache is updated post-RoPE and post-RMSNorm so the stored representations are time-translation invariant, and the updated cache is returned on CausalLMOutputWithPast.past_key_values like any normal HF model.
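To make the cost difference concrete, the sketch below counts query-key score computations for the two strategies. It is a back-of-the-envelope model in plain Python, not code from modeling_talkie.py: the function names are hypothetical, and it counts attention scores only, ignoring MLPs, projections, and everything else.

```python
def attention_pairs_no_cache(prompt_len, new_tokens):
    """Query-key pairs scored when every step re-runs the full sequence."""
    total = 0
    for step in range(new_tokens):
        seq_len = prompt_len + step              # tokens fed to forward this step
        total += seq_len * (seq_len + 1) // 2    # causal: query i scores keys 0..i
    return total


def attention_pairs_with_cache(prompt_len, new_tokens):
    """Query-key pairs scored when only the new token is fed after prefill."""
    total = prompt_len * (prompt_len + 1) // 2   # one causal prefill over the prompt
    for step in range(1, new_tokens):
        kv_len = prompt_len + step               # cached keys plus the new token's key
        total += kv_len                          # a single query against kv_len keys
    return total


# Small case that is easy to check by hand: 4 prompt tokens, 3 generated tokens.
no_cache = attention_pairs_no_cache(4, 3)    # 10 + 15 + 21
cached = attention_pairs_with_cache(4, 3)    # 10 + 5 + 6
print(no_cache, cached)
```

The gap widens quickly with length, since the no-cache total grows cubically in the number of generated tokens while the cached total grows quadratically.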

Numerical equivalence

Prefill logits are bit-identical to the no-cache path. Per-token decode logits drift by at most 0.5 in bf16, because attention over concatenated K/V and over full-sequence K/V can hit slightly different SDPA kernels that accumulate floating-point operations in a different order. Greedy argmax is nonetheless preserved across all tested tokens, and greedy generation produces identical token sequences.
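The two checks described here, a bound on logit drift and agreement of the greedy argmax, can be sketched as follows. This is an illustrative helper in plain Python, not the test harness used for this fork; compare_logits is a hypothetical name and the sample values are made up. The first lines show why accumulation order alone is enough to change results.

```python
# Floating-point addition is not associative, so summation order matters.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # the two orders disagree in the last bit


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def compare_logits(reference, candidate):
    """Return (max absolute difference, whether greedy argmax agrees)."""
    max_abs_diff = max(abs(r - c) for r, c in zip(reference, candidate))
    return max_abs_diff, argmax(reference) == argmax(candidate)


# Made-up logit rows for one decode position: no-cache path vs cached path.
reference = [1.0, 5.0, 2.0]
candidate = [1.25, 4.75, 2.5]
drift, same_token = compare_logits(reference, candidate)
print(drift, same_token)
```

On real outputs the same comparison would be applied per position to the last-token logits from both paths.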

Hidden states and attentions

Pass output_hidden_states=True and you get a tuple of length num_layers + 1 of [B, S, hidden_size] tensors: the post-embedding-norm input to layer 0 comes first, followed by the output of each block in order, with the final entry taken after the final norm. Pass output_attentions=True and you get a tuple of length num_layers of [B, num_heads, q_len, kv_len] softmax-weight tensors. When weights are requested, the attention path falls back from the fast SDPA kernel to a manual softmax(qk/sqrt(d)) matmul, with the softmax computed in fp32 to keep row sums clean.
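For readers who want to see what the manual fallback computes, here is a minimal single-head sketch of softmax(qk/sqrt(d)) with an optional boolean mask. It is plain Python over nested lists rather than the fork's tensor code, attention_weights is a hypothetical name, and Python's double-precision floats stand in for the fp32 upcast. It assumes every query row has at least one unmasked key.

```python
import math


def attention_weights(q, k, mask=None):
    """Softmax attention weights for one head.

    q: [q_len][d] queries, k: [kv_len][d] keys,
    mask[i][j] is True where query i may attend to key j.
    Returns a [q_len][kv_len] list of rows that each sum to 1.
    """
    d = len(q[0])
    weights = []
    for i, q_i in enumerate(q):
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d) for k_j in k]
        if mask is not None:
            scores = [s if mask[i][j] else float("-inf") for j, s in enumerate(scores)]
        top = max(scores)                       # subtract the max for stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
causal = [[True, False], [True, True]]
w = attention_weights(q, k, causal)
print(w[0])        # the first query can only see the first key
print(sum(w[1]))   # each row sums to 1 up to rounding
```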

Attention mask and position ids

attention_mask is accepted in the standard [B, kv_len] shape (or [B, q_len] covering only new tokens, which gets front-padded with ones for cached positions). It composes with the appropriate causal mask for the regime — full triangular on prefill, none on single-token decode, lower-right offset on multi-token re-feed. When the mask is provided, the SDPA call gets an explicit 4D bool mask instead of the is_causal flag.
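The mask handling described above can be sketched for a single sequence as follows. This is an assumption-laden illustration in plain Python, not the fork's implementation: normalize_padding_mask and combined_mask are hypothetical names, and the real code works on batched tensors and produces a 4D bool mask.

```python
def normalize_padding_mask(mask_row, q_len, kv_len):
    """Bring a padding mask to kv_len entries (1 = real token, 0 = padding)."""
    if len(mask_row) == kv_len:
        return list(mask_row)
    if len(mask_row) == q_len:
        # Mask covers only the new tokens: front-pad with ones for cached positions.
        return [1] * (kv_len - q_len) + list(mask_row)
    raise ValueError("mask length must equal q_len or kv_len")


def combined_mask(mask_row, q_len, kv_len):
    """Boolean [q_len][kv_len] mask: True where a query may attend to a key."""
    keep = normalize_padding_mask(mask_row, q_len, kv_len)
    offset = kv_len - q_len   # queries are the last q_len positions of the sequence
    return [
        [bool(keep[j]) and j <= i + offset for j in range(kv_len)]
        for i in range(q_len)
    ]


# Left-padded prefill: one pad token followed by two real tokens.
for row in combined_mask([0, 1, 1], q_len=3, kv_len=3):
    print(row)

# Single-token decode with a mask that covers only the new token.
print(combined_mask([1], q_len=1, kv_len=4))
```

Note that in the prefill case the row belonging to the pad query is fully masked. A real implementation has to handle such rows so the softmax does not produce NaNs; how this fork does so is not shown here.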

position_ids is accepted in [B, q_len] form. When provided, RoPE gathers per-token instead of slicing, which is the path you want for left-padded batched generation: positions for unmasked tokens are correct, positions for padded tokens get a placeholder (they're masked out of attention anyway). When omitted, prepare_inputs_for_generation derives positions from the cumulative attention mask, matching the llama/qwen pattern.
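Deriving positions from the cumulative attention mask can be sketched like this. The function name is hypothetical and the code is plain Python for one sequence. The placeholder value of 1 for padded slots follows the common llama-style pattern (cumulative sum minus one, then fill masked slots with 1); whether this fork uses the same placeholder is an assumption.

```python
def position_ids_from_mask(mask_row, pad_position=1):
    """Positions for one sequence: real tokens count up from 0, pads get a placeholder."""
    positions = []
    next_position = 0
    for is_real in mask_row:
        if is_real:
            positions.append(next_position)   # equals cumsum(mask) - 1 at this slot
            next_position += 1
        else:
            positions.append(pad_position)    # masked out of attention anyway
    return positions


# A left-padded batch of two prompts with different lengths.
batch_mask = [
    [0, 0, 1, 1, 1],
    [1, 1, 1, 1, 1],
]
for row in batch_mask:
    print(position_ids_from_mask(row))
```

With left padding, the first real token of every row lands on position 0 regardless of how much padding precedes it, which is what lets RoPE treat each row as if it were unpadded.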

Causal mask regimes

TalkieSelfAttention.forward distinguishes three regimes. They apply both to the SDPA fast path and to the explicit-mask path used when an attention_mask is supplied or output_attentions=True:

Prefill, q_len == k_len, no past: standard upper-left triangular mask. SDPA fast path uses is_causal=True.

Single-token decode, q_len == 1 with past: the new query attends to every past key and itself, no mask needed. SDPA fast path uses is_causal=False. Setting is_causal=True here would wrongly mask everything but k[0].

Multi-token re-feed, q_len > 1 with past: explicit lower-right causal mask aligned to the end of k, since the q positions correspond to the last q_len positions of the full sequence.
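The three regimes differ only in how the causal triangle is aligned against the keys, which the following sketch makes explicit. It is a plain-Python illustration with a hypothetical causal_mask helper, not the fork's code. The "upper_left" alignment models the mask described above for is_causal=True, and "lower_right" models the mask aligned to the end of k.

```python
def causal_mask(q_len, kv_len, align="lower_right"):
    """Boolean [q_len][kv_len] mask: True where query i may attend to key j."""
    # lower_right: queries are the last q_len positions of the full sequence.
    # upper_left: queries are treated as positions 0..q_len-1.
    offset = kv_len - q_len if align == "lower_right" else 0
    return [[j <= i + offset for j in range(kv_len)] for i in range(q_len)]


# Prefill (q_len == kv_len): both alignments give the same triangle.
print(causal_mask(3, 3, "upper_left") == causal_mask(3, 3, "lower_right"))

# Single-token decode with 3 cached keys: the query should see every key.
print(causal_mask(1, 4, "lower_right"))
# Upper-left alignment would leave only k[0] visible, which is the pitfall noted above.
print(causal_mask(1, 4, "upper_left"))

# Multi-token re-feed: 2 new queries on top of 3 cached keys.
for row in causal_mask(2, 5, "lower_right"):
    print(row)
```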

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "a9lim/talkie-1930-13b-it-hf-cached"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, dtype="bfloat16"
).to("cuda")

messages = [{"role": "user", "content": "Tell me about life in 1925."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# `model.generate` will now use the KV cache automatically.
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

Model details

See the upstream port's model card for architecture details, training data provenance, and chat template. Everything described there applies — only the inference path differs.