---
license: apache-2.0
language: en
tags:
- vintage
- historical
- conversational
- custom_code
base_model: lewtun/talkie-1930-13b-it-hf
---

# talkie-1930-13b-it-hf (modeling fork)

This is a fork of [lewtun/talkie-1930-13b-it-hf](https://huggingface.co/lewtun/talkie-1930-13b-it-hf), itself a Transformers-format port of [talkie-lm/talkie-1930-13b-it](https://huggingface.co/talkie-lm/talkie-1930-13b-it). The weights are bit-identical to the upstream port; only `modeling_talkie.py` is modified.

What it adds:

1. KV cache support, so `model.generate` and any custom decode loop run in O(N) instead of O(N²) per token.
2. `output_hidden_states` and `output_attentions` populate the standard transformers fields, so probes, steering tools, and any other interpretability work that pulls per-layer activations or attention maps gets real tensors instead of `None`.
3. `attention_mask` and `position_ids` are honored properly, so left-padded batched generation works end to end.

## Why KV cache

The upstream port's `prepare_inputs_for_generation` returns the full accumulated `input_ids` on every step, and `forward` ignores `past_key_values` entirely. That's correct but quadratic in sequence length, and tools that manage their own cache state and pass only the new token forward end up producing gibberish because the model sees a single token at position 0 with no context.

This fork plumbs `past_key_values` through `TalkieSelfAttention` → `TalkieDecoderLayer` → `TalkieModel` → `TalkieForCausalLM` using the standard `transformers.cache_utils.DynamicCache` API. RoPE is sliced (or gathered, when explicit `position_ids` are passed) for the new positions only, the cache is updated post-RoPE and post-RMSNorm so the stored representations are time-translation invariant, and the updated cache is returned on `CausalLMOutputWithPast.past_key_values` like any normal HF model.

## Numerical equivalence

Bit-identical prefill logits versus the no-cache path.
Per-token decode logits drift by ≤ 0.5 in bf16, since concatenated K/V vs full-sequence K/V hit slightly different SDPA kernels and accumulate floating-point ops in a different order, but greedy argmax is preserved across all tested tokens. Greedy generation produces identical token sequences.

## Hidden states and attentions

Pass `output_hidden_states=True` and you get a length-`num_layers + 1` tuple of `[B, S, hidden_size]` tensors: the post-embedding-norm input to layer 0, followed by the output of each block in order, with the final entry taken after the final norm.

Pass `output_attentions=True` and you get a length-`num_layers` tuple of `[B, num_heads, q_len, kv_len]` softmax-weight tensors. The attention path falls back from the fast SDPA kernel to a manual `softmax(qk/sqrt(d))` matmul when weights are requested, with the softmax in fp32 to keep row sums clean.

## Attention mask and position ids

`attention_mask` is accepted in the standard `[B, kv_len]` shape (or `[B, q_len]` covering only new tokens, which gets front-padded with ones for cached positions). It composes with the appropriate causal mask for the regime: full triangular on prefill, none on single-token decode, lower-right offset on multi-token re-feed. When the mask is provided, the SDPA call gets an explicit 4D bool mask instead of the `is_causal` flag.

`position_ids` is accepted in `[B, q_len]` form. When provided, RoPE gathers per-token instead of slicing, which is the path you want for left-padded batched generation: positions for unmasked tokens are correct, positions for padded tokens get a placeholder (they're masked out of attention anyway). When omitted, `prepare_inputs_for_generation` derives positions from the cumulative attention mask, matching the llama/qwen pattern.

## Causal mask regimes

Three regimes in `TalkieSelfAttention.forward`.
They apply both to the SDPA fast path and to the explicit-mask path used when an `attention_mask` is supplied or `output_attentions=True`:

1. Prefill, `q_len == k_len`, no past: standard upper-left triangular mask. SDPA fast path uses `is_causal=True`.
2. Single-token decode, `q_len == 1` with past: the new query attends to every past key and itself, no mask needed. SDPA fast path uses `is_causal=False`. Setting `is_causal=True` here would wrongly mask everything but `k[0]`.
3. Multi-token re-feed, `q_len > 1` with past: explicit lower-right causal mask aligned to the end of `k`, since the q positions correspond to the last `q_len` positions of the full sequence.

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "a9lim/talkie-1930-13b-it-hf-cached"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, dtype="bfloat16"
).to("cuda")

messages = [{"role": "user", "content": "Tell me about life in 1925."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# `model.generate` will now use the KV cache automatically.
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

## Model details

See the [upstream port's model card](https://huggingface.co/lewtun/talkie-1930-13b-it-hf) for architecture details, training data provenance, and chat template. Everything described there applies; only the inference path differs.

## License

Apache 2.0, matching the upstream model.
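
## Appendix: the mask regimes as a toy example

The three causal mask regimes described above can be illustrated with a small dependency-free sketch. This is not the code in `modeling_talkie.py`; the helper name `causal_allowed` and its `lower_right` flag are hypothetical, and the grids are plain Python lists rather than tensors. It only shows why a lower-right-aligned triangle covers all three cases, and why an upper-left-aligned one breaks single-token decode.

```python
def causal_allowed(q_len: int, kv_len: int, lower_right: bool = True) -> list[list[bool]]:
    """Return a [q_len][kv_len] grid; True means query i may attend to key j.

    Hypothetical helper for illustration only.
    lower_right=True aligns the triangle to the end of the keys, treating the
    queries as the last q_len positions of the full sequence.
    lower_right=False aligns it to the start, like an upper-left causal mask.
    """
    offset = kv_len - q_len if lower_right else 0
    return [[j <= i + offset for j in range(kv_len)] for i in range(q_len)]


# Prefill: q_len == kv_len, so both alignments give the same triangle.
prefill = causal_allowed(3, 3)

# Single-token decode: one query over 4 keys (3 cached + itself), nothing masked.
decode = causal_allowed(1, 4)

# Multi-token re-feed: 2 new queries over 4 keys (2 cached + 2 new).
refeed = causal_allowed(2, 4)

# Upper-left alignment on a decode step: only k[0] stays visible.
wrong_decode = causal_allowed(1, 4, lower_right=False)

for name, grid in [("prefill", prefill), ("decode", decode),
                   ("refeed", refeed), ("wrong_decode", wrong_decode)]:
    print(name)
    for row in grid:
        print("  " + " ".join("x" if allowed else "." for allowed in row))
```

In the printed grids, `x` marks an allowed query/key pair. The `wrong_decode` row keeps only the first key, which is the failure mode the single-token regime avoids by disabling the causal flag.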