
PonderLM-2-Pythia-410m

Pythia-410m architecture pretrained with PonderLM-2, the method introduced in PonderLM-2: Pretraining LLM with Latent Thoughts in Continuous Space (ICML 2026 Spotlight).

TL;DR. Chain-of-Thought scales test-time compute by generating extra tokens. PonderLM-2 does the same at pretraining time, but in continuous space: before predicting each next token, the model first emits a few latent thoughts (extra last-hidden-state vectors) and feeds them back into itself.

   vanilla:      x₁ ──► x₂ ──► x₃ ──► x₄

   PonderLM-2:   x₁ ──► z₁ ──► x₂ ──► z₂ ──► x₃ ──► z₃ ──► x₄ ──► z₄
                       z_i = latent thought emitted before predicting x_{i+1}
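
The loop in the diagram above can be illustrated with a toy sketch. This is not the model's actual implementation: `backbone` is a hypothetical single-matrix stand-in for the full Transformer stack, the sizes and random weights are made up, and attention over earlier positions is ignored. It only shows the control flow, in which the last hidden state is fed back as the next input before the next token is predicted.

```python
import math
import random

HIDDEN = 4  # toy hidden size (hypothetical; the real model uses 1024)
VOCAB = 6   # toy vocabulary size (hypothetical)

rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


EMBED = rand_matrix(VOCAB, HIDDEN)  # token id -> input vector
W = rand_matrix(HIDDEN, HIDDEN)     # stand-in for the whole Transformer stack
HEAD = rand_matrix(VOCAB, HIDDEN)   # hidden state -> vocabulary logits


def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def backbone(vec):
    """Toy forward pass: input vector -> last hidden state."""
    return [math.tanh(x) for x in matvec(W, vec)]


def predict_next(token_id, num_latents=1):
    """Return (predicted next-token id, number of forward passes used)."""
    hidden = backbone(EMBED[token_id])  # first pass consumes the token embedding
    passes = 1
    for _ in range(num_latents):
        # The last hidden state is the latent thought z; it is fed back
        # as the next input in place of a token embedding.
        hidden = backbone(hidden)
        passes += 1
    logits = matvec(HEAD, hidden)
    next_id = max(range(VOCAB), key=logits.__getitem__)
    return next_id, passes


next_id, passes = predict_next(token_id=2, num_latents=1)
print(f"predicted token id {next_id} after {passes} forward passes")
```

In this sketch, one latent thought per token means two forward passes per predicted token, which is the extra per-token compute the TL;DR describes; `num_latents=0` reduces to the vanilla case.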

Usage

The model ships with a custom modeling_gpt_neox.py that runs the pondering forward pass. Loading via AutoModelForCausalLM requires trust_remote_code=True:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

ckpt = "zeng123/PonderLM-2-Pythia-410m"
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(
    ckpt,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda()

prompt = "The mitochondria is "
out = model.generate(
    **tok(prompt, return_tensors="pt").to(model.device),
    max_new_tokens=64,
    use_cache=True,
)
print(tok.decode(out[0], skip_special_tokens=True))

Model details

Architecture: GPT-NeoX (Pythia family)
Parameters: 410M
Hidden size: 1024
Layers: 24
Attention heads: 16
Context length: 2048
Vocabulary: 50,304
Tokenizer: GPT-NeoX BPE (same as Pythia)
Precision: BF16
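
As a rough sanity check, the figures above can be combined into an approximate parameter count. The sketch below assumes the standard GPT-NeoX layout (biased linear layers, two LayerNorms per block, a 4x MLP expansion, and untied input and output embeddings) and assumes the pondering mechanism adds no parameters of its own; neither assumption is stated in this card, and `gpt_neox_param_count` is a helper written for illustration.

```python
def gpt_neox_param_count(hidden, layers, vocab):
    """Approximate parameter count for a standard GPT-NeoX model."""
    # Per block: QKV (3h^2 + 3h), attention output (h^2 + h),
    # MLP up (4h^2 + 4h), MLP down (4h^2 + h), two LayerNorms (4h).
    per_layer = 12 * hidden**2 + 13 * hidden
    # Untied input and output embedding matrices, no bias.
    embeddings = 2 * vocab * hidden
    # Final LayerNorm weight and bias.
    final_norm = 2 * hidden
    return layers * per_layer + embeddings + final_norm


total = gpt_neox_param_count(hidden=1024, layers=24, vocab=50304)
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the total comes to roughly 405M, in line with the nominal 410M size.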

Citation

@article{zeng2025ponderlm,
  title={PonderLM-2: Pretraining LLM with Latent Thoughts in Continuous Space},
  author={Zeng, Boyi and Li, He and Song, Shixiang and Wang, Yixuan and Wang, Zitong and He, Ziwei and Wang, Xinbing and Lin, Zhouhan},
  journal={arXiv preprint arXiv:2509.23184},
  year={2025}
}

Acknowledgements

Built on top of the Pythia training stack and LLaMA-Factory. The PonderLM baseline implementation is adapted from LUMIA-Group/PonderingLM.
