RWKV-7 147M: Early Modern English (TCP-only)

A 147M-parameter RWKV-7 language model trained exclusively on Early Modern English text from the Text Creation Partnership (TCP) corpus: 60,172 documents spanning 1473–1800, drawn from EEBO (Early English Books Online), ECCO (Eighteenth Century Collections Online), and Evans-TCP.

This is the step-60,000 checkpoint, the point in training where the model is most purely Early Modern English: it has trained on ~1.97B tokens of TCP and no modern text, and its TCP held-out perplexity is at the run's global minimum (28.48). It is intended as a standalone artifact for research into Early Modern English language modeling and as a baseline for cross-era analysis.

Checkpoint selection note. An earlier release of this model used step 67,017 (nominally the point where the TCP token budget is exhausted). Full per-checkpoint evaluation of the training run showed that by step 67,017 the model had already begun absorbing the Gutenberg corpus that follows TCP in training: relative to step 60,000, its Gutenberg perplexity had dropped from 182.6 to 108.1 while its TCP perplexity had risen from 28.48 to 41.34. Step 60,000 is therefore the cleaner, stronger purely-Early-Modern-English checkpoint, and is the one published here.
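The selection rule in the note above can be sketched in a few lines. This is only an illustration: it uses the two checkpoints whose numbers are quoted in this card (the full per-checkpoint sweep is not reproduced here), and the names `checkpoints` and `select_tcp_checkpoint` are hypothetical.

```python
# Held-out perplexities quoted in the checkpoint selection note.
# Only two checkpoints are listed in this card; a full sweep would have more rows.
checkpoints = [
    {"step": 60_000, "tcp_ppl": 28.48, "gutenberg_ppl": 182.6},
    {"step": 67_017, "tcp_ppl": 41.34, "gutenberg_ppl": 108.1},
]


def select_tcp_checkpoint(rows):
    """Return the checkpoint with the lowest TCP held-out perplexity."""
    return min(rows, key=lambda row: row["tcp_ppl"])


best = select_tcp_checkpoint(checkpoints)
print(best["step"])  # 60000
```

Choosing on TCP perplexity alone is the simplest criterion; the rising TCP perplexity and falling Gutenberg perplexity at step 67,017 both point the same way.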


Why this checkpoint exists

This project trains a generative model on a continuous timeline of English writing, from 15th-century manuscripts through 19th-century literature, with the goal of analyzing what linguistic structure emerges across eras without explicit supervision.

The TCP-only checkpoint captures the model's internal representation of English from the period before the printing press became widespread through the early Enlightenment: legal texts, sermons, pamphlets, poetry, scientific writing, and drama from the era of Shakespeare, Milton, and Newton, all in original spelling (loue, vp, hath, wee, &c.).

It is useful independently as:

  • A generative model for Early Modern English text
  • A baseline for measuring how a subsequent Gutenberg fine-tune shifts latent representations
  • A probe for cross-era semantic drift: do "parliament" and "commonwealth" mean the same things in 1640 as in 1840?

Model details

Architecture: RWKV-7
Parameters: 147.4M
Vocab: RWKV World (65,536 tokens)
Context length: 1,024 tokens
dtype: bfloat16
Training tokens: 1.97B (TCP only, ~90% of the 2.196B TCP budget)
Training steps: 60,000
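The token and step figures above are consistent with a fixed number of tokens per optimizer step. The sketch below assumes 32 sequences of 1,024 tokens per step (32,768 tokens); that batch size is inferred from the numbers in this card, not stated by it, and `tokens_seen` is a hypothetical helper.

```python
CONTEXT_LENGTH = 1_024        # from the model details above
SEQUENCES_PER_STEP = 32       # assumption: inferred from the totals, not stated in the card
TOKENS_PER_STEP = CONTEXT_LENGTH * SEQUENCES_PER_STEP  # 32,768

TCP_BUDGET_TOKENS = 2.196e9   # from the model details above


def tokens_seen(step: int) -> int:
    """Tokens consumed after `step` optimizer steps at a constant batch size."""
    return step * TOKENS_PER_STEP


seen_60k = tokens_seen(60_000)
fraction = seen_60k / TCP_BUDGET_TOKENS
print(f"{seen_60k / 1e9:.2f}B tokens, {fraction:.0%} of the TCP budget")
# prints "1.97B tokens, 90% of the TCP budget"
```

Under the same assumption, `tokens_seen(67_017)` comes to about 2.196B, which matches the description of step 67,017 as the nominal end of the TCP budget.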

Training corpus

TCP (Text Creation Partnership): textcreationpartnership.org

Sub-corpus | Documents | Coverage
EEBO Phase I+II | 53,830 | English print 1473–1700
ECCO | 2,473 | English print 1700–1800
Evans-TCP | 5,012 | American imprints 1639–1800
Total (after dedup) | 60,172 |

Original spelling preserved at word level (loue, vp, hath). Long s (ſ) normalized to s. Genre breakdown: ~35% religious/theological, 14% pamphlets, 14% history, 9% poetry, 9% legal, 2.5% drama, remainder uncategorized.
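As a small illustration of the normalization described above, the sketch below replaces the long s with a round s and leaves every other spelling untouched. The function name and the sample sentence are made up for this example; the actual cleaning code is not shown in this card.

```python
LONG_S = "\u017f"  # LATIN SMALL LETTER LONG S


def normalize_long_s(text: str) -> str:
    """Replace long s with round s; u/v, doubled vowels, etc. are left as printed."""
    return text.replace(LONG_S, "s")


# Illustrative sentence, not taken from the corpus.
sample = "Wee \u017fhall loue and \u017ferue, as he hath \u017faid."
print(normalize_long_s(sample))
# prints "Wee shall loue and serue, as he hath said."
```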

Pipeline: VEP2 SimpleText → artifact cleaning → MinHash dedup (1.8% dropped) → RWKV World tokenization → 44 × 50M-token binary shards.
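For readers unfamiliar with the MinHash step in the pipeline above, here is a minimal pure-Python sketch of near-duplicate removal. Everything specific in it is an assumption made for illustration: the word 3-gram shingles, the 64-value signature, the 0.8 similarity threshold, and the use of salted BLAKE2b as the hash family. The card does not state which parameters or library the real pipeline used.

```python
import hashlib

NUM_HASHES = 64     # assumption: signature length
SHINGLE_WORDS = 3   # assumption: word 3-gram shingles
THRESHOLD = 0.8     # assumption: estimated Jaccard at or above which a doc is dropped


def shingles(text, k=SHINGLE_WORDS):
    """Set of lowercase word k-grams (one short shingle if the text has fewer than k words)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def _hash(seed, shingle):
    """64-bit hash of a shingle; the seed is used as a salt to get independent hash functions."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text):
    """MinHash signature: the minimum hash value under each of NUM_HASHES seeds."""
    doc_shingles = shingles(text)
    return [min(_hash(seed, s) for s in doc_shingles) for seed in range(NUM_HASHES)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of shingle-set Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedup(docs):
    """Keep a document only if it is not a near-duplicate of one already kept."""
    kept, signatures = [], []
    for doc in docs:
        sig = minhash(doc)
        if all(estimated_jaccard(sig, other) < THRESHOLD for other in signatures):
            kept.append(doc)
            signatures.append(sig)
    return kept
```

The pairwise comparison in `dedup` is quadratic in the number of documents, which is fine for a demonstration; at the scale of tens of thousands of documents, MinHash is usually paired with locality-sensitive hashing so that only candidate pairs are compared.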


Evaluation

Perplexity on held-out shards (lower = better):

Corpus | Loss | Perplexity
TCP eval (11.5M tokens) | 3.3493 | 28.48
Gutenberg eval (19.7M tokens) | 5.2074 | 182.61
Combined | 4.5214 | 91.97

The very high Gutenberg perplexity (182.6) is the point: at step 60,000 the model has effectively never seen post-1800 text. This is the cleanest Early Modern English baseline in the run; see the checkpoint-selection note above, and the full-corpus model for a checkpoint trained on both eras.
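The loss and perplexity columns above are related by perplexity = exp(loss), with loss as mean cross-entropy in nats per token. The sketch below checks that relationship against the reported numbers. It also assumes the combined row is a token-weighted average of the two evaluation sets, which is not stated in this card; with the rounded token counts given, that assumption reproduces the reported combined loss only approximately.

```python
import math

# (mean loss in nats per token, evaluation tokens) as reported in the table above
TCP = (3.3493, 11.5e6)
GUTENBERG = (5.2074, 19.7e6)


def perplexity(loss: float) -> float:
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(loss)


def combined_loss(*rows):
    """Token-weighted mean loss (assumption: how the combined row was computed)."""
    total_tokens = sum(tokens for _, tokens in rows)
    return sum(loss * tokens for loss, tokens in rows) / total_tokens


tcp_ppl = perplexity(TCP[0])
gutenberg_ppl = perplexity(GUTENBERG[0])
combined = combined_loss(TCP, GUTENBERG)

print(f"TCP perplexity:       {tcp_ppl:.2f}")
print(f"Gutenberg perplexity: {gutenberg_ppl:.2f}")
print(f"Combined loss:        {combined:.4f} (reported: 4.5214)")
```

Small differences in the last digit are expected, since the losses and token counts in the table are themselves rounded.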


Usage

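# pip install rwkv
import os

# RWKV-7 support in the rwkv package is switched on by an environment
# variable, which must be set before rwkv.model is imported.
os.environ["RWKV_V7_ON"] = "1"

from rwkv.model import RWKV
from rwkv.utils import PIPELINE, PIPELINE_ARGS

# Some rwkv releases append ".pth" to the path themselves; if loading fails,
# try the same path without the extension.
model = RWKV(model="rwkv7-147m-early-modern-english/model.pth", strategy="cuda bf16")
pipeline = PIPELINE(model, "rwkv_vocab_v20230424")

# Sampling settings go through PIPELINE_ARGS, not keyword arguments to generate().
args = PIPELINE_ARGS(temperature=1.0, top_p=0.85)

print(pipeline.generate(
    "The parliament did this day resolve",
    token_count=200,
    args=args,
))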

Expected output style: archaic spelling, period-appropriate vocabulary and syntax.


Research context

This model is part of a larger project training a generative model on the full span of written English, from 15th-century manuscripts through 19th-century literature, as a foundation for cross-linguistic and cross-era latent space analysis.

The longer-term research direction: extend the corpus to include ancient language transliterations (Akkadian cuneiform, Sumerian, Egyptian hieroglyphs) alongside English, and probe what cross-linguistic structure the model discovers without explicit translation pairs. Multilingual training often relies on parallel text; this approach intentionally does not, because the emergent associations are the research output.

Adjacent work: LogogramNLP (ACL 2024), an NLP benchmark for ancient logographic writing systems.


Full corpus checkpoint

The model trained on TCP + Gutenberg English (~6.55B tokens total) is available at: geoppls/rwkv7-147m-vintage-english


Citation

@misc{rwkv7-early-modern-english,
  author = {geoppls},
  title  = {RWKV-7 147M Early Modern English (TCP-only)},
  year   = {2026},
  url    = {https://huggingface.co/geoppls/rwkv7-147m-early-modern-english}
}