RWKV-7 147M Early Modern English (TCP-only)
A 147M-parameter RWKV-7 language model trained exclusively on Early Modern English text from the Text Creation Partnership (TCP) corpus: 60,172 documents spanning 1473–1800, drawn from EEBO (Early English Books Online), ECCO (Eighteenth Century Collections Online), and Evans-TCP.
This is the step-60,000 checkpoint, the point in training where the model is most purely Early Modern English: it has trained on ~1.97B tokens of TCP and no modern text, and its TCP held-out perplexity is at the run's global minimum (28.48). It serves as a standalone artifact for research into Early Modern English language modeling and as a baseline for cross-era analysis.
Checkpoint selection note. An earlier release of this model used step 67,017 (nominally the point where the TCP token budget is exhausted). Full per-checkpoint evaluation of the training run showed that by step 67,017 the model had already begun absorbing the Gutenberg corpus that follows TCP in training: relative to step 60,000, its Gutenberg perplexity had dropped from 182.6 to 108.1 while its TCP perplexity had risen from 28.48 to 41.34. Step 60,000 is therefore the cleaner, stronger purely Early Modern English checkpoint, and is the one published here.
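The selection rule in the note can be sketched in a few lines. This is a minimal illustration, not the project's evaluation script: `CheckpointEval` and `pick_tcp_only_checkpoint` are hypothetical names, and only the two checkpoints whose perplexities are quoted in the note are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckpointEval:
    step: int
    tcp_ppl: float        # held-out TCP perplexity
    gutenberg_ppl: float  # held-out Gutenberg perplexity


# Only the two checkpoints quoted in the note; the full run evaluated more.
EVALS = [
    CheckpointEval(step=60_000, tcp_ppl=28.48, gutenberg_ppl=182.6),
    CheckpointEval(step=67_017, tcp_ppl=41.34, gutenberg_ppl=108.1),
]


def pick_tcp_only_checkpoint(evals):
    """Return the checkpoint with the lowest held-out TCP perplexity."""
    return min(evals, key=lambda e: e.tcp_ppl)


best = pick_tcp_only_checkpoint(EVALS)
print(f"selected step {best.step:,} (TCP ppl {best.tcp_ppl})")
```

A falling Gutenberg perplexity alongside a rising TCP perplexity is the signal that later-era text has started to enter training, which is why the lowest TCP perplexity is used as the criterion here.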
Why this checkpoint exists
This project trains a generative model on a continuous timeline of English writing, from 15th-century manuscripts through 19th-century literature, with the goal of analyzing what linguistic structure emerges across eras without explicit supervision.
The TCP-only checkpoint captures the model's internal representation of English as it existed from before the printing press became widespread through the early Enlightenment: legal texts, sermons, pamphlets, poetry, scientific writing, and drama from the era of Shakespeare, Milton, and Newton, all in original spelling (loue, vp, hath, wee, &c.).
It is useful independently as:
- A generative model for Early Modern English text
- A baseline for measuring how a subsequent Gutenberg fine-tune shifts latent representations
- A probe for cross-era semantic drift: do "parliament" and "commonwealth" mean the same things in 1640 as in 1840?
Model details
| Property | Value |
|---|---|
| Architecture | RWKV-7 |
| Parameters | 147.4M |
| Vocab | RWKV World (65,536 tokens) |
| Context length | 1,024 tokens |
| dtype | bfloat16 |
| Training tokens | 1.97B (TCP only, ~90% of the 2.196B TCP budget) |
| Training steps | 60,000 |
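The step counts and token totals quoted in this card can be cross-checked with a little arithmetic. The sketch below infers the tokens consumed per training step from the 2.196B TCP budget and the 67,017-step exhaustion point mentioned in the checkpoint selection note. Reading that figure as a fixed number of full-length 1,024-token sequences per step is an assumption, not something the card states.

```python
TCP_BUDGET_TOKENS = 2.196e9      # total TCP token budget
BUDGET_EXHAUSTED_STEP = 67_017   # step at which the TCP budget nominally runs out
PUBLISHED_STEP = 60_000          # the checkpoint published here
CONTEXT_LENGTH = 1_024           # tokens per sequence

# Tokens consumed per step, inferred from the two figures above.
tokens_per_step = TCP_BUDGET_TOKENS / BUDGET_EXHAUSTED_STEP

# Assumption: every sequence is packed to the full context length.
sequences_per_step = tokens_per_step / CONTEXT_LENGTH

# Tokens seen by the published checkpoint, and its share of the budget.
tokens_seen = PUBLISHED_STEP * tokens_per_step
fraction_of_budget = tokens_seen / TCP_BUDGET_TOKENS

print(f"tokens per step:    {tokens_per_step:,.0f}")
print(f"sequences per step: {sequences_per_step:.2f}")
print(f"tokens at step {PUBLISHED_STEP:,}: {tokens_seen / 1e9:.2f}B")
print(f"share of TCP budget: {fraction_of_budget:.1%}")
```

Under these assumptions the figures are mutually consistent: roughly 32 sequences per step, about 1.97B tokens at step 60,000, and close to 90% of the TCP budget.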
Training corpus
TCP (Text Creation Partnership): textcreationpartnership.org
| Sub-corpus | Documents | Coverage |
|---|---|---|
| EEBO Phase I+II | 53,830 | English print 1473–1700 |
| ECCO | 2,473 | English print 1700–1800 |
| Evans-TCP | 5,012 | American imprints 1639–1800 |
| Total | 60,172 | after dedup |
Original spelling is preserved at the word level (loue, vp, hath); long s (ſ) is normalized to s. Genre breakdown: ~35% religious/theological, 14% pamphlets, 14% history, 9% poetry, 9% legal, 2.5% drama, remainder uncategorized.
Pipeline: VEP2 SimpleText → artifact cleaning → MinHash dedup (1.8% dropped) → RWKV World tokenization → 44 × 50M-token binary shards.
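To make the long-s normalization and MinHash deduplication steps concrete, here is a minimal standard-library sketch. It is not the project's pipeline code: the function names, the 5-word shingles, the 64 hash functions, and the 0.8 similarity threshold are all illustrative assumptions, and the pairwise comparison is quadratic, whereas a corpus of this size would normally use locality-sensitive hashing to find candidate pairs.

```python
from __future__ import annotations

import hashlib

NUM_HASHES = 64     # signature length (assumed value)
SHINGLE_SIZE = 5    # words per shingle (assumed value)
THRESHOLD = 0.8     # estimated-Jaccard cutoff for "duplicate" (assumed value)


def normalize(text: str) -> str:
    """Map long s (U+017F) to 's' and collapse whitespace; spelling is untouched."""
    return " ".join(text.replace("\u017f", "s").split())


def shingles(text: str, k: int = SHINGLE_SIZE) -> set[str]:
    """Return the set of overlapping k-word shingles of a document."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash(text: str, num_hashes: int = NUM_HASHES) -> list[int]:
    """MinHash signature: the minimum of each salted hash over all shingles."""
    doc_shingles = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # one salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s, digest_size=8, salt=salt).digest(), "little"
            )
            for s in doc_shingles
        ))
    return signature


def estimated_jaccard(a: list[int], b: list[int]) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def dedup(docs: list[str]) -> list[str]:
    """Keep each document unless it is a near-duplicate of one already kept."""
    kept, signatures = [], []
    for doc in docs:
        sig = minhash(normalize(doc))
        if all(estimated_jaccard(sig, s) < THRESHOLD for s in signatures):
            kept.append(doc)
            signatures.append(sig)
    return kept


docs = [
    "Ble\u017f\u017fed is the man that walketh not in the counsell of the vngodly",
    "Blessed is the man that walketh not in the counsell of the vngodly",
    "The parliament did this day resolve that the army be forthwith disbanded",
]
print(len(dedup(docs)), "of", len(docs), "documents kept")
```

The first two sample documents differ only in their use of long s, so they become identical after normalization and the second is dropped, while archaic spellings such as "counsell" and "vngodly" pass through unchanged.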
Evaluation
Perplexity on held-out shards (lower = better):
| Corpus | Loss | Perplexity |
|---|---|---|
| TCP eval (11.5M tokens) | 3.3493 | 28.48 |
| Gutenberg eval (19.7M tokens) | 5.2074 | 182.61 |
| Combined | 4.5214 | 91.97 |
The very high Gutenberg perplexity (182.6) is the point: at step 60,000 the model has effectively never seen post-1800 text. This is the cleanest Early Modern English baseline in the run; see the checkpoint selection note above, and the full-corpus model for a checkpoint trained on both eras.
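The loss and perplexity columns are related by perplexity = exp(loss). The sketch below reproduces the table's perplexities from its losses and recomputes the combined loss as a token-weighted mean. It assumes the losses are mean per-token cross-entropy in nats and that the combined row is weighted by token count; both readings are consistent with the numbers but are not stated explicitly in the card.

```python
import math

# (mean cross-entropy loss, evaluation tokens) taken from the table above
EVAL = {
    "TCP": (3.3493, 11.5e6),
    "Gutenberg": (5.2074, 19.7e6),
}


def perplexity(loss: float) -> float:
    """Perplexity is the exponential of the mean per-token loss (in nats)."""
    return math.exp(loss)


def combined_loss(results: dict) -> float:
    """Token-weighted mean loss across evaluation sets."""
    total_tokens = sum(tokens for _, tokens in results.values())
    return sum(loss * tokens for loss, tokens in results.values()) / total_tokens


for name, (loss, _) in EVAL.items():
    print(f"{name}: loss {loss:.4f}, perplexity {perplexity(loss):.2f}")

combined = combined_loss(EVAL)
print(f"Combined: loss {combined:.4f}, perplexity {perplexity(combined):.2f}")
```

Because the token counts in the table are rounded, the recomputed combined loss lands close to, but not exactly on, the reported 4.5214.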
Usage
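```python
# pip install rwkv
# Depending on the rwkv package version, RWKV-7 checkpoints may require
# os.environ["RWKV_V7_ON"] = "1" to be set before importing rwkv.model.
from rwkv.model import RWKV
from rwkv.utils import PIPELINE, PIPELINE_ARGS

# Load the checkpoint in bfloat16 on the GPU.
model = RWKV(model="rwkv7-147m-early-modern-english/model.pth", strategy="cuda bf16")
pipeline = PIPELINE(model, "rwkv_vocab_v20230424")

# Sampling settings are passed via PIPELINE_ARGS, not as keyword arguments to generate().
args = PIPELINE_ARGS(temperature=1.0, top_p=0.85)

print(pipeline.generate(
    "The parliament did this day resolve",
    token_count=200,
    args=args,
))
```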
Expected output style: archaic spelling, period-appropriate vocabulary and syntax.
Research context
This model is part of a larger project training a generative model on the full span of written English, from 15th-century manuscripts through 19th-century literature, as a foundation for cross-linguistic and cross-era latent space analysis.
The longer-term research direction is to extend the corpus to include ancient language transliterations (Akkadian cuneiform, Sumerian, Egyptian hieroglyphs) alongside English, and to probe what cross-linguistic structure the model discovers without explicit translation pairs. Standard multilingual training uses parallel text; this approach intentionally does not, because the emergent associations are the research output.
Adjacent work: LogogramNLP (ACL 2024), an NLP benchmark for ancient logographic writing systems.
Full corpus checkpoint
The model trained on TCP + Gutenberg English (~6.55B tokens total) is available at:
geoppls/rwkv7-147m-vintage-english
Citation
@misc{rwkv7-early-modern-english,
author = {geoppls},
title = {RWKV-7 147M Early Modern English (TCP-only)},
year = {2026},
url = {https://huggingface.co/geoppls/rwkv7-147m-early-modern-english}
}