RWKV-7 147M — Early Modern English (TCP-only)

A 147M parameter RWKV-7 language model trained exclusively on Early Modern English text from the Text Creation Partnership (TCP) corpus: 60,172 documents spanning 1473–1800, including EEBO (Early English Books Online), ECCO, and Evans-TCP.

This is the step-60,000 checkpoint, the point in training where the model is most purely Early Modern English: it has trained on ~1.97B tokens of TCP and no modern text, and its TCP held-out perplexity is at the run's global minimum (28.48). It is intended as a standalone artifact for research into Early Modern English language modeling and as a baseline for cross-era analysis.

Checkpoint selection note. An earlier release of this model used step 67,017, nominally the point where the TCP token budget is exhausted. Full per-checkpoint evaluation of the training run showed that by step 67,017 the model had already begun absorbing the Gutenberg corpus that follows TCP in the training schedule: relative to step 60,000, its Gutenberg perplexity had dropped from 182.6 to 108.1 while its TCP perplexity had risen from 28.48 to 41.34. Step 60,000 is therefore the cleaner and stronger purely Early Modern English checkpoint, and it is the one published here.
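The selection rule described in the note can be sketched as follows. This is a minimal illustration, not the project's evaluation code: only the two checkpoints whose numbers are reported in this card are listed, and the names CheckpointEval and purest_tcp_checkpoint are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckpointEval:
    step: int
    tcp_ppl: float        # held-out TCP perplexity
    gutenberg_ppl: float  # held-out Gutenberg perplexity


# Only the two checkpoints reported in this card; the full run has more.
REPORTED = [
    CheckpointEval(step=60_000, tcp_ppl=28.48, gutenberg_ppl=182.6),
    CheckpointEval(step=67_017, tcp_ppl=41.34, gutenberg_ppl=108.1),
]


def purest_tcp_checkpoint(evals):
    """Pick the checkpoint with the lowest TCP held-out perplexity."""
    return min(evals, key=lambda e: e.tcp_ppl)


best = purest_tcp_checkpoint(REPORTED)
later = REPORTED[1]

# Relative change between the two checkpoints, as ratios.
tcp_ratio = later.tcp_ppl / best.tcp_ppl
gutenberg_ratio = later.gutenberg_ppl / best.gutenberg_ppl

print(f"selected step: {best.step}")
print(f"TCP perplexity ratio (67,017 vs 60,000): {tcp_ratio:.2f}")
print(f"Gutenberg perplexity ratio (67,017 vs 60,000): {gutenberg_ratio:.2f}")
```

A TCP ratio above 1 together with a Gutenberg ratio below 1 is the signature of the later checkpoint drifting toward the second corpus.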

Why this checkpoint exists

This project trains a generative model on a continuous timeline of English writing — from 15th-century manuscripts through 19th-century literature — with the goal of analyzing what linguistic structure emerges across eras without explicit supervision.

The TCP-only checkpoint captures the model's internal representation of English from the earliest English printed books (1473) through the 18th century: legal texts, sermons, pamphlets, poetry, scientific writing, and drama from the era of Shakespeare, Milton, and Newton, in original spelling (loue, vp, hath, wee, &c.).

It is useful independently as:

A generative model for Early Modern English text
A baseline for measuring how a subsequent Gutenberg fine-tune shifts latent representations
A probe for cross-era semantic drift: do "parliament" and "commonwealth" mean the same things in 1640 as in 1840?

Model details


Architecture	RWKV-7
Parameters	147.4M
Vocab	RWKV World (65,536 tokens)
Context length	1,024 tokens
dtype	bfloat16
Training tokens	1.97B (TCP only, ~90% of the 2.196B TCP budget)
Training steps	60,000
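The token and step counts above can be cross-checked with a little arithmetic. The sketch below assumes a constant number of tokens per optimizer step, which this card does not state; the per-step batch it derives is an inference from the reported figures, not a documented hyperparameter.

```python
# Figures reported in this card.
TCP_BUDGET_TOKENS = 2.196e9   # full TCP token budget
BUDGET_END_STEP = 67_017      # step at which the TCP budget is nominally exhausted
PUBLISHED_STEP = 60_000       # this checkpoint
CONTEXT_LEN = 1_024           # tokens per sequence

# Assumption: every step consumes the same number of tokens.
tokens_per_step = round(TCP_BUDGET_TOKENS / BUDGET_END_STEP)
sequences_per_step = tokens_per_step / CONTEXT_LEN

tokens_seen = PUBLISHED_STEP * tokens_per_step
budget_fraction = tokens_seen / TCP_BUDGET_TOKENS

print(f"tokens per step:    {tokens_per_step:,}")
print(f"sequences per step: {sequences_per_step:g}")
print(f"tokens at step {PUBLISHED_STEP:,}: {tokens_seen / 1e9:.2f}B")
print(f"share of TCP budget: {budget_fraction:.1%}")
```

Under that assumption the numbers are mutually consistent: roughly 1.97B tokens at step 60,000, or about 90% of the TCP budget.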

Training corpus

TCP (Text Creation Partnership) — textcreationpartnership.org

Sub-corpus	Documents	Coverage
EEBO Phase I+II	53,830	English print 1473–1700
ECCO	2,473	English print 1700–1800
Evans-TCP	5,012	American imprints 1639–1800
Total	60,172	after dedup

Original spelling preserved at word level (loue, vp, hath). Long-ſ normalized to s. Genre breakdown: ~35% religious/theological, 14% pamphlets, 14% history, 9% poetry, 9% legal, 2.5% drama, remainder uncategorized.
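As an illustration of what this normalization does and does not touch, here is a minimal sketch. It assumes the long s appears as the single code point U+017F; the sample sentence and the function name normalize_long_s are made up for the example, and the project's actual cleaning code may handle additional cases such as ligatures.

```python
LONG_S = "\u017f"  # LATIN SMALL LETTER LONG S


def normalize_long_s(text: str) -> str:
    """Replace long s with a plain 's'; all other spelling is left untouched."""
    return text.replace(LONG_S, "s")


sample = "The Parliament hath re\u017folued to \u017fend vp their an\u017fwer"
normalized = normalize_long_s(sample)
print(normalized)
# → The Parliament hath resolued to send vp their answer
```

Period spellings such as "resolued" and "vp" survive unchanged; only the glyph variant is folded.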

Pipeline: VEP2 SimpleText → artifact cleaning → MinHash dedup (1.8% dropped) → RWKV World tokenization → 44 × 50M-token binary shards.
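The MinHash step in this pipeline can be sketched in a few lines of standard-library Python. This is a generic illustration of the technique, not the project's implementation: the shingle size, the number of hash functions, and the use of salted BLAKE2b are all assumptions chosen for the example, and the card does not state the similarity threshold used for dropping documents.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set:
    """Return the set of overlapping k-word shingles in a document."""
    words = text.split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items: set, num_perm: int = 128) -> list:
    """For each of num_perm salted hash functions, keep the minimum hash value."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the parliament did this day resolve that the army should be disbanded forthwith"
doc_b = "the parliament did this day resolve that the army should be payed forthwith"
doc_c = "of mans first disobedience and the fruit of that forbidden tree whose mortal tast"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))

print("a vs a:", estimated_jaccard(sig_a, sig_a))
print("a vs b:", estimated_jaccard(sig_a, sig_b))
print("a vs c:", estimated_jaccard(sig_a, sig_c))
```

In a dedup pass, document pairs whose estimated similarity exceeds a chosen threshold are treated as near-duplicates and one of them is dropped. At corpus scale the signatures are usually bucketed with locality-sensitive hashing rather than compared pairwise.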

Evaluation

Perplexity on held-out shards (lower = better):

Corpus	Loss	Perplexity
TCP eval (11.5M tokens)	3.3493	28.48
Gutenberg eval (19.7M tokens)	5.2074	182.61
Combined	4.5214	91.97

The very high Gutenberg perplexity (182.6) is by design: at step 60,000 the model has effectively never seen post-1800 text. This is the cleanest Early Modern English baseline in the run. See the checkpoint-selection note above for the reasoning, and the full-corpus model for a checkpoint trained on both eras.
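The relationship between the Loss and Perplexity columns can be checked directly: perplexity is the exponential of the mean per-token cross-entropy loss (in nats). The sketch below also recomputes the combined row as a token-weighted average of the two losses. That weighting is an assumption about how the combined figure was produced, and because the token counts in the table are rounded, the result matches the reported value only approximately.

```python
import math

# (mean loss in nats, number of eval tokens), as reported in the table above.
EVALS = {
    "tcp": (3.3493, 11.5e6),
    "gutenberg": (5.2074, 19.7e6),
}


def perplexity(loss: float) -> float:
    """Perplexity is exp(mean cross-entropy loss)."""
    return math.exp(loss)


total_tokens = sum(n for _, n in EVALS.values())
# Assumption: the combined loss is the token-weighted mean of the per-corpus losses.
combined_loss = sum(loss * n for loss, n in EVALS.values()) / total_tokens

for name, (loss, _) in EVALS.items():
    print(f"{name}: loss={loss:.4f} perplexity={perplexity(loss):.2f}")
print(f"combined: loss={combined_loss:.4f} perplexity={perplexity(combined_loss):.2f}")
```

The per-corpus perplexities reproduce the table to two decimal places. The recomputed combined loss lands within about 0.001 of the reported 4.5214.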

Usage

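# pip install rwkv
import os

# Assumption: recent releases of the rwkv package select the RWKV-7 code path
# through this flag, which must be set before rwkv.model is imported.
os.environ["RWKV_V7_ON"] = "1"

from rwkv.model import RWKV
from rwkv.utils import PIPELINE, PIPELINE_ARGS

# If loading fails, try the path without the ".pth" extension:
# some rwkv versions append it themselves.
model = RWKV(model="rwkv7-147m-early-modern-english/model.pth", strategy="cuda bf16")
pipeline = PIPELINE(model, "rwkv_vocab_v20230424")

# Sampling settings go in PIPELINE_ARGS, not as keyword arguments to generate().
args = PIPELINE_ARGS(temperature=1.0, top_p=0.85)

print(pipeline.generate(
    "The parliament did this day resolve",
    token_count=200,
    args=args,
))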

Expected output style: archaic spelling, period-appropriate vocabulary and syntax.

Research context

This model is part of a larger project training a generative model on the full span of written English — from 15th-century manuscripts through 19th-century literature — as a foundation for cross-linguistic and cross-era latent space analysis.

The longer-term research direction: extend the corpus to include ancient language transliterations (Akkadian cuneiform, Sumerian, Egyptian hieroglyphs) alongside English, and probe what cross-linguistic structure the model discovers without explicit translation pairs. Standard multilingual training uses parallel text; this approach intentionally does not — the emergent associations are the research output.

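Adjacent work: LogogramNLP: Comparing Visual and Textual Representations of Ancient Logographic Writing Systems for NLP (ACL 2024, arXiv:2408.04628), an NLP benchmark for ancient logographic writing systems.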

Full corpus checkpoint

The model trained on TCP + Gutenberg English (~6.55B tokens total) is available at: geoppls/rwkv7-147m-vintage-english

Citation

@misc{rwkv7-early-modern-english,
  author = {geoppls},
  title  = {RWKV-7 147M Early Modern English (TCP-only)},
  year   = {2026},
  url    = {https://huggingface.co/geoppls/rwkv7-147m-early-modern-english}
}
