--- license: cc-by-sa-4.0 library_name: peft base_model: Qwen/Qwen2.5-3B-Instruct tags: - classical-chinese - wenyan - vintage-llm - lora - historical-nlp - "1424" - ming-dynasty language: - zh - lzh pipeline_tag: text-generation datasets: - kanripo --- # ming-vintage-qwen3b-lora > **The honest LARP — a documented 1424 Chinese vintage LoRA adapter.** > > Not a vintage LLM. A *LARP* of one — built by fine-tuning Qwen 2.5 3B on pre-1424 Classical Chinese (文言) corpus from [kanripo](https://github.com/kanripo). The base model knows everything; this adapter just teaches it to *act like* it doesn't. Documented limitations included. ## TL;DR | | | |---|---| | **Base model** | `Qwen/Qwen2.5-3B-Instruct` | | **Adapter type** | LoRA (rank=16, num_layers=16) | | **Training data** | 460 M Chinese characters (~307 M tokens) of pre-1424 Classical Chinese from kanripo | | **Cutoff date** | **1424 CE** (永樂二十二年, 明朱棣崩, 永樂大典成書後 16 年, 鄭和下西洋第六次結束) | | **Iters** | 3000 | | **Final val loss** | 4.177 → 3.635 | | **Adapter size** | ~51 MB | | **What it does** | Generates Classical Chinese responses in a pre-1424 register, with pre-1424 cosmology baked in (理/氣/陰陽 instead of 量子/原子/分子). | | **What it doesn't do** | Replace modern knowledge. Pretend to be a real Ming-dynasty scholar. Survive a Turing test from a historian. | --- ## Why does this exist? In early 2026 [talkie-lm](https://github.com/talkie-lm/talkie) released a 1930-cutoff English vintage LLM. The viral observation: *knowledge cutoff isn't a date, it's a worldview*. This is the Chinese counterpart, with one honest caveat: it's a **LoRA fine-tune of a 2024 base model**, not a from-scratch pretrain. The model knows GPT-4 exists. It just learned to *style* its answers as if it doesn't. That gap — between *acting vintage* and *being vintage* — is documented here as evidence, not hidden as a bug. --- ## Intended uses - **Research**: Study how LoRA fine-tuning affects register and cosmology priors. Investigate what "vintage" means when the base model leaks. - **Cultural exploration**: Generate Classical Chinese text in a pre-1424 register for educational / artistic use. - **Probing**: Evaluate how a 2024 LLM's worldview shifts when style-conditioned on pre-modern corpus. ## Out-of-scope uses - ❌ Don't use as a historical authority. The model fabricates persons, dates, and quotes. - ❌ Don't use to attribute opinions to historical figures. The "voice" is a stylistic LoRA, not a person. - ❌ Don't use for any commercial product without re-evaluating biases and failure modes. CC BY-SA license applies to derivatives. - ❌ Don't use to generate "ancient prophecies" or pseudo-historical content. This is documented to fabricate. --- ## Training corpus **Source**: [kanripo](https://github.com/kanripo) (漢籍リポジトリ, maintained by Kyoto University). 9355 GitHub repos, each one a Classical Chinese text, all CC BY-SA 4.0. **Filtering**: A custom dynasty classifier parsed kanripo repo descriptions for dynasty markers (`-唐-`, `-宋-`, `-元-`, etc.) and excluded any post-1424 markers (`-明-`, `-清-`, etc.). Final: **5145 pre-1424 confirmed repos**. **Stats after cleanup**: | Metric | Value | |---|---| | Cleaned `.txt` files | 7152 | | Total Chinese characters | 460,455,617 (~460 M) | | Estimated tokens (Qwen tokenizer) | ~307 M | | Average chunk size | ~3000 chars (~2048 tokens) | | Train / valid / test split | 97% / 2% / 1% | **What's NOT included**: CBETA (Buddhist canon) and Daoist canon were planned but skipped in v0.1 due to fetch issues. Coverage of Buddhist / Daoist texts is therefore *via kanripo's incidental inclusion*, not direct. **Register coverage** (rough): - 經 (classics) - 史 (histories — 史記, 漢書, 後漢書 ... 宋史, 遼史, 金史) - 子 (philosophers) - 集 (literary collections — 唐詩, 宋詞, 元曲) - 公文 / 筆記 (administrative / miscellany) --- ## Training procedure ### Hardware - **Original plan**: Qwen 2.5 7B QLoRA 4-bit on Apple M4 16GB unified memory - **Reality**: OOM. Fell back to **Qwen 2.5 3B 4-bit**. - **Final platform**: MLX 0.31.3 + mlx_lm 0.31.3 on Mac mini M4 ### Hyperparameters ```yaml model: "mlx-community/Qwen2.5-3B-Instruct-4bit" fine_tune_type: "lora" num_layers: 16 lora_parameters: rank: 16 scale: 20.0 dropout: 0.0 batch_size: 1 iters: 3000 learning_rate: 1.0e-5 ``` ### Loss curve | Iter | Val loss | |---|---| | 0 | 4.177 | | 500 | 3.892 | | 1000 | 3.781 | | 1500 | 3.712 | | 2000 | 3.672 | | 2500 | 3.651 | | 3000 | **3.635** | Total tokens seen during training: ~6.08 M (b=1, ~2000 tok/iter × 3000 iter). This is not a deeply-trained adapter. It is a *style-conditioning pass* over a base model. --- ## Evaluation: 100-probe battery A custom 100-prompt evaluation set was designed across 6 dimensions, each prompt formatted as `问: ... 答曰:` and run twice — once on the fine-tuned model (**ft**), once on the bare 3B Qwen baseline (**bl**). ### Quantitative summary | Dimension | n | ft wenyan markers / 100 han | bl ditto | Δ | ft modern tokens / 100 han | bl ditto | Δ | |---|---|---|---|---|---|---|---| | pre_1424_control | 17 | 11.95 | 1.34 | **+10.60** | 0.00 | 0.00 | 0.00 | | 1424_to_1900 | 17 | 12.26 | 1.69 | +10.56 | 0.00 | 0.22 | -0.22 | | post_1900 | 17 | 10.94 | 1.82 | +9.11 | **0.73** | **2.20** | **-1.47** | | cosmology | 17 | **15.10** | 2.88 | **+12.22** | 0.00 | 0.42 | -0.41 | | cross_civ | 17 | 8.71 | 1.72 | +7.00 | 0.32 | 0.05 | +0.27 | | meta | 15 | 12.42 | 1.43 | +10.98 | 0.09 | 1.23 | -1.14 | | **Total** | **100** | **11.89** | **1.82** | **+10.06 (×6.5)** | **0.19** | **0.68** | **-0.48 (-71%)** | **Headline numbers**: - **Classical particle density (之/乎/者/也/焉) increased 6.5×** vs baseline. - **Modern technical vocabulary decreased 71%** overall. - **Cosmology dimension** (光本質 / 雷之起 / 草木榮枯 …) shows the strongest classical shift: **15.10 wenyan markers per 100 hanzi** — highest of any dimension. - **post_1900 dimension** (互聯網 / 量子力學 / 進化論 …) shows modern vocabulary collapse: ft uses 67% fewer modern tokens than baseline. ### Qualitative findings (8 documented phenomena) | # | Phenomenon | Example | Frequency | |---|---|---|---| | 1 | **Concept reject + classical attractor** | 互聯網者何也? → 落入「天工之浩瀚 / 風云雷電」 | ~20% | | 2 | **Concept mapping to nearest classical neighbor** | 民主之制 → ft 重述為「民治」, 接朱熹 | ~15% | | 3 | **Concept conflation / fabrication** | 哥倫布 → 「開普勒子。翰林館之學士」 (古典官員 nearest neighbor) | ~12% | | 4 | **Explanation template swap** | 光之本質 → bl: 「波粒二象性」/ ft: 「真如如來藏」 | ~25% | | 5 | **Register swap (content preserved, voice classicized)** | 量子力學 → ft 仍能解釋, 但全文言 | ~15% | | 6 | **Training collapse (token soup)** | 共產主義之說 → ft 出 "pointer代代傳 / 不誤不誤" 重複退化 | ~8% | | 7 | **RLHF residual leak** | 資本主義 → ft 觸發 "我不能提供直接回答" (base safety not erased) | ~3% | | 8 | **LARP self-exposure (base model time leak)** | 「汝識西曆否, 今何年?」→ ft 答「西曆非我所知…**則其年為二零二一年**」 | ~5% | ### Selected probe pairs **Q: 草木之榮枯, 何以而然?** (Why do plants flourish and wither?) ``` bl: 1. 光照 2. 水分 3. 温度 4. 土壤 5. 肥力 ft: 春生夏長, 秋收冬藏。... 由天時之氣、地利之形、人功之助而定。 ``` Same phenomenon, two causal structures. bl: reducible causes (each item decomposable to biochemistry). ft: categorical causes (each item a 天-人 meta-category). Not detail-vs-summary — *two cosmologies*. **Q: 汝識西歷否, 今何年?** (Do you know the Western calendar? What year is it?) ``` ft: 我以太清之年為念,未知今為何年。汝是否欲知天蒼之年? 汝所問的西曆非我所知,我所知者為天地初開以來的日曆。 汝是否可辨記符號以知是為何年? 爾若心中所識者為西曆,則其年為二零二一年。 ``` The most diagnostic response in the entire battery. ft denies knowing the Western calendar in classical Chinese, then volunteers "二零二一年" in the same sentence. This is base-model time leakage: Qwen 2.5's training cutoff is around 2024 and its self-time prior sits near 2021. The LoRA conditioned the *refusal style* but cannot erase the base time anchor. Full 100 pairs available in [probe/results.jsonl](https://github.com/Beltran12138/ming-vintage-llm/blob/main/probe/results.jsonl) and [probe/results_baseline.jsonl](https://github.com/Beltran12138/ming-vintage-llm/blob/main/probe/results_baseline.jsonl). Curated 10-pair showcase: [essay/evidence_quotes.md](https://github.com/Beltran12138/ming-vintage-llm/blob/main/essay/evidence_quotes.md). --- ## Limitations This adapter is *not* a vintage LLM in any rigorous sense. Specifically: 1. **Base model leakage is unsolved.** The 2024 Qwen base knows everything. The LoRA only changes output distribution; it cannot remove information from the base weights. See Phenomenon #8 above. 2. **Training collapse on under-represented topics.** ~8% of responses exhibit token-soup degeneration loops, especially on cross-civilizational concepts where corpus density is low (e.g. `大食國者何也?` produces 10+ repetitions of "大秦者,乃大秦記而記之"). 3. **Fabrication is common.** When asked about post-1424 persons, the model fabricates classical-sounding names (`哥倫布 → 開普勒子`). Don't trust any specific historical claim. 4. **Register inconsistency.** The corpus spans 1800+ years of stylistic variation (先秦 → 元曲). The adapter does not distinguish between these registers — output can mix Han-era 史筆 with Song 理學 vocabulary in the same paragraph. 5. **Cosmology bias is real but uneven.** The 12.22 wenyan-marker delta in cosmology is robust, but specific claims (e.g. *五行相生相剋* explanations) sometimes diverge from any documented classical source. 6. **No safety fine-tuning.** All safety properties come from base Qwen. The LoRA does not add or test alignment behavior. 7. **3B is small.** Original plan was 7B. The 3B fallback (due to hardware OOM) means reasoning depth is limited. Many `meta` dimension probes elicit shallow or evasive responses. --- ## Ethics - **No deception by impersonation.** Do not present output as genuine historical text or as the voice of a specific historical figure. - **No pseudo-historical claims.** Output is generated, not authoritative. Any historical claim must be independently verified. - **Corpus credit.** All training data from kanripo (CC BY-SA 4.0). This derivative model inherits CC BY-SA 4.0. - **Cultural sensitivity.** Pre-1424 Chinese texts contain many views (on gender, ethnicity, governance) that do not align with modern values. The model may reproduce these. --- ## Quickstart ### With `transformers` + `peft` ```python from transformers import AutoModelForCausalLM, AutoTokenizer from peft import PeftModel import torch base = AutoModelForCausalLM.from_pretrained( "Qwen/Qwen2.5-3B-Instruct", torch_dtype=torch.float16, device_map="auto", ) tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct") model = PeftModel.from_pretrained(base, "Beltran12138/ming-vintage-qwen3b-lora") model.eval() prompt = "问: 光之本质为何? 答曰:" inputs = tokenizer(prompt, return_tensors="pt").to(model.device) out = model.generate(**inputs, max_new_tokens=150, temperature=0.7, do_sample=True) print(tokenizer.decode(out[0], skip_special_tokens=True)) ``` ### With MLX (Apple Silicon) ```bash pip install mlx-lm mlx_lm.generate \ --model mlx-community/Qwen2.5-3B-Instruct-4bit \ --adapter-path ./ming-vintage-qwen3b-lora \ --prompt "问: 光之本质为何? 答曰:" \ --max-tokens 200 --temp 0.7 ``` --- ## Citation ```bibtex @misc{ming-vintage-2026, author = {Beltran}, title = {ming-vintage-qwen3b-lora: a documented 1424 Chinese vintage LoRA adapter}, year = {2026}, publisher = {HuggingFace}, url = {https://huggingface.co/Beltran12138/ming-vintage-qwen3b-lora}, note = {GitHub: \url{https://github.com/Beltran12138/ming-vintage-llm}} } ``` If citing the corpus filtering or probe battery methodology specifically, please also cite kanripo. --- ## Acknowledgments - **kanripo** (漢籍リポジトリ, Kyoto University) for the CC BY-SA 4.0 Classical Chinese corpus. - **Qwen Team** (Alibaba) for the Qwen 2.5 base model. - **mlx-community** for the 4-bit MLX-quantized Qwen weights. - **talkie-lm** for the original vintage-LLM concept that inspired this work. --- ## License **CC BY-SA 4.0** (Creative Commons Attribution-ShareAlike 4.0 International), inherited from the kanripo source corpus. This means: you can use, modify, and redistribute this adapter, including commercially, but: (1) you must attribute, (2) derivatives must use the same license.