Base + Language-Specific LangMAP: xglm-2_9b × fin_Latn

Unsupervised tokenization specialised for fin_Latn, derived from the xglm-2_9b base BPE tokenizer using the LangMAP framework.

This repository bundles:

  • base_tokenizer.json: joint LAS Unigram base
  • langspec_fin_Latn.json: language-specific overlay (re-EM on fin_Latn corpus)
  • tokenizer.json: alias for the overlay (default load target)

Inference uses the base and language-specific scores together (the LangMAP variant); do not use the bare overlay or the base on its own.

Trained from job smoke.fin.xglm-2_9b.v256008 (vocab=256008, langs=[fin_Latn], iters=5, em_mode=soft, byte_fallback=True, seed-fix applied).

Loading

from tokenizers import Tokenizer
tok = Tokenizer.from_file("tokenizer.json")
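
Before relying on the files, it can help to check that each bundled JSON matches the configuration stated above (for example vocab=256008). The sketch below uses only the standard library and assumes the files follow the usual Hugging Face tokenizers JSON layout, with a top-level "model" object holding "type" and "vocab". The helper name describe_tokenizer is hypothetical and is not part of this repository or of the tokenizers library.

```python
import json
from pathlib import Path


def describe_tokenizer(path):
    """Return (model_type, vocab_size) for a tokenizers-style JSON file.

    Assumes the standard layout: {"model": {"type": ..., "vocab": ...}}.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    model = data.get("model", {})
    vocab = model.get("vocab", [])
    # Unigram models store the vocab as a list of [piece, score] pairs,
    # while BPE and WordPiece store a {token: id} mapping; len() covers both.
    return model.get("type"), len(vocab)


if __name__ == "__main__":
    # File names are the ones listed in this repository's bundle.
    for name in ("base_tokenizer.json", "langspec_fin_Latn.json", "tokenizer.json"):
        if Path(name).exists():
            print(name, describe_tokenizer(name))
```

The reported size counts only entries in the model vocabulary, so it may differ from the configured value if the file also declares extra entries under "added_tokens".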