Base + Language-Specific LangMAP: xglm-2_9b × fin_Latn
Unsupervised tokenization specialised for fin_Latn, derived from the xglm-2_9b base BPE tokenizer using the LangMAP framework.
This repository bundles:
- base_tokenizer.json: joint LAS Unigram base
- langspec_fin_Latn.json: language-specific overlay (re-EM on fin_Latn corpus)
- tokenizer.json: alias for the overlay (default load target)
Inference uses base + language-specific scores together (the LangMAP variant); do not use the bare overlay or base on its own.
Trained from job smoke.fin.xglm-2_9b.v256008 (vocab=256008, langs=[fin_Latn], iters=5,
em_mode=soft, byte_fallback=True, seed-fix applied).
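To see how the overlay re-scores pieces relative to the base, the two score tables can be read side by side. The following is a minimal sketch, not the LangMAP inference procedure: it assumes both files use the Hugging Face tokenizers JSON layout, in which a Unigram model is stored under "model" with "type" set to "Unigram" and "vocab" as a list of [piece, score] pairs. The helper names and the example pieces are hypothetical, and the sketch does not show how the two scores are combined at inference time.

```python
import json
from pathlib import Path


def unigram_scores(path):
    """Return {piece: score} from a tokenizers JSON file, or None if it is not Unigram."""
    model = json.loads(Path(path).read_text(encoding="utf-8"))["model"]
    if model.get("type") != "Unigram":
        return None
    # Assumption: Unigram vocabularies are stored as [piece, score] pairs.
    return {piece: score for piece, score in model["vocab"]}


def compare_scores(base_path, overlay_path, pieces):
    """List (piece, base_score, overlay_score); a score is None if the piece is absent."""
    base = unigram_scores(base_path)
    overlay = unigram_scores(overlay_path)
    if base is None or overlay is None:
        raise ValueError("both files must contain a Unigram model")
    return [(p, base.get(p), overlay.get(p)) for p in pieces]


if __name__ == "__main__":
    base_file = Path("base_tokenizer.json")
    overlay_file = Path("langspec_fin_Latn.json")
    # Only runs when the bundled files are present in the working directory.
    if base_file.is_file() and overlay_file.is_file():
        # Hypothetical example pieces; replace with pieces from the actual vocabulary.
        for row in compare_scores(base_file, overlay_file, ["▁talo", "ssa"]):
            print(row)
```

A piece reported with None on one side is missing from that file's vocabulary, which is worth checking before assuming the two tables cover the same pieces.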
Loading
from tokenizers import Tokenizer
tok = Tokenizer.from_file("tokenizer.json")