---
license: mit
tags:
- tokenizer
- sentencepiece
- multilingual
- cluster-8
- vocab-128000
---

# Grand Tokenizer - Cluster 8 (Vocab 128000)

This is a multilingual tokenizer trained on cluster 8 with vocabulary size 128000.

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tokenizer-iso-cluster-8-vocab-128000")
```

## Files

- `final_normalized_tokenizer.model`: SentencePiece model file
- `final_normalized_tokenizer.vocab`: Vocabulary file
- `tokenizer.config`: Tokenizer configuration
- `special_tokens_map.json`: Special tokens mapping

## Training Details

- Cluster: 8
- Vocabulary Size: 128000
- Model Type: SentencePiece Unigram
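
As a complement to the Usage and Files sections above, the sketch below shows one possible way to check a local copy of the repository and load the SentencePiece model file directly. It is a minimal sketch, not part of this repository: the helper names `missing_tokenizer_files` and `load_sentencepiece_model` are hypothetical, the file names are taken from the Files list, and it assumes the files have already been downloaded to a local directory and that the `sentencepiece` package is installed.

```python
from pathlib import Path

# File names as listed in the Files section of this card.
EXPECTED_FILES = (
    "final_normalized_tokenizer.model",
    "final_normalized_tokenizer.vocab",
    "tokenizer.config",
    "special_tokens_map.json",
)

MODEL_FILE = "final_normalized_tokenizer.model"


def missing_tokenizer_files(directory):
    """Return the expected file names that are absent from `directory`."""
    root = Path(directory)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def load_sentencepiece_model(directory):
    """Load the .model file from a local directory (hypothetical helper).

    Assumes the `sentencepiece` package is installed. The import is done
    lazily so the file check above works without it.
    """
    model_path = Path(directory) / MODEL_FILE
    if not model_path.is_file():
        raise FileNotFoundError(f"SentencePiece model not found: {model_path}")

    import sentencepiece as spm

    return spm.SentencePieceProcessor(model_file=str(model_path))
```

With a local copy in place, `sp = load_sentencepiece_model("path/to/local/copy")` followed by `sp.encode("some text", out_type=str)` would return the subword pieces, and `sp.get_piece_size()` should match the vocabulary size stated under Training Details if the model file corresponds to this card.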