Jian-Gang committed · Commit c5c5505 (verified) · 1 Parent(s): b11439f

Update README.md

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -13,8 +13,8 @@ Last updated: 2025-08-25
 
 Gemma-SEA-LION-v4-27B is based on Gemma 3 (which supports over 100 languages)
 and is a multilingual model which has undergone continued pre-training on approximately **500B** tokens
-sampled from a bucket of over one trillion tokens across 11 SEA languages: Bahasa Indonesia, Burmese, English,
-Khmer, Lao, Malay, Mandarin, Tagalog, Tamil, Thai and Vietnamese.
+sampled from a bucket of over one trillion tokens across 11 SEA languages: Burmese, English,
+Indonesian, Khmer, Lao, Malay, Mandarin, Tagalog, Tamil, Thai and Vietnamese.
 
 Gemma-SEA-LION-v4-27B inherits Gemma 3's:
 
@@ -44,7 +44,7 @@ For tokenization, the model employs the default tokenizer used in Gemma 3 27B IT
 - **Shared by:** Products Pillar, AI Singapore
 - **Model type:** Decoder
 - **Context length:** 128k
-- **Language(s) (NLP):** Bahasa Indonesia, Burmese, English, Khmer, Lao, Malay, Mandarin, Tagalog, Tamil, Thai and Vietnamese
+- **Language(s) (NLP):** Burmese, English, Indonesian, Khmer, Lao, Malay, Mandarin, Tagalog, Tamil, Thai and Vietnamese
 - **License:** [Gemma Terms of Use](https://ai.google.dev/gemma/terms)
 - **Continued pretrained from model:** [Gemma-3-27B-IT](https://huggingface.co/google/gemma-3-27b-it)
 
@@ -127,7 +127,7 @@ print(output[0]["generated_text"][-1]["content"])
 
 <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
-The dataset comprises Bahasa Indonesia, Burmese, English, Khmer, Lao, Malay, Mandarin, Tagalog,
+The dataset comprises Burmese, English, Indonesian, Khmer, Lao, Malay, Mandarin, Tagalog,
 Tamil, Thai and Vietnamese languages, collected from a mixture of sources including web data, code, open-source datasets,
 and synthetically generated datasets, amounting to a total of 500 billion tokens.
 