Update README.md
README.md
CHANGED
@@ -13,8 +13,8 @@ Last updated: 2025-08-25
 
 Gemma-SEA-LION-v4-27B is based on Gemma 3 (which supports over 100 languages)
 and is a multilingual model which has undergone continued pre-training on approximately **500B** tokens
-sampled from a bucket of over one trillion tokens across 11 SEA languages:
-Khmer, Lao, Malay, Mandarin, Tagalog, Tamil, Thai and Vietnamese.
+sampled from a bucket of over one trillion tokens across 11 SEA languages: Burmese, English,
+Indonesian, Khmer, Lao, Malay, Mandarin, Tagalog, Tamil, Thai and Vietnamese.
 
 Gemma-SEA-LION-v4-27B inherits Gemma 3's:
 
@@ -44,7 +44,7 @@ For tokenization, the model employs the default tokenizer used in Gemma 3 27B IT
 - **Shared by:** Products Pillar, AI Singapore
 - **Model type:** Decoder
 - **Context length:** 128k
-- **Language(s) (NLP):**
+- **Language(s) (NLP):** Burmese, English, Indonesian, Khmer, Lao, Malay, Mandarin, Tagalog, Tamil, Thai and Vietnamese
 - **License:** [Gemma Terms of Use](https://ai.google.dev/gemma/terms)
 - **Continued pretrained from model:** [Gemma-3-27B-IT](https://huggingface.co/google/gemma-3-27b-it)
 
@@ -127,7 +127,7 @@ print(output[0]["generated_text"][-1]["content"])
 
 <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
-The dataset comprises
+The dataset comprises Burmese, English, Indonesian, Khmer, Lao, Malay, Mandarin, Tagalog,
 Tamil, Thai and Vietnamese languages, collected from a mixture of sources including web data, code, open-source datasets,
 and synthetically generated datasets, amounting to a total of 500 billion tokens.
 