aisingapore/Gemma-SEA-LION-v4-27B

Jian-Gang committed on Aug 25, 2025

Commit c5c5505 (verified) · 1 Parent(s): b11439f

Update README.md

Files changed (1)

README.md +4 -4

README.md CHANGED

@@ -13,8 +13,8 @@ Last updated: 2025-08-25
 Gemma-SEA-LION-v4-27B is based on Gemma 3 (which supports over 100 languages)
 and is a multilingual model which has undergone continued pre-training on approximately **500B** tokens
-sampled from a bucket of over one trillion tokens across 11 SEA languages: Bahasa Indonesia, Burmese, English,
-Khmer, Lao, Malay, Mandarin, Tagalog, Tamil, Thai and Vietnamese.
+sampled from a bucket of over one trillion tokens across 11 SEA languages: Burmese, English,
+Indonesian, Khmer, Lao, Malay, Mandarin, Tagalog, Tamil, Thai and Vietnamese.
 Gemma-SEA-LION-v4-27B inherits Gemma 3's:
@@ -44,7 +44,7 @@ For tokenization, the model employs the default tokenizer used in Gemma 3 27B IT
 - **Shared by:** Products Pillar, AI Singapore
 - **Model type:** Decoder
 - **Context length:** 128k
-- **Language(s) (NLP):**  Bahasa Indonesia, Burmese, English, Khmer, Lao, Malay, Mandarin, Tagalog, Tamil, Thai and Vietnamese
+- **Language(s) (NLP):**  Burmese, English, Indonesian, Khmer, Lao, Malay, Mandarin, Tagalog, Tamil, Thai and Vietnamese
 - **License:** [Gemma Terms of Use](https://ai.google.dev/gemma/terms)
 - **Continued pretrained from model:** [Gemma-3-27B-IT](https://huggingface.co/google/gemma-3-27b-it)
@@ -127,7 +127,7 @@ print(output[0]["generated_text"][-1]["content"])
 <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-The dataset comprises Bahasa Indonesia, Burmese, English, Khmer, Lao, Malay, Mandarin, Tagalog,
+The dataset comprises Burmese, English, Indonesian, Khmer, Lao, Malay, Mandarin, Tagalog,
 Tamil, Thai and Vietnamese languages, collected from a mixture of sources including web data, code, open-source datasets,
 and synthetically generated datasets, amounting to a total of 500 billion tokens.