SAnocha committed on
Commit
15bb0cd
·
verified ·
1 Parent(s): 9c418e9

Update README

  1. README.md +49 -57
README.md CHANGED
@@ -21,18 +21,18 @@ license: gemma
21
  base_model_relation: finetune
22
 
23
  ---
24
- *Gemma-SEA-LION-v4-27B (Base Model) Last updated: 2025-08-18*
25
 
26
  ---
27
 
28
- # Model Card for Gemma-SEA-LION-v4-27B
29
 
30
  <!-- Provide a quick summary of what the model is/does. -->
31
 
32
  **SEA-LION** is a collection of Large Language Models (LLMs) which have been pretrained and instruct-tuned
33
  for the Southeast Asia (SEA) region.
34
 
35
- Gemma-SEA-LION-v4-27B is a multilingual model which has undergone continued pre-training on
36
  approximately **500B** tokens across 11 SEA languages: Bahasa Indonesia, Burmese, Chinese, English,
37
  Khmer, Lao, Malay, Tagalog, Tamil, Thai and Vietnamese.
38
 
@@ -46,7 +46,7 @@ Khmer, Lao, Malay, Tagalog, Tamil, Thai and Vietnamese.
46
  SEA-LION stands for *Southeast Asian Languages In One Network*.
47
 
48
  We performed continued pre-training in English and SEA languages on Gemma 3 27B IT,
49
- a decoder model using the Gemma 3 architecture, to create Gemma-SEA-LION-v4-27B.
50
 
51
  For tokenization, the model employs the default tokenizer used in Gemma 3 27B IT.
52
 
@@ -58,7 +58,7 @@ For tokenization, the model employs the default tokenizer used in Gemma 3 27B IT
58
  - **Context length:** 128k
59
  - **Language(s) (NLP):** Bahasa Indonesia, Burmese, Chinese, English, Khmer, Lao, Malay, Tagalog, Tamil, Thai and Vietnamese
60
  - **License:** [Gemma Terms of Use](https://ai.google.dev/gemma/terms)
61
- - **Finetuned from model:** [Gemma-3-27B-IT](https://huggingface.co/google/gemma-3-27b-it)
62
 
63
  ### Model Sources
64
 
@@ -92,7 +92,7 @@ due to the potential inconsistencies.
92
 
93
  **Limitations**
94
 
95
- In terms of vision capability, Gemma-SEA-LION-v4-27B has been trained and fine-tuned exclusively on the text back-end.
96
  As a result, its vision capabilities are expected to be comparable to those of Gemma 3 27B IT,
97
  and may not exhibit significant improvements or differences in this area. [🤗 google/gemma-3-27b-it](https://huggingface.co/google/gemma-3-27b-it)
98
 
@@ -110,7 +110,7 @@ import torch
110
 
111
  pipe = pipeline(
112
  "text-generation",
113
- model="aisingapore/Gemma-SEA-LION-v4-27B",
114
  device="cuda",
115
  torch_dtype=torch.bfloat16
116
  )
@@ -143,47 +143,17 @@ The dataset comprises Bahasa Indonesia, Burmese, Chinese, English, Khmer, Lao, M
143
  Thai and Vietnamese languages, collected from a mixture of sources including web data, code, open-source datasets,
144
  and synthetically generated datasets, amounting to a total of 500 billion tokens.
145
 
146
- The 500 billion tokens are sampled from a much larger pool of 1 trillion tokens from open-sourced datasets with the optimal datamix shown below determined by our experiments.
147
-
148
-
149
- | Language | Dataset Name | Total Tokens (B) | Percentage (%) | Total percentage (%) |
150
- |-----------------------------------|-------------------------|------------------|----------------|---------------------|
151
- | Code | StarCoder (OLMo 2 Version) | 50B | 10 | 10 |
152
- | EN | Fineweb-Edu | 80B | 16 | 40 |
153
- | | DCLM-OLMo2-HQ | 80B | 16 | |
154
- | | Non-CC-EN | 40B | 8 | |
155
- | ZH | SEA-LION Pile v1 | 13.5B | 2.7 | 9 |
156
- | | Fineweb2 | 13.5B | 2.7 | |
157
- | | Fineweb2-HQ | 4.5B | 0.9 | |
158
- | VI | SEA-LION Pile v1 | 4.25B | 0.85 | 8.5 |
159
- | | SEA-LION Pile v2 | 12.75B | 2.55 | |
160
- | | Fineweb2 | 8.5B | 1.7 | |
161
- | | Non-CC-VI | 17B | 3.4 | |
162
- | ID | SEA-LION Pile v1 | 5.66B | 1.13 | 8.5 |
163
- | | SEA-LION Pile v2 | 17B | 3.4 | |
164
- | | Fineweb2 | 11.33B | 2.27 | |
165
- | | Non-CC-ID | 8.5B | 1.7 | |
166
- | TH | SEA-LION Pile v1 | 3.035B | 0.61 | 8.5 |
167
- | | SEA-LION Pile v2 | 9.107B | 1.82 | |
168
- | | Fineweb2 | 3.035B | 0.61 | |
169
- | | WangChanBERTa | 3.035B | 0.61 | |
170
- | | Dolmav1 | 3.035B | 0.61 | |
171
- | | Non-CC-TH | 21.25B | 4.25 | |
172
- | TL, TA, MS, KM, LO and MY | ALL_LANG | 77.5B | 15.5 | 15.5 |
173
-
174
-
175
-
176
  Note:
177
 
178
  - All token counts are computed using the Gemma 3 tokenizer.
179
 
180
  - Pre-training was conducted with batches of 8k token lengths.
181
 
182
- - SEA-Pile v1 is processed from Common Crawl WET, which is published [here](https://huggingface.co/datasets/aisingapore/sea-lion-pile).
183
  The main proportion is from mC4 dataset (corpus [link](https://huggingface.co/datasets/bertin-project/mc4-sampling)).
184
  The cutoff date of this version is September 2020.
185
 
186
- - SEA-Pile v2 is processed from Common Crawl WARC from October 2020 to April 2024.
187
 
188
  - Tamil news is sourced with permission from [Seithi](https://seithi.mediacorp.sg/)
189
 
@@ -194,16 +164,17 @@ The cutoff date of this version is September 2020.
194
 
195
  #### Training Hyperparameters
196
 
197
- - **Training regime:**
198
-
199
-
200
- | Hyperparameter | Gemma-SEA-LION-v4-27B |
201
- |-------------------|-----------------------|
202
- | Precision | bfloat16 |
203
- | Optimizer | decoupled_adamw |
204
- | Scheduler | CosineAnnealing |
205
- | Learning Rate | 4.00E-08 |
206
- | Global Batch Size | 1024 |
 
207
  <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
208
 
209
 
@@ -217,11 +188,11 @@ The cutoff date of this version is September 2020.
217
 
218
  <!-- This should link to a Dataset Card if possible. -->
219
 
220
- We evaluated Gemma-SEA-LION-v4-27B on general language capabilities.
221
 
222
  **Testing Data**
223
 
224
- General NLP Behaviour
225
 
226
  For the evaluation of general language capabilities, we employed the SEA-HELM evaluation benchmark
227
  across a variety of tasks. These tasks include Question Answering (QA), Sentiment Analysis (Sentiment),
@@ -229,26 +200,47 @@ Toxicity Detection (Toxicity), Translation in both directions (Eng>Lang & Lang>E
229
  Abstractive Summarisation (Abssum), Causal Reasoning (Causal), Natural Language Inference (NLI),
230
  and linguistic diagnostics (LINDSEA).
231
 
232
 
233
  #### Factors
234
 
235
  <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
236
 
237
- Our evaluations were set based on task. For all tasks, the model is expected to provide an answer tag
238
- from which the answer is automatically extracted. For tasks where options are provided,
239
- the answer should comprise one of the pre-defined options. The scores for each task are normalised to account
240
- for baseline performance due to random chance.
241
 
242
 
243
  #### Metrics
244
 
245
  <!-- These are the evaluation metrics being used, ideally with a description of why. -->
246
 
247
- The evaluation was done **five-shot** with native prompts on a sample of 100-1000 instances for each dataset.
248
 
249
  ### Results
250
 
251
- For details on Gemma-SEA-LION-v4-27B performance, please refer to the SEA-HELM leaderboard, [Leaderboard results on SEA-HELM](https://leaderboard.sea-lion.ai/).
252
 
253
 
254
  #### Summary
 
21
  base_model_relation: finetune
22
 
23
  ---
24
+ *Gemma-SEA-LION-v4-27B-IT (IT Model) Last updated: 2025-08-18*
25
 
26
  ---
27
 
28
+ # Model Card for Gemma-SEA-LION-v4-27B-IT
29
 
30
  <!-- Provide a quick summary of what the model is/does. -->
31
 
32
  **SEA-LION** is a collection of Large Language Models (LLMs) which have been pretrained and instruct-tuned
33
  for the Southeast Asia (SEA) region.
34
 
35
+ Gemma-SEA-LION-v4-27B-IT is a multilingual model which has undergone continued pre-training on
36
  approximately **500B** tokens across 11 SEA languages: Bahasa Indonesia, Burmese, Chinese, English,
37
  Khmer, Lao, Malay, Tagalog, Tamil, Thai and Vietnamese.
38
 
 
46
  SEA-LION stands for *Southeast Asian Languages In One Network*.
47
 
48
  We performed continued pre-training in English and SEA languages on Gemma 3 27B IT,
49
+ a decoder model using the Gemma 3 architecture, to create Gemma-SEA-LION-v4-27B-IT.
50
 
51
  For tokenization, the model employs the default tokenizer used in Gemma 3 27B IT.
52
 
 
58
  - **Context length:** 128k
59
  - **Language(s) (NLP):** Bahasa Indonesia, Burmese, Chinese, English, Khmer, Lao, Malay, Tagalog, Tamil, Thai and Vietnamese
60
  - **License:** [Gemma Terms of Use](https://ai.google.dev/gemma/terms)
61
+ - **Finetuned from model:** Gemma-SEA-LION-v4-27B
62
 
63
  ### Model Sources
64
 
 
92
 
93
  **Limitations**
94
 
95
+ In terms of vision capability, Gemma-SEA-LION-v4-27B-IT has been trained and fine-tuned exclusively on the text back-end.
96
  As a result, its vision capabilities are expected to be comparable to those of Gemma 3 27B IT,
97
  and may not exhibit significant improvements or differences in this area. [🤗 google/gemma-3-27b-it](https://huggingface.co/google/gemma-3-27b-it)
98
 
 
110
 
111
  pipe = pipeline(
112
  "text-generation",
113
+ model="aisingapore/Gemma-SEA-LION-v4-27B-IT",
114
  device="cuda",
115
  torch_dtype=torch.bfloat16
116
  )
 
143
  Thai and Vietnamese languages, collected from a mixture of sources including web data, code, open-source datasets,
144
  and synthetically generated datasets, amounting to a total of 500 billion tokens.
145
 
146
  Note:
147
 
148
  - All token counts are computed using the Gemma 3 tokenizer.
149
 
150
  - Pre-training was conducted with batches of 8k token lengths.
151
 
152
+ - SEA-LION Pile v1 is processed from Common Crawl WET, which is published [here](https://huggingface.co/datasets/aisingapore/sea-lion-pile).
153
  The main proportion is from mC4 dataset (corpus [link](https://huggingface.co/datasets/bertin-project/mc4-sampling)).
154
  The cutoff date of this version is September 2020.
155
 
156
+ - SEA-LION Pile v2 is processed from Common Crawl WARC from October 2020 to April 2024.
157
 
158
  - Tamil news is sourced with permission from [Seithi](https://seithi.mediacorp.sg/)
159
 
 
164
 
165
  #### Training Hyperparameters
166
 
167
+ - **Training regime:** We perform post-training using a variety of Reinforcement Learning (RL) methods.
168
+ The instruction fine-tuning dataset combines our SEA-Instruct, Infinity-Instruct,
169
+ and OpenMath-Instruct 2 with open-source datasets such as
170
+ nvidia/Llama-Nemotron-Post-Training-Dataset (RL set) and zwhe99/DeepMath-103K.
171
+ Prompt sampling is guided by a gradient-based analysis process.
172
+
173
+ Our post-training workflow consists of multiple stages: instruction fine-tuning,
174
+ model merging, online RL for both instruction following and math using DRGPPO,
175
+ and on-policy alignment via APO. For alignment, rejected-chosen pairs are generated
176
+ from the target model, with the *chosen* responses obtained by rewriting and improving upon
177
+ the *rejected* outputs.
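
The pair construction described above can be illustrated with a minimal sketch. Everything named here is hypothetical: `PreferencePair`, `build_preference_pairs`, and the `generate` and `rewrite` callables stand in for the target model and the rewriting step, which the README does not specify. Dropping pairs whose rewrite is identical to the original is an extra assumption, not something the README states.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # improved rewrite of the model's own output
    rejected: str  # original on-policy output from the target model


def build_preference_pairs(
    prompts: Iterable[str],
    generate: Callable[[str], str],      # hypothetical: samples from the target model
    rewrite: Callable[[str, str], str],  # hypothetical: improves a given response
) -> List[PreferencePair]:
    pairs = []
    for prompt in prompts:
        rejected = generate(prompt)
        chosen = rewrite(prompt, rejected)
        # Assumption: a pair with no real difference carries no preference signal.
        if chosen.strip() != rejected.strip():
            pairs.append(PreferencePair(prompt, chosen, rejected))
    return pairs


# Toy stand-ins so the sketch runs without a model.
demo_pairs = build_preference_pairs(
    ["Translate 'hello' into Malay."],
    generate=lambda p: "helo",
    rewrite=lambda p, r: "The Malay word for 'hello' is 'helo'.",
)
print(demo_pairs[0].chosen)
```

Because the rejected side comes from the model being aligned, the pairs stay on-policy; the actual rewriting procedure and any filtering used by the authors are not described in this README.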
178
  <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
179
 
180
 
 
188
 
189
  <!-- This should link to a Dataset Card if possible. -->
190
 
191
+ We evaluated Gemma-SEA-LION-v4-27B-IT on both general language capabilities and instruction-following capabilities.
192
 
193
  **Testing Data**
194
 
195
+ General
196
 
197
  For the evaluation of general language capabilities, we employed the SEA-HELM evaluation benchmark
198
  across a variety of tasks. These tasks include Question Answering (QA), Sentiment Analysis (Sentiment),
 
200
  Abstractive Summarisation (Abssum), Causal Reasoning (Causal), Natural Language Inference (NLI),
201
  and linguistic diagnostics (LINDSEA).
202
 
203
+ Instruction-following
204
+
205
+ We evaluated the models on instruction-following capabilities with two datasets,
206
+ SEA-IFEval (based on IFEval) and SEA-MTBench (based on MT-Bench).
207
+ The two datasets were originally in English; the linguists and native speakers
208
+ in the team worked together to filter, localise and translate the datasets
209
+ into the respective target languages to ensure that the examples remained reasonable,
210
+ meaningful and natural.
211
+
212
 
213
  #### Factors
214
 
215
  <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
216
 
217
+ For instruction-following tasks, our evaluations were organised based on each specific task.
218
+
219
+ SEA-IFEval (more languages)
220
+
221
+ SEA-IFEval evaluates a model's ability to adhere to constraints provided in the prompt,
222
+ for example beginning a response with a specific word/phrase or answering with
223
+ a certain number of sections. Additionally, accuracy is normalised by the proportion of responses
224
+ in the correct language (if the model performs the task correctly but responds in the wrong language,
225
+ it is judged to have failed the task).
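
As a rough illustration of the scoring rule just described, the sketch below counts a response as correct only when it both satisfies the prompt constraints and is written in the target language. The names `IFEvalResult` and `language_normalised_accuracy` are hypothetical, and the two boolean checks are assumed to come from upstream steps (constraint checking and language identification) that are not shown; this is not the SEA-HELM implementation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class IFEvalResult:
    followed_instruction: bool  # constraint check passed (assumed computed upstream)
    in_target_language: bool    # language check passed (assumed computed upstream)


def language_normalised_accuracy(results: Sequence[IFEvalResult]) -> float:
    if not results:
        raise ValueError("results must not be empty")
    # A response in the wrong language fails even if the constraint was met.
    passed = sum(r.followed_instruction and r.in_target_language for r in results)
    return passed / len(results)


results = [
    IFEvalResult(True, True),
    IFEvalResult(True, False),   # correct behaviour, wrong language: counted as a failure
    IFEvalResult(False, True),
    IFEvalResult(True, True),
]
print(language_normalised_accuracy(results))  # 0.5
```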
226
+
227
+ SEA-MTBench
228
+
229
+ SEA-MTBench evaluates a model's ability to engage in multi-turn (2 turns) conversations and
230
+ respond in ways that align with human needs. We use gpt-4-1106-preview as the judge model and
231
+ compare against gpt-3.5-turbo-0125 as the baseline model. The metric used is the weighted win rate
232
+ against the baseline model (i.e. average win rate across each category: Math, Reasoning, STEM, Humanities, Roleplay, Writing, Extraction).
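
The weighted win rate above can be read as a macro-average: compute a win rate per category, then average those rates so that each category counts equally regardless of how many prompts it contains. The sketch below assumes each judged comparison has already been reduced to a score of 1.0 (win) or 0.0 (loss); scoring a tie as 0.5 is an assumption, since the README does not say how ties are handled. The function name `weighted_win_rate` is hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def weighted_win_rate(judgements: Iterable[Tuple[str, float]]) -> float:
    """Average the per-category win rates of (category, score) judgements."""
    per_category = defaultdict(list)
    for category, score in judgements:
        per_category[category].append(score)
    if not per_category:
        raise ValueError("judgements must not be empty")
    # Win rate within each category first, then an unweighted mean across categories.
    category_rates = [sum(scores) / len(scores) for scores in per_category.values()]
    return sum(category_rates) / len(category_rates)


judgements = [
    ("Math", 1.0), ("Math", 0.0),                                          # 0.50
    ("Writing", 1.0), ("Writing", 1.0), ("Writing", 1.0), ("Writing", 0.0),  # 0.75
]
print(weighted_win_rate(judgements))  # 0.625
```

Pooling all six comparisons instead would give 4/6, so the per-category averaging matters whenever categories have different numbers of prompts.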
233
 
234
 
235
  #### Metrics
236
 
237
  <!-- These are the evaluation metrics being used, ideally with a description of why. -->
238
 
239
+ The evaluation was done **zero-shot** with native prompts on a sample of 100-1000 instances for each dataset.
240
 
241
  ### Results
242
 
243
+ For details on Gemma-SEA-LION-v4-27B-IT performance, please refer to the SEA-HELM leaderboard, [Leaderboard results on SEA-HELM](https://leaderboard.sea-lion.ai/).
244
 
245
 
246
  #### Summary