tokyotech-llm/Llama-3.1-8B-math-ablation-exp1-LR2.5e-5-WD0.1-iter0010000

@@ -8,7 +8,6 @@ language:
 base_model:
   - meta-llama/Llama-3.1-8B
 ---
 # Model Card
 <img src="https://huggingface.co/datasets/tokyotech-llm/swallow-math/resolve/main/figures/swallow-code-math-log.png" alt="SwallowCodeMath Icon" width="600">
@@ -17,10 +16,10 @@ base_model:
 ## Model Summary
-This model was obtained by continual pre-training of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) on a mix of mathematical data from [SwallowMath](https://huggingface.co/datasets/tokyotech-llm/swallow-math), code, and multilingual text datasets.
-The model was trained to evaluate mathematical reasoning and problem-solving performance as part of the SwallowMath ablation experiments (experiment 2).
-It was trained on **50 billion tokens** using a mix of 4.8% SwallowMath (rewritten FineMath-4+), 13.1% Code, and 82% multilingual text, following the setup described in the [SwallowMath paper](https://arxiv.org/abs/XXXX.XXXXX).
 Training was performed using [Megatron-LM](https://github.com/NVIDIA/Megatron-LM/tree/core_r0.9.0).
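As a rough illustration of what the data mix in the summary implies, the sketch below converts the stated percentages (4.8% math, 13.1% code, 82% multilingual text) into approximate token counts out of the 50 billion training tokens. The names `data_mix` and `tokens_per_source` are hypothetical and are not part of the training setup; the percentages sum to 99.9%, presumably because of rounding.

```python
# Illustrative only: approximate token budget per data source, derived from
# the 50B-token total and the mix percentages quoted in the model summary.
TOTAL_TOKENS = 50e9  # 50 billion tokens

# Mix fractions as reported in the summary (hypothetical key names).
data_mix = {
    "math": 0.048,
    "code": 0.131,
    "multilingual_text": 0.82,
}


def tokens_per_source(total, mix):
    """Return the approximate number of tokens contributed by each source."""
    return {name: total * share for name, share in mix.items()}


budget = tokens_per_source(TOTAL_TOKENS, data_mix)
for name, count in budget.items():
    print(f"{name}: ~{count / 1e9:.2f}B tokens")
```

Under these assumptions the math portion comes to roughly 2.4B tokens, code to about 6.55B, and multilingual text to about 41B.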
 ## Use
@@ -30,13 +29,10 @@ Training was performed using [Megatron-LM](https://github.com/NVIDIA/Megatron-LM
 ```python
 # pip install -q transformers
 from transformers import AutoModelForCausalLM, AutoTokenizer
 model_id = "tokyotech-llm/<model-name>"  # placeholder: substitute the repository name
 device = "cuda"  # use "cpu" to run without a GPU
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
 # Tokenize the prompt and move the input IDs to the same device as the model
 inputs = tokenizer.encode("Solve the equation 2x + 3 = 7:", return_tensors="pt").to(device)
 # max_length caps the prompt plus the generated tokens
 outputs = model.generate(inputs, max_length=100)
 print(tokenizer.decode(outputs[0]))
@@ -81,24 +77,18 @@ Details are in the paper’s Appendix.
 - Megatron-LM (version core_r0.9.0) for training
 - lm-evaluation-harness for evaluation
 - BigCodeBench for code evaluation
 ## Evaluation
 The model was evaluated using the setup described in the SwallowMath paper, with the lm-evaluation-harness and BigCodeBench. Benchmarks include mathematical reasoning (GSM8K, MATH), code generation (HumanEval), and general tasks (OpenBookQA, TriviaQA, HellaSwag, SQuAD 2.0, XWINO, MMLU, BBH).
 Results are reported for checkpoints at 10B, 20B, 30B, 40B, and 50B tokens.
-### Evaluation Results (SwallowMath experiment 2)
 | Tokens (B) | OpenBookQA | TriviaQA | HellaSwag | SQuAD2.0 | XWINO | MMLU | HumanEval | GSM8K | BBH | MATH |
 |------------|------------|----------|-----------|----------|-------|------|-----------|-------|-----|------|
-| 10 | 0.3720 | 0.6643 | 0.5970 | 0.3443 | 0.9015 | 0.6343 | 0.3439 | 0.5603 | 0.5535 | 0.2480 |
-| 20 | 0.3800 | 0.6580 | 0.5946 | 0.3428 | 0.8994 | 0.6293 | 0.3762 | 0.6156 | 0.5669 | 0.2860 |
-| 30 | 0.3660 | 0.6618 | 0.5964 | 0.3470 | 0.9011 | 0.6298 | 0.3530 | 0.6262 | 0.6383 | 0.3040 |
-| 40 | 0.3700 | 0.6610 | 0.5973 | 0.3535 | 0.9088 | 0.6358 | 0.3738 | 0.6422 | 0.6237 | 0.3100 |
-| 50 | 0.3800 | 0.6637 | 0.5972 | 0.3537 | 0.9045 | 0.6337 | 0.3683 | 0.6535 | 0.6414 | 0.3160 |
 ## Citation
 ```bibtex
 @misc{fujii2025rewritingpretrainingdata,
   title={Rewriting Pre-Training Data: Boosting LLM Performance in Math and Code},

 base_model:
   - meta-llama/Llama-3.1-8B
 ---
 # Model Card
 <img src="https://huggingface.co/datasets/tokyotech-llm/swallow-math/resolve/main/figures/swallow-code-math-log.png" alt="SwallowCodeMath Icon" width="600">
 ## Model Summary
+This model was obtained by continual pre-training of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) on a mix of mathematical data from FineMath-4+, code, and multilingual text datasets.
+The model was trained to evaluate mathematical reasoning and problem-solving performance as part of the SwallowMath ablation experiments (experiment 1).
+It was trained on **50 billion tokens** using a mix of 4.8% FineMath-4+, 13.1% Code, and 82% multilingual text, following the setup described in the [SwallowMath paper](https://arxiv.org/abs/XXXX.XXXXX).
 Training was performed using [Megatron-LM](https://github.com/NVIDIA/Megatron-LM/tree/core_r0.9.0).
 ## Use
 ```python
 # pip install -q transformers
 from transformers import AutoModelForCausalLM, AutoTokenizer
 model_id = "tokyotech-llm/<model-name>"  # placeholder: substitute the repository name
 device = "cuda"  # use "cpu" to run without a GPU
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
 # Tokenize the prompt and move the input IDs to the same device as the model
 inputs = tokenizer.encode("Solve the equation 2x + 3 = 7:", return_tensors="pt").to(device)
 # max_length caps the prompt plus the generated tokens
 outputs = model.generate(inputs, max_length=100)
 print(tokenizer.decode(outputs[0]))
 - Megatron-LM (version core_r0.9.0) for training
 - lm-evaluation-harness for evaluation
 - BigCodeBench for code evaluation
 ## Evaluation
 The model was evaluated using the setup described in the SwallowMath paper, with the lm-evaluation-harness and BigCodeBench. Benchmarks include mathematical reasoning (GSM8K, MATH), code generation (HumanEval), and general tasks (OpenBookQA, TriviaQA, HellaSwag, SQuAD 2.0, XWINO, MMLU, BBH).
 Results are reported for checkpoints at 10B, 20B, 30B, 40B, and 50B tokens.
+### Evaluation Results (FineMath-4+ experiment 1)
 | Tokens (B) | OpenBookQA | TriviaQA | HellaSwag | SQuAD2.0 | XWINO | MMLU | HumanEval | GSM8K | BBH | MATH |
 |------------|------------|----------|-----------|----------|-------|------|-----------|-------|-----|------|
+| 10 | 0.3700 | 0.6626 | 0.5990 | 0.3350 | 0.8985 | 0.6243 | 0.3439 | 0.4685 | 0.6057 | 0.1760 |
+| 20 | 0.3720 | 0.6536 | 0.5963 | 0.3510 | 0.9032 | 0.6261 | 0.3622 | 0.5011 | 0.5896 | 0.2080 |
+| 30 | 0.3700 | 0.6574 | 0.5999 | 0.3506 | 0.8998 | 0.6253 | 0.3561 | 0.5019 | 0.5971 | 0.2260 |
+| 40 | 0.3720 | 0.6577 | 0.6024 | 0.3499 | 0.9049 | 0.6312 | 0.3701 | 0.5231 | 0.6054 | 0.2260 |
+| 50 | 0.3740 | 0.6608 | 0.6001 | 0.3550 | 0.9058 | 0.6329 | 0.3561 | 0.5292 | 0.6166 | 0.2400 |
 ## Citation
 ```bibtex
 @misc{fujii2025rewritingpretrainingdata,
   title={Rewriting Pre-Training Data: Boosting LLM Performance in Math and Code},