Update README.md
README.md
CHANGED
@@ -8,7 +8,6 @@ language:
 base_model:
 - meta-llama/Llama-3.1-8B
 ---
-
 # Model Card
 
 <img src="https://huggingface.co/datasets/tokyotech-llm/swallow-math/resolve/main/figures/swallow-code-math-log.png" alt="SwallowCodeMath Icon" width="600">
@@ -17,10 +16,10 @@ base_model:
 
 ## Model Summary
 
-This model is a continual pre-training of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) on a mix of mathematical datasets from
-The model was trained to evaluate the performance of mathematical reasoning and problem-solving as part of the SwallowMath ablation experiments (experiment
 
-It was trained on **50 billion tokens** using a mix of 4.8%
 Training was performed using [Megatron-LM](https://github.com/NVIDIA/Megatron-LM/tree/core_r0.9.0).
 
 ## Use
@@ -30,13 +29,10 @@ Training was performed using [Megatron-LM](https://github.com/NVIDIA/Megatron-LM
 ```python
 # pip install -q transformers
 from transformers import AutoModelForCausalLM, AutoTokenizer
-
 model = "tokyotech-llm/<model-name>"
 device = "cuda" # for GPU usage or "cpu" for CPU usage
-
 tokenizer = AutoTokenizer.from_pretrained(model)
 model = AutoModelForCausalLM.from_pretrained(model).to(device)
-
 inputs = tokenizer.encode("Solve the equation 2x + 3 = 7:", return_tensors="pt").to(device)
 outputs = model.generate(inputs, max_length=100)
 print(tokenizer.decode(outputs[0]))
@@ -81,24 +77,18 @@ Details are in the paper’s Appendix.
 - Megatron-LM (version core_r0.9.0) for training
 - lm-evaluation-harness for evaluation
 - BigCodeBench for code evaluation
-
 ## Evaluation
-
 The model was evaluated using the setup described in the SwallowMath paper, with the lm-evaluation-harness and BigCodeBench. Benchmarks include mathematical reasoning (GSM8K, MATH), code generation (HumanEval), and general tasks (OpenBookQA, TriviaQA, HellaSwag, SQuAD 2.0, XWINO, MMLU, BBH).
 Results are reported for checkpoints at 10B, 20B, 30B, 40B, and 50B tokens.
-
-### Evaluation Results (SwallowMath experiment 2)
-
 | Tokens (B) | OpenBookQA | TriviaQA | HellaSwag | SQuAD2.0 | XWINO | MMLU | HumanEval | GSM8K | BBH | MATH |
 |------------|------------|----------|-----------|----------|-------|------|-----------|-------|-----|------|
-| 10 | 0.
-| 20 | 0.
-| 30 | 0.
-| 40 | 0.
-| 50 | 0.
-
 ## Citation
-
 ```bibtex
 @misc{fujii2025rewritingpretrainingdata,
 title={Rewriting Pre-Training Data: Boosting LLM Performance in Math and Code},
@@ -8,7 +8,6 @@ language:
 base_model:
 - meta-llama/Llama-3.1-8B
 ---
 # Model Card
 
 <img src="https://huggingface.co/datasets/tokyotech-llm/swallow-math/resolve/main/figures/swallow-code-math-log.png" alt="SwallowCodeMath Icon" width="600">
@@ -17,10 +16,10 @@ base_model:
 
 ## Model Summary
 
+This model is a continual pre-training of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) on a mix of mathematical datasets from finemath-4+ and multilingual text datasets.
+The model was trained to evaluate the performance of mathematical reasoning and problem-solving as part of the SwallowMath ablation experiments (experiment 1).
 
+It was trained on **50 billion tokens** using a mix of 4.8% Finemath-4+, 13.1% Code, and 82% multilingual text, following the setup described in the [SwallowMath paper](https://arxiv.org/abs/XXXX.XXXXX).
 Training was performed using [Megatron-LM](https://github.com/NVIDIA/Megatron-LM/tree/core_r0.9.0).
 
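As a rough check on the data mix stated in the Model Summary, the sketch below converts the listed percentages into approximate token counts. It assumes the percentages apply directly to the 50 billion token budget. The names `TOTAL_TOKENS`, `MIX_PER_MILLE`, and `tokens_per_source` are illustrative and do not come from the training code.

```python
# Illustrative arithmetic only: token budget implied by the stated data mix.
TOTAL_TOKENS = 50_000_000_000  # "50 billion tokens" from the Model Summary

# Shares in tenths of a percent, so the arithmetic stays in exact integers.
MIX_PER_MILLE = {
    "Finemath-4+": 48,         # 4.8%
    "Code": 131,               # 13.1%
    "multilingual text": 820,  # 82%
}


def tokens_per_source(total, mix_per_mille):
    """Return the approximate token count for each data source."""
    return {name: total * share // 1000 for name, share in mix_per_mille.items()}


budget = tokens_per_source(TOTAL_TOKENS, MIX_PER_MILLE)
for name, count in budget.items():
    print(f"{name}: {count / 1e9:.2f}B tokens")

# Share of the budget covered by the three listed sources.
print(f"covered share: {sum(MIX_PER_MILLE.values()) / 10:.1f}%")
```

Under these assumptions the mix works out to about 2.40B Finemath-4+ tokens, 6.55B code tokens, and 41.00B multilingual text tokens. The listed shares add up to 99.9% rather than 100%, which is presumably a rounding effect in the reported percentages.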
 ## Use
@@ -30,13 +29,10 @@ Training was performed using [Megatron-LM](https://github.com/NVIDIA/Megatron-LM
 ```python
 # pip install -q transformers
 from transformers import AutoModelForCausalLM, AutoTokenizer
 model = "tokyotech-llm/<model-name>"
 device = "cuda" # for GPU usage or "cpu" for CPU usage
 tokenizer = AutoTokenizer.from_pretrained(model)
 model = AutoModelForCausalLM.from_pretrained(model).to(device)
 inputs = tokenizer.encode("Solve the equation 2x + 3 = 7:", return_tensors="pt").to(device)
 outputs = model.generate(inputs, max_length=100)
 print(tokenizer.decode(outputs[0]))
@@ -81,24 +77,18 @@ Details are in the paper’s Appendix.
 - Megatron-LM (version core_r0.9.0) for training
 - lm-evaluation-harness for evaluation
 - BigCodeBench for code evaluation
 ## Evaluation
 The model was evaluated using the setup described in the SwallowMath paper, with the lm-evaluation-harness and BigCodeBench. Benchmarks include mathematical reasoning (GSM8K, MATH), code generation (HumanEval), and general tasks (OpenBookQA, TriviaQA, HellaSwag, SQuAD 2.0, XWINO, MMLU, BBH).
 Results are reported for checkpoints at 10B, 20B, 30B, 40B, and 50B tokens.
+### Evaluation Results (Finemath-4+ experiment 1)
 | Tokens (B) | OpenBookQA | TriviaQA | HellaSwag | SQuAD2.0 | XWINO | MMLU | HumanEval | GSM8K | BBH | MATH |
 |------------|------------|----------|-----------|----------|-------|------|-----------|-------|-----|------|
+| 10 | 0.3700 | 0.6626 | 0.5990 | 0.3350 | 0.8985 | 0.6243 | 0.3439 | 0.4685 | 0.6057 | 0.1760 |
+| 20 | 0.3720 | 0.6536 | 0.5963 | 0.3510 | 0.9032 | 0.6261 | 0.3622 | 0.5011 | 0.5896 | 0.2080 |
+| 30 | 0.3700 | 0.6574 | 0.5999 | 0.3506 | 0.8998 | 0.6253 | 0.3561 | 0.5019 | 0.5971 | 0.2260 |
+| 40 | 0.3720 | 0.6577 | 0.6024 | 0.3499 | 0.9049 | 0.6312 | 0.3701 | 0.5231 | 0.6054 | 0.2260 |
+| 50 | 0.3740 | 0.6608 | 0.6001 | 0.3550 | 0.9058 | 0.6329 | 0.3561 | 0.5292 | 0.6166 | 0.2400 |
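To read the trend in the results table more easily, the sketch below computes the score change between the 10B and 50B checkpoints for each benchmark. The scores are copied from the added table rows. The helper `score_deltas` and the list names are hypothetical and are not part of lm-evaluation-harness or BigCodeBench.

```python
# Scores copied from the 10B and 50B rows of the evaluation table.
BENCHMARKS = ["OpenBookQA", "TriviaQA", "HellaSwag", "SQuAD2.0", "XWINO",
              "MMLU", "HumanEval", "GSM8K", "BBH", "MATH"]
SCORES_10B = [0.3700, 0.6626, 0.5990, 0.3350, 0.8985,
              0.6243, 0.3439, 0.4685, 0.6057, 0.1760]
SCORES_50B = [0.3740, 0.6608, 0.6001, 0.3550, 0.9058,
              0.6329, 0.3561, 0.5292, 0.6166, 0.2400]


def score_deltas(names, start, end):
    """Map each benchmark name to (end score - start score), rounded to 4 places."""
    return {n: round(e - s, 4) for n, s, e in zip(names, start, end)}


deltas = score_deltas(BENCHMARKS, SCORES_10B, SCORES_50B)

# Print benchmarks from largest gain to largest drop.
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<11} {delta:+.4f}")
```

On these numbers the two math benchmarks move the most between 10B and 50B tokens (MATH +0.0640, GSM8K +0.0607), while TriviaQA is the only benchmark that ends slightly lower (-0.0018).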
 ## Citation
 ```bibtex
 @misc{fujii2025rewritingpretrainingdata,
 title={Rewriting Pre-Training Data: Boosting LLM Performance in Math and Code},