jaygala24 committed on
Commit
ebd5a33
verified
1 Parent(s): 7f1578f

Add pass@k evaluation results

Files changed (1)
  1. README.md +12 -0
README.md CHANGED
@@ -59,6 +59,18 @@ Trained with [PipelineRL](https://github.com/ServiceNow/PipelineRL).
  | Precision | `bf16` |
  | DeepSpeed | ZeRO Stage 3 |
 
+ ## Evaluation Results
+
+ Pass@k on math reasoning benchmarks (N=32 samples per problem, temperature=1.0):
+
+ | Dataset | pass@1 | pass@2 | pass@4 | pass@8 | pass@16 | pass@32 |
+ | --- | ---: | ---: | ---: | ---: | ---: | ---: |
+ | GSM8K (test) | 85.60 | 90.73 | 93.78 | 95.74 | 97.08 | 97.95 |
+ | MATH-500 | 65.11 | 73.19 | 79.51 | 84.60 | 88.55 | 91.40 |
+ | **Overall** | **79.96** | **85.91** | **89.86** | **92.68** | **94.74** | **96.15** |
+
+ *GSM8K test: 1319 problems · MATH-500: 500 problems · Overall: 1819 problems (overall weighted by problem count).*
+
  ## Training Curves
 
  ![Training Metrics](training_metrics.png)
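
For readers checking the evaluation table added in this commit, here is a minimal sketch of how pass@k values and the problem-count-weighted "Overall" row could be computed. The commit does not say which estimator was used; this sketch assumes the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The function names `pass_at_k` and `weighted_overall`, and the example of 8 correct samples, are hypothetical and not taken from the repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: samples drawn for the problem (32 in the README's setup)
    c: how many of those samples were correct
    k: the k in pass@k, with 1 <= k <= n
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate becomes exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def weighted_overall(scores: list[float], counts: list[int]) -> float:
    """Average of per-dataset scores, weighted by problem count."""
    if len(scores) != len(counts):
        raise ValueError("scores and counts must have the same length")
    return sum(s * m for s, m in zip(scores, counts)) / sum(counts)


# Hypothetical problem with 8 correct samples out of 32.
example_pass_at_1 = pass_at_k(n=32, c=8, k=1)
example_pass_at_4 = pass_at_k(n=32, c=8, k=4)

# Overall pass@1 recomputed from the rounded per-dataset values in the table
# (GSM8K test: 1319 problems, MATH-500: 500 problems).
overall_pass_at_1 = weighted_overall([85.60, 65.11], [1319, 500])

print(example_pass_at_1, example_pass_at_4, round(overall_pass_at_1, 2))
```

A dataset-level score is the mean of `pass_at_k` over its problems. Recomputing the overall pass@1 from the rounded table entries gives about 79.97, which agrees with the reported 79.96 to within the rounding of the per-dataset figures.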