jaygala24 committed on
Commit
9bc0960
verified
1 Parent(s): cacae1e

Add pass@k evaluation results

Files changed (1)
  1. README.md +12 -0
README.md CHANGED
@@ -65,6 +65,18 @@ ReMax uses a greedy-decoded response's reward as the baseline for advantages.
  | Precision | `bf16` |
  | DeepSpeed | ZeRO Stage 3 |
 
+ ## Evaluation Results
+
+ Pass@k on math reasoning benchmarks (N=32 samples per problem, temperature=1.0):
+
+ | Dataset | pass@1 | pass@2 | pass@4 | pass@8 | pass@16 | pass@32 |
+ | --- | ---: | ---: | ---: | ---: | ---: | ---: |
+ | GSM8K (test) | 85.99 | 90.50 | 93.34 | 95.29 | 96.64 | 97.50 |
+ | MATH-500 | 67.36 | 74.99 | 81.23 | 85.92 | 89.09 | 91.20 |
+ | **Overall** | **80.87** | **86.24** | **90.01** | **92.71** | **94.56** | **95.77** |
+
+ *GSM8K test: 1319 problems · MATH-500: 500 problems · Overall: 1819 problems (overall weighted by problem count).*
+
  ## Training Curves
 
  ![Training Metrics](training_metrics.png)
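
The commit does not say how pass@k was derived from the 32 samples per problem. The sketch below assumes the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k) for n samples of which c are correct, averaged over problems. It also shows the problem-count weighting that the footnote describes for the Overall row. The function names `pass_at_k` and `weighted_overall` are hypothetical and are not taken from this repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct.

    Assumption: this is the standard estimator 1 - C(n-c, k) / C(n, k);
    the commit does not confirm which estimator was actually used.
    """
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def weighted_overall(scores: list[float], counts: list[int]) -> float:
    """Mean of per-dataset scores, weighted by the number of problems."""
    total = sum(counts)
    return sum(s * n for s, n in zip(scores, counts)) / total


# Per-dataset rows copied from the table in the diff (pass@1 ... pass@32).
gsm8k = [85.99, 90.50, 93.34, 95.29, 96.64, 97.50]
math500 = [67.36, 74.99, 81.23, 85.92, 89.09, 91.20]
counts = [1319, 500]  # GSM8K test, MATH-500

overall = [round(weighted_overall([g, m], counts), 2) for g, m in zip(gsm8k, math500)]
print(overall)  # [80.87, 86.24, 90.01, 92.71, 94.56, 95.77]
```

Recomputing the weighted mean from the two dataset rows reproduces the Overall row as reported, which is consistent with the footnote's statement that the overall figure is weighted by problem count.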