Update README.md

Files changed (1)

README.md CHANGED

@@ -60,6 +60,29 @@ The architecture closely follows the efficient‑small‑LM blueprint popularise
 Total trainable parameters: **≈48 M** (with weight tying).
+### Benchmark Evaluation Metrics
+| Category | Benchmark | Metric | Score / Value | Status |
+| :--- | :--- | :--- | :---: | :---: |
+| **Linguistics & Grammar** | BLiMP | Accuracy | 68.12% | Success |
+| **Commonsense & Reasoning** | PIQA | Normalized Accuracy | 57.83% | Success |
+| | COPA | Accuracy | 57.00% | Success |
+| | BoolQ | Accuracy | 52.17% | Success |
+| | WinoGrande | Accuracy | 47.36% | Success |
+| | HellaSwag | Normalized Accuracy | 28.49% | Success |
+| | RACE | Accuracy | 26.41% | Success |
+| | CommonsenseQA | Accuracy | 20.31% | Success |
+| **Academic & Knowledge** | SciQ | Normalized Accuracy | 49.00% | Success |
+| | ARC-Easy | Normalized Accuracy | 36.49% | Success |
+| | MMLU | Accuracy | 25.64% | Success |
+| | ARC-Challenge | Normalized Accuracy | 25.17% | Success |
+| | OpenBookQA | Normalized Accuracy | 25.40% | Success |
+| **Language Modeling** | LAMBADA | Accuracy | 15.87% | Success |
+| | WikiText-2 | Word Perplexity | 251.76 | Success |
+
+*Note: Two further tasks failed to run and are not reported above. The Arithmetic benchmark failed because its script (`arithmetic.py`) is outdated and no longer supported, and SocialIQA failed because of a task registration tag error (`siqa`). The other 15 tasks completed successfully.*
 ## Uses
 ### Direct Use