Commit fdf735d
Parent(s): 6f55672
Update README.md

README.md CHANGED
@@ -1,3 +1,27 @@
---
license: mit
---
|
| 4 |
+
|
| 5 |
+
### SuperHOT Prototype 2 w/ 16K Context
|
| 6 |
+
|
| 7 |
+
This is a second prototype of SuperHOT, this time with 16K context and no RLHF, using the same technique described in [the github blog](https://kaiokendev.github.io/til#extending-context-to-8k).
|
| 8 |
+
Tests have shown that the model does indeed leverage the extended context at 8K, so naturally, let's try going even further.
|
| 9 |
+
|
| 10 |
+
You will need to **use either the monkeypatch** or, if you are already using the monkeypatch, **change the scaling factor to 0.125 and the maximum sequence length to 16384**
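A rough sketch of how the two settings relate, assuming the monkeypatch works by scaling position indices before the rotary angles are computed, as the linked blog post describes. The 2048 base context is an assumption about the base model, and the function name, head dimension, and rotary base of 10000 are illustrative, not taken from the monkeypatch itself.

```python
BASE_CONTEXT = 2048   # assumed original context length of the base model
MAX_SEQ_LEN = 16384   # maximum sequence length from the text above
SCALE = BASE_CONTEXT / MAX_SEQ_LEN  # 0.125, the scaling factor from the text above


def rotary_angles(position, head_dim=128, base=10000.0, scale=SCALE):
    """Return the rotary angles for one position, with the index scaled first."""
    scaled_position = position * scale
    # One angle per pair of dimensions, using the usual inverse-frequency schedule.
    return [
        scaled_position * base ** (-2.0 * i / head_dim)
        for i in range(head_dim // 2)
    ]


if __name__ == "__main__":
    last_position = MAX_SEQ_LEN - 1
    # The last of the 16384 positions still maps inside the assumed 2048 range.
    print(SCALE, last_position * SCALE)
```

With these numbers, every position from 0 to 16383 lands on a scaled index below 2048, which is why the scaling factor and the maximum sequence length have to be changed together.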

I trained the LoRA with the following configuration:
- 1200 samples (~400 samples over 2048 sequence length)
- learning rate of 3e-4
- 3 epochs
- The exported modules are:
    - q_proj
    - k_proj
    - v_proj
    - o_proj
- no bias
- Rank = 4
- Alpha = 8
- no dropout
- weight decay of 0.1
- AdamW beta1 of 0.9 and beta2 of 0.99, epsilon of 1e-5
- Trained on 4-bit base model
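For reuse, the configuration above can be collected in one place. This is a plain-Python sketch, not the training script that was actually used: `LoraSettings` and its field names are hypothetical. The only derived value is the standard LoRA scaling factor, alpha divided by rank.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:  # hypothetical container, not a library class
    rank: int = 4
    alpha: int = 8
    dropout: float = 0.0  # "no dropout"
    bias: str = "none"    # "no bias"
    target_modules: tuple = ("q_proj", "k_proj", "v_proj", "o_proj")
    learning_rate: float = 3e-4
    epochs: int = 3
    weight_decay: float = 0.1
    adam_beta1: float = 0.9
    adam_beta2: float = 0.99
    adam_epsilon: float = 1e-5

    @property
    def scaling(self) -> float:
        """Multiplier applied to the low-rank update in standard LoRA."""
        return self.alpha / self.rank


settings = LoraSettings()

if __name__ == "__main__":
    print(settings.scaling)
```

With Rank = 4 and Alpha = 8, the low-rank update is scaled by 2.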