Commit b8e6174
Parent: 8e78ead
Update README.md

README.md CHANGED
@@ -7,8 +7,16 @@ license: mit
 This is a second prototype of SuperHOT, this time with 16K context and no RLHF, using the same technique described in [the github blog](https://kaiokendev.github.io/til#extending-context-to-8k).
 Tests have shown that the model does indeed leverage the extended context at 8K, so naturally, let's try going even further.
 
+#### Looking for Merged & Quantized Models?
+- 13B 16K GGML: [tmpupload/superhot-13b-16k-no-rlhf-test-GGML](https://huggingface.co/tmpupload/superhot-13b-16k-no-rlhf-test-GGML)
+- 13B 16K CUDA (no groupsize): [tmpupload/superhot-13b-16k-no-rlhf-test-GPTQ](https://huggingface.co/tmpupload/superhot-13b-16k-no-rlhf-test-GPTQ)
+
+#### Using the monkey-patch?
 You will need to **use either the monkeypatch** or, if you are already using the monkeypatch, **change the scaling factor to 0.125 and the maximum sequence length to 16384**
 
+#### Using Oobabooga or Exllama?
+- `python server.py --max_seq_len 16384 --compress_pos_emb 8 --loader exllama_hf`
+
 I trained the LoRA with the following configuration:
 - 1200 samples (~400 samples over 2048 sequence length)
 - learning rate of 3e-4
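The scaling factor of 0.125 and the `--compress_pos_emb 8` value quoted in the README appear to express the same ratio from two directions. The following is a minimal sketch of that arithmetic, assuming the base model's original context is 2048 tokens (suggested by the "over 2048 sequence length" note, but not stated outright) and that interpolation simply multiplies each position index by the scaling factor. The function names are hypothetical and are not part of the monkeypatch or of any loader.

```python
# Hypothetical helpers illustrating the numbers in the README; not the monkeypatch itself.

def scaling_factor(base_ctx: int, target_ctx: int) -> float:
    """Fraction by which position indices are multiplied (assumed: base / target)."""
    return base_ctx / target_ctx


def compress_factor(base_ctx: int, target_ctx: int) -> int:
    """Reciprocal of the scaling factor, i.e. the value passed as a compression ratio."""
    return target_ctx // base_ctx


def scaled_positions(seq_len: int, scale: float) -> list[float]:
    """Position indices after interpolation: 0, scale, 2*scale, ..."""
    return [i * scale for i in range(seq_len)]


if __name__ == "__main__":
    BASE_CTX = 2048      # assumption: original training context of the base model
    TARGET_CTX = 16384   # maximum sequence length given in the README

    scale = scaling_factor(BASE_CTX, TARGET_CTX)
    print(scale)                                      # 0.125
    print(compress_factor(BASE_CTX, TARGET_CTX))      # 8
    # The largest scaled position stays below the assumed base context.
    print(max(scaled_positions(TARGET_CTX, scale)))   # 2047.875
```

Under these assumptions, all 16384 positions are squeezed into the range the base model saw during pretraining, which is why the two settings must be changed together.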