| --- |
| license: mit |
| --- |
| |
| ### SuperHOT Prototype 2 w/ 16K Context |
|
|
| This is a second prototype of SuperHOT, this time with 16K context and no RLHF. It uses the same technique described in [the GitHub blog](https://kaiokendev.github.io/til#extending-context-to-8k). |
| Tests have shown that the model does indeed leverage the extended context at 8K, so naturally, let's try going even further. |
|
|
| You will need to either **use the monkeypatch** or, if you are already using it, **change the scaling factor to 0.125 and the maximum sequence length to 16384**. |
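To make the two numbers above concrete, here is a minimal sketch of what a position scaling factor does to rotary angles. It is not the monkeypatch itself: the function names, the rotary base of 10000, and the base context length of 2048 are assumptions for illustration, and real implementations may pair dimensions differently.

```python
import math

BASE_CONTEXT = 2048   # assumption: context length of the unpatched base model
MAX_SEQ_LEN = 16384   # from the text above
SCALE = 0.125         # from the text above; equals BASE_CONTEXT / MAX_SEQ_LEN


def scaled_rotary_angles(position, head_dim, scale=SCALE, base=10000.0):
    """Return the rotary angles for one token position after position scaling."""
    # Standard rotary inverse frequencies: base^(-2i/d) for i in [0, d/2).
    inv_freq = [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]
    # The scaling step: shrink the position index before computing the angles,
    # so that positions up to MAX_SEQ_LEN land inside [0, BASE_CONTEXT).
    scaled_position = position * scale
    return [scaled_position * f for f in inv_freq]


def apply_rotary(vec, angles):
    """Rotate adjacent pairs (x0, x1), (x2, x3), ... by the given angles."""
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out
```

With these values, the last position of a 16384-token sequence is treated like position 2047.875, which stays inside the assumed 2048-token range.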
|
|
| I trained the LoRA with the following configuration: |
| - 1200 samples (~400 of them longer than a sequence length of 2048) |
| - learning rate of 3e-4 |
| - 3 epochs |
| - The exported modules are: |
|   - q_proj |
|   - k_proj |
|   - v_proj |
|   - o_proj |
| - no bias |
| - Rank = 4 |
| - Alpha = 8 |
| - no dropout |
| - weight decay of 0.1 |
| - AdamW with beta1 of 0.9, beta2 of 0.99, and epsilon of 1e-5 |
| - Trained on a 4-bit base model |
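
For reference, the configuration above can be collected into one structure. This is only a sketch: the class and field names are illustrative and do not belong to any library, and the 5120 hidden size used in the usage note is a hypothetical value that the list above does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """Hyperparameters copied from the list above; field names are illustrative."""
    target_modules: tuple = ("q_proj", "k_proj", "v_proj", "o_proj")
    rank: int = 4
    alpha: int = 8
    dropout: float = 0.0
    bias: str = "none"
    learning_rate: float = 3e-4
    epochs: int = 3
    weight_decay: float = 0.1
    adam_beta1: float = 0.9
    adam_beta2: float = 0.99
    adam_epsilon: float = 1e-5

    @property
    def scaling(self) -> float:
        # LoRA multiplies the low-rank update by alpha / rank.
        return self.alpha / self.rank


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one adapted matrix: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


settings = LoraSettings()
```

Here `settings.scaling` is 2.0, and for a hypothetical 5120 x 5120 projection, `lora_param_count(5120, 5120, settings.rank)` gives 40960 trainable parameters per adapted module.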