kaiokendev / superhot-13b-16k-no-rlhf-test

	---
	license: mit
	---

	### SuperHOT Prototype 2 w/ 16K Context

	This is a second prototype of SuperHOT, this time with 16K context and no RLHF. It uses the same technique described in [the GitHub blog](https://kaiokendev.github.io/til#extending-context-to-8k).
	Tests have shown that the model does indeed leverage the extended context at 8K, so naturally, let's try going even further.

	You will need to apply the monkeypatch with the scaling factor set to 0.125 and the maximum sequence length set to 16384. If you are already using the monkeypatch, change those two values.
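To make the two numbers above concrete, here is a minimal sketch of how a position scaling factor compresses a 16384-token sequence. It is not the monkeypatch itself. It assumes the base model was trained with a 2048-token context, so that 0.125 is 2048 / 16384, and that the factor multiplies the position index before the rotary angles are computed. The function names, the head dimension of 128, and the base of 10000 are illustrative choices, not taken from this card.

```python
ORIGINAL_CONTEXT = 2048    # assumption: context length the base model was trained with
EXTENDED_CONTEXT = 16384   # maximum sequence length stated in this card
SCALE = ORIGINAL_CONTEXT / EXTENDED_CONTEXT  # matches the 0.125 scaling factor


def scaled_positions(seq_len, scale):
    """Map integer token positions to scaled (fractional) positions."""
    return [position * scale for position in range(seq_len)]


def rotary_angles(position, dim=128, base=10000.0, scale=1.0):
    """Rotary angles for one token, computed from the scaled position index.

    dim and base are hypothetical defaults chosen for illustration.
    """
    scaled_position = position * scale
    return [scaled_position * base ** (-2 * i / dim) for i in range(dim // 2)]


positions = scaled_positions(EXTENDED_CONTEXT, SCALE)
# Every scaled position stays inside the range seen during pre-training.
print(SCALE, max(positions) < ORIGINAL_CONTEXT)
```

Under these assumptions, token 16383 is treated as position 2047.875, which is why the same 2048-position range can cover the longer sequence.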

	I trained the LoRA with the following configuration:
	- 1200 samples (~400 samples over 2048 sequence length)
	- learning rate of 3e-4
	- 3 epochs
	- The exported modules are:
	    - q_proj
	    - k_proj
	    - v_proj
	    - o_proj
	- no bias
	- Rank = 4
	- Alpha = 8
	- no dropout
	- weight decay of 0.1
	- AdamW with beta1 of 0.9, beta2 of 0.99, and epsilon of 1e-5
	- Trained on 4-bit base model
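
The configuration above can be collected into a small, library-independent structure. The sketch below is not the training script used for this model. The `LoraSettings` class and both helper functions are hypothetical names, the `alpha / rank` scaling follows the usual LoRA convention, and the parameter count assumes a LLaMA-13B-shaped base model (hidden size 5120, 40 layers, square attention projections), which this card does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """Hyperparameters listed in this card (hypothetical container)."""
    rank: int = 4
    alpha: int = 8
    dropout: float = 0.0
    bias: str = "none"
    target_modules: tuple = ("q_proj", "k_proj", "v_proj", "o_proj")
    learning_rate: float = 3e-4
    epochs: int = 3
    weight_decay: float = 0.1
    adam_beta1: float = 0.9
    adam_beta2: float = 0.99
    adam_epsilon: float = 1e-5


def lora_scaling(settings):
    """Factor applied to the low-rank update under the usual alpha / rank convention."""
    return settings.alpha / settings.rank


def adapter_parameter_count(settings, hidden_size=5120, num_layers=40):
    """Trainable LoRA weights, assuming square hidden_size x hidden_size projections."""
    # Each adapted module adds A (rank x in_features) and B (out_features x rank).
    per_module = settings.rank * (hidden_size + hidden_size)
    return per_module * len(settings.target_modules) * num_layers


settings = LoraSettings()
print(lora_scaling(settings), adapter_parameter_count(settings))
```

With rank 4 on four attention projections, the adapter stays small relative to the base model, roughly 6.5 million trainable weights under the assumed dimensions.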