training data?

#7
by hololabs - opened

Can you provide the training data? I'd like to reproduce your results, but for a different, domain-specific programming language.

Can I train it on a 24GB M4?

How long does training take?

How many examples?

Heads up on the hardware: 24GB on an M4 is going to be really rough. I'm running a 5090 (32GB) + 96GB RAM, so ~128GB combined, and even that struggles. A lot of my samples get left-truncated because a single multi-turn coding sequence easily hits 10k+ tokens, and you need that context length for the CoT to be any good. On 24GB unified memory you'd be fighting OOM constantly and cutting sequences so short the quality tanks. Honestly, for a clean SFT without all the truncation compromises, you'd want something like 8×H100, so you can run full sequence length and a real batch size instead of fighting the memory wall the whole time.
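
To make the left-truncation point concrete, here is a minimal sketch of what it does to one long sample. The function name `left_truncate` and the 4,096-token budget are assumptions for illustration only, not the settings used in the actual training run.

```python
def left_truncate(token_ids, max_len):
    """Keep only the last max_len tokens, dropping the oldest ones."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return token_ids[-max_len:]


# Hypothetical numbers: a 10,000-token multi-turn sample and an assumed
# 4,096-token context budget (not the real training config).
sample = list(range(10_000))
truncated = left_truncate(sample, 4096)

# The earliest turns are the ones that get cut; the tail is kept.
print(len(truncated), truncated[0], truncated[-1])
```

With a budget that small, more than half of the sample is discarded from the start of the conversation, which is why shorter context limits hurt multi-turn data so much.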

Thank you for your response! So you're suggesting I run the tuning on something like vast.ai, RunPod, etc., or maybe even Google Colab with Unsloth, right?

Can you share how the training examples are set up? Your story mentions fable 'traces'. I understand this to mean the session logs from Claude Code, right? You took those and did what exactly? How much data is needed?

Thank you for the brilliant approach and the hope for the local AI scene.

PS: completely with you on Verizon Wi-Fi being CRAP. Download/upload speed: CRAP. lol

I'm looking forward to the training data too, whenever you can.
