How to use from the Transformers library
# Gated model: log in with a Hugging Face token that has gated-access permission (run in a shell)
hf auth login
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="devngho/llama-ablation-large-korean-corpus-jamo")
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("devngho/llama-ablation-large-korean-corpus-jamo")
model = AutoModelForCausalLM.from_pretrained("devngho/llama-ablation-large-korean-corpus-jamo")
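To go from loading the model to producing text, a minimal sketch might look like the following. It assumes the tokenizer applies the NFKD jamo decomposition itself and returns an attention mask, neither of which this card confirms. `generate` and `recompose` are hypothetical helper names. `max_new_tokens=256` matches the value listed for the examples on this card, and the final NFC step only recombines decomposed jamo into syllables for display.

```python
import unicodedata

MODEL_ID = "devngho/llama-ablation-large-korean-corpus-jamo"


def recompose(text: str) -> str:
    """Recombine decomposed jamo into precomposed Hangul syllables (NFC)."""
    return unicodedata.normalize("NFC", text)


def generate(prompt: str, max_new_tokens: int = 256) -> str:
    """Generate a continuation of `prompt` and return it as readable Hangul."""
    # Imported lazily so recompose() works even without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    # Assumption: the tokenizer handles jamo decomposition internally.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
    )
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return recompose(text)
```

Calling `generate("인공지능은")` downloads the gated weights, so it requires the login step shown above.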

You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.


devngho/llama-ablation-large-korean-corpus-jamo

A model pretrained with the Llama architecture. It was trained on about 20.7B tokens for about 2.7 epochs, using MaxText.

Checkpoints are provided every 500 steps.

This research was supported with Cloud TPUs from Google's TPU Research Cloud (TRC). ⚡

This model uses the devngho/jamo-tokenizer-exp1 tokenizer, which tokenizes text after decomposing Hangul syllables into jamo (NFKD normalization).

Examples

The input prompt for each example is given in parentheses.

  • max_new_tokens: 256

Example 1 (prompt: 인공지능은, "Artificial intelligence")
<s> 인공지능은 '인공지능'과 '인공지능'의 결합으로 탄생했다. 인공지능은 '인공지능'과 '인공지능'의 결합으로 탄생했다. [the same sentence repeats through the end of the output]
Translation: "Artificial intelligence was born from the combination of 'artificial intelligence' and 'artificial intelligence'."

Example 2 (prompt: 한글의 특징은, "The characteristic of Hangul is")
<s> 한글의 특징은 '한글'이다. 한글의 '한글'은 한글의 '한글'과 한글의 '한글'을 합친 말이다. [the second sentence repeats until the end token]</s>
Translation: "The characteristic of Hangul is 'Hangul'. The 'Hangul' of Hangul is a word that combines the 'Hangul' of Hangul and the 'Hangul' of Hangul."

Example 3 (prompt: 커피는, "Coffee")
<s> 커피는 '커피'라는 단어에서 '커피'라는 단어가 '커피'라는 단어로 바뀌어 쓰이고 있다. [the same sentence repeats until the end token]</s>
Translation: "As for coffee, from the word 'coffee', the word 'coffee' has changed into the word 'coffee' and is in use."

The outputs contain considerable hallucination, awkwardness, and repetition.

상세

  • μ œμž‘: devngho
  • μ–Έμ–΄: ko
  • λΌμ΄μ„ μŠ€: mit

Training details

  • learning_rate: 6e-4 (cosine, initial/end 6e-5)
  • warmup_ratio: 0.05
  • batch_size: 1024(fsdp 16 * per device 8 * ga 8)
  • optimizer: adamw(b1=0.9, b2=0.95, eps=1e-5, weight_decay=0.01)
  • duration: about 27h 50m
  • steps: 10000
  • The full configuration and results are available on wandb.

Training devices

TPU v4-32

Training datasets

Data from AI Hub and Modu Corpus, deduplicated and length-filtered (about 16,056,320 rows).

The dataset cannot be released because of the AI Hub and Modu Corpus terms, but if you obtain the raw data, you can preprocess it identically by following devngho/dataset-preprocess.

Software

jax==0.4.35

devngho/MaxText, a fork of MaxText

Benchmark results are provided below.

devngho/llama-ablation-large-korean-corpus-jamo

A model pretrained with the Llama architecture. It was trained on about 20.7B tokens (approximately 34.5 epochs) using MaxText.

Checkpoints are available every 500 steps.

This research was supported with Cloud TPUs from Google's TPU Research Cloud (TRC). ⚡

This model uses the devngho/jamo-tokenizer-exp1 tokenizer, which tokenizes inputs after decomposing Hangul syllables into jamo (NFKD normalization).
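To make the jamo decomposition concrete, the sketch below applies NFKD normalization with Python's standard `unicodedata` module and lists the resulting code points. It illustrates the normalization step only. `to_jamo` is a hypothetical helper name, and how devngho/jamo-tokenizer-exp1 applies the step internally is not shown on this card.

```python
import unicodedata


def to_jamo(text: str) -> str:
    """Decompose precomposed Hangul syllables into conjoining jamo (NFKD)."""
    return unicodedata.normalize("NFKD", text)


word = "한글"
jamo = to_jamo(word)

# Each syllable here has an initial consonant, a vowel, and a final consonant,
# so 2 syllables become 6 jamo code points.
print(len(word), len(jamo))

for ch in jamo:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Because every syllable expands into two or three code points, the text the tokenizer sees is noticeably longer than the precomposed input.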

Details

  • Made by: devngho
  • Language: ko
  • License: mit

Training details

  • learning_rate: 6e-4 (cosine, initial/end 6e-5)
  • warmup_ratio: 0.05
  • batch_size: 1024(fsdp 16 * per device 8 * ga 8)
  • optimizer: adamw(b1=0.9, b2=0.95, eps=1e-5, weight_decay=0.01)
  • duration: about 27h 50m
  • steps: 10000
  • All configs and training results are available on wandb.
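One plausible reading of the learning-rate settings above is a linear warmup from 6e-5 to 6e-4 over the first 5% of steps, followed by cosine decay back to 6e-5 at step 10000. The sketch below implements that reading in plain Python. The exact schedule shape is an assumption, not taken from the devngho/MaxText code, and `learning_rate` is a hypothetical function name.

```python
import math

PEAK_LR = 6e-4       # learning_rate
FLOOR_LR = 6e-5      # initial/end value
TOTAL_STEPS = 10_000
WARMUP_STEPS = round(0.05 * TOTAL_STEPS)  # warmup_ratio 0.05 -> 500 steps


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to FLOOR_LR (assumed shape)."""
    if step < WARMUP_STEPS:
        return FLOOR_LR + (PEAK_LR - FLOOR_LR) * step / WARMUP_STEPS
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FLOOR_LR + (PEAK_LR - FLOOR_LR) * cosine


for step in (0, 250, 500, 5_000, 10_000):
    print(step, f"{learning_rate(step):.2e}")
```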

Training devices

TPU v4-32
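The batch-size breakdown and the token count reported above can be cross-checked with a little arithmetic. The sketch below uses only figures from this card. The note that the 16 FSDP shards correspond to the 16 chips of a v4-32 slice is an assumption, and the derived per-sequence and throughput numbers are rough because the token count and duration are themselves approximate.

```python
FSDP_SHARDS = 16          # assumed to map to the 16 chips of a TPU v4-32 slice
PER_DEVICE_BATCH = 8
GRAD_ACCUM_STEPS = 8
STEPS = 10_000
TOTAL_TOKENS = 20.7e9     # "about 20.7B tokens"
DURATION_S = 27 * 3600 + 50 * 60  # "about 27h 50m"

global_batch = FSDP_SHARDS * PER_DEVICE_BATCH * GRAD_ACCUM_STEPS  # 1024
sequences_seen = global_batch * STEPS                             # 10,240,000
tokens_per_sequence = TOTAL_TOKENS / sequences_seen
tokens_per_second = TOTAL_TOKENS / DURATION_S

print(f"global batch size:    {global_batch}")
print(f"sequences processed:  {sequences_seen:,}")
print(f"tokens per sequence:  {tokens_per_sequence:,.0f}")
print(f"tokens per second:    {tokens_per_second:,.0f}")
```

The result is roughly two thousand tokens per sequence and about 2 × 10^5 tokens per second across the slice.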

Training datasets

I applied deduplication and length filtering to a corpus from AI Hub and Modu Corpus (16,056,320 rows).

I cannot make the training dataset public because of the terms of AI Hub and Modu Corpus. If you obtain the raw data, you can reproduce the preprocessing used for this model with devngho/dataset-preprocess.
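For readers unfamiliar with the two preprocessing steps, the following is a generic sketch of exact deduplication combined with length filtering. It is not the pipeline in devngho/dataset-preprocess. The thresholds `MIN_CHARS` and `MAX_CHARS` and the function name `dedup_and_filter` are hypothetical, and exact-match hashing stands in for whatever deduplication method the author used.

```python
import hashlib
from typing import Iterable, Iterator

MIN_CHARS = 32        # hypothetical lower bound on row length
MAX_CHARS = 100_000   # hypothetical upper bound on row length


def dedup_and_filter(
    rows: Iterable[str],
    min_chars: int = MIN_CHARS,
    max_chars: int = MAX_CHARS,
) -> Iterator[str]:
    """Yield rows within the length bounds, skipping exact duplicates."""
    seen = set()
    for row in rows:
        text = row.strip()
        # Length filtering: drop rows that are too short or too long.
        if not (min_chars <= len(text) <= max_chars):
            continue
        # Exact deduplication: keep only the first copy of each text.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text


sample = ["가" * 40, "가" * 40, "짧은 문장", "나" * 40]
kept = list(dedup_and_filter(sample))
print(len(kept))
```

Storing digests instead of full strings keeps memory use bounded by the number of unique rows rather than their total length.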

Software

jax==0.4.35

devngho/MaxText, a fork of MaxText

[Figure: three benchmark graphs]

Model size: 2B params (Safetensors)
Tensor type: BF16