lexikovs committed
Commit 29f59ed · verified · 1 Parent(s): 8193008

Add XaaS CPT model card

Files changed (1)
  1. README.md +104 -0
README.md ADDED
@@ -0,0 +1,104 @@
+ ---
+ language:
+ - ko
+ - en
+ license: gemma
+ base_model: google/gemma-2-2b-it
+ tags:
+ - gemma2
+ - korean
+ - trade
+ - continual-pretraining
+ - lora
+ - peft
+ library_name: transformers
+ pipeline_tag: text-generation
+ ---
+
+ # XaaS Gemma 2 2B - Stage 1: Continual Pre-Training (CPT)
+
+ **Stage 1 of 4** in the XaaS fine-tuning pipeline for Korean international trade.
+
+ This model adapts `google/gemma-2-2b-it` to the Korean trade domain through continual pre-training on a curated corpus of Korean customs, HS code classification, Incoterms, and international trade regulatory text. It serves as the foundation for all downstream XaaS task-specific fine-tunes.
+
+ ## Pipeline Position
+
+ ```
+ google/gemma-2-2b-it
+     ↓  [this model]
+ lablup/gemma-2-2b-it-xaas-cpt      ← you are here
+     ↓
+ lablup/gemma-2-2b-it-xaas-qa       (trade domain QA)
+     ↓
+ lablup/gemma-2-2b-it-xaas-kie      (KIE from B2B emails)
+ lablup/gemma-2-2b-it-xaas-sum-tag  (email summarization + tagging)
+ ```
+
+ ## Training Details
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Base model | `google/gemma-2-2b-it` |
+ | Method | Continual pre-training with LoRA |
+ | LoRA rank (r) | 16 |
+ | LoRA alpha | 32 |
+ | LoRA dropout | 0.05 |
+ | Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
+ | Epochs | 1 |
+ | Learning rate | 4e-4 |
+ | Max sequence length | 2,500 tokens |
+ | Optimizer | AdamW |
+ | Precision | float32 |
+ | Framework | HuggingFace Transformers + PEFT + Accelerate |
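
The LoRA rows in the table above can be sketched as a configuration. This is a minimal sketch, not the authors' training script: `LORA_SETTINGS`, `lora_scaling`, and `build_lora_config` are hypothetical names, and mapping the table rows onto `peft.LoraConfig` arguments (including `task_type="CAUSAL_LM"`) is an assumption about how the run was configured.

```python
# Hypothetical reconstruction of the LoRA settings listed in the table.
# The key names follow peft.LoraConfig; the card itself does not show a config.
LORA_SETTINGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",  # assumption: causal LM objective for CPT
}


def lora_scaling(settings: dict) -> float:
    """Standard LoRA scales the low-rank update by lora_alpha / r."""
    return settings["lora_alpha"] / settings["r"]


def build_lora_config(settings: dict):
    """Return a peft.LoraConfig if PEFT is installed, otherwise None."""
    try:
        from peft import LoraConfig
    except ImportError:
        return None
    return LoraConfig(**settings)


if __name__ == "__main__":
    print(lora_scaling(LORA_SETTINGS))  # 32 / 16 -> 2.0
```

With rank 16 and alpha 32, the adapter update is scaled by a factor of 2 under the standard (non rank-stabilized) LoRA formulation.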
+
+ ## Training Data
+
+ Internal Korean trade-domain text corpus (`XaaS/train_dataset/cpt_dataset/concatenated_dataset`) covering:
+ - Korean Customs Act (관세법) and trade regulations
+ - HS code classification explanatory notes (관세율표 해설서)
+ - Incoterms and international trade terminology
+ - Trade finance and letter-of-credit documentation
+
+ ## How to Use
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch
+
+ model_id = "lablup/gemma-2-2b-it-xaas-cpt"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+
+ # Gemma 2 chat format
+ # Prompt (Korean): "Please explain the procedure for opening a letter of credit (L/C)."
+ messages = [{"role": "user", "content": "신용장(L/C)의 개설 절차를 설명해주세요."}]
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ # The rendered template already begins with <bos>, so do not add special tokens again
+ inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
+
+ # Greedy decoding; gradients are not needed for inference
+ with torch.no_grad():
+     outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
+
+ # Decode only the newly generated tokens, not the prompt
+ print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
+ ```
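
For readers who want to see roughly what `apply_chat_template` renders, the sketch below builds a single-turn Gemma-style prompt by hand. It assumes this model keeps the base model's chat template unchanged, which the card does not state; `build_gemma_prompt` is a hypothetical helper, and the tokenizer's own template remains the source of truth.

```python
def build_gemma_prompt(user_message: str, bos_token: str = "<bos>") -> str:
    """Format one user turn and open the model turn, Gemma-style.

    Assumption: mirrors the base Gemma 2 instruction template
    (<start_of_turn>/<end_of_turn> markers, content whitespace trimmed).
    """
    return (
        f"{bos_token}"
        f"<start_of_turn>user\n{user_message.strip()}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


if __name__ == "__main__":
    print(build_gemma_prompt("What is an HS code?"))
```

Because the rendered string already starts with `<bos>`, tokenizing it with `add_special_tokens=False` avoids a duplicated beginning-of-sequence token.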
+
+ ## Downstream Models
+
+ | Model | Task |
+ |-------|------|
+ | [lablup/gemma-2-2b-it-xaas-qa](https://huggingface.co/lablup/gemma-2-2b-it-xaas-qa) | Korean trade QA (21,399 QA pairs) |
+ | [lablup/gemma-2-2b-it-xaas-kie](https://huggingface.co/lablup/gemma-2-2b-it-xaas-kie) | B2B email key-information extraction |
+ | [lablup/gemma-2-2b-it-xaas-sum-tag](https://huggingface.co/lablup/gemma-2-2b-it-xaas-sum-tag) | Email summarization + tagging |
+
+ ## Limitations
+
+ - Adapted to the Korean trade domain; general-purpose performance may be degraded compared to the base `google/gemma-2-2b-it`
+ - Knowledge cutoff is inherited from `google/gemma-2-2b-it`; recent regulatory changes are not covered
+ - The CPT corpus is domain-specific and does not cover all Korean-language use cases
+
+ ## License
+
+ This model is built on [Google Gemma 2](https://huggingface.co/google/gemma-2-2b-it) and is subject to the [Gemma Terms of Use](https://ai.google.dev/gemma/terms). Fine-tuned weights are released under the same terms.