---
language:
- ko
- en
license: gemma
base_model: google/gemma-2-2b-it
tags:
- gemma2
- korean
- trade
- continual-pretraining
- lora
- peft
library_name: transformers
pipeline_tag: text-generation
---

# XaaS Gemma 2 2B, Stage 1: Continual Pre-Training (CPT)

**Stage 1 of 4** in the XaaS fine-tuning pipeline for Korean international trade.

This model adapts `google/gemma-2-2b-it` to the Korean trade domain through continual pre-training on a curated corpus of Korean customs law, HS code classification notes, Incoterms, and international trade regulatory text. It serves as the foundation for all downstream XaaS task-specific fine-tunes.

## Pipeline Position

```
google/gemma-2-2b-it
        ↓  [this model]
lablup/gemma-2-2b-it-xaas-cpt   ← you are here
        ↓
lablup/gemma-2-2b-it-xaas-qa       (trade domain QA)
        ↓
lablup/gemma-2-2b-it-xaas-kie      (KIE from B2B emails)
lablup/gemma-2-2b-it-xaas-sum-tag  (email summarization + tagging)
```

## Training Details

| Parameter | Value |
|-----------|-------|
| Base model | `google/gemma-2-2b-it` |
| Method | Continual pre-training with LoRA |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Epochs | 1 |
| Learning rate | 4e-4 |
| Max sequence length | 2,500 tokens |
| Optimizer | AdamW |
| Precision | float32 |
| Framework | HuggingFace Transformers + PEFT + Accelerate |

## Training Data

Internal Korean trade-domain text corpus (`XaaS/train_dataset/cpt_dataset/concatenated_dataset`) covering:

- Korean Customs Act (관세법) and trade regulations
- HS code classification explanatory notes (관세율표 해설서)
- Incoterms and international trade terminology
- Trade finance and letter-of-credit documentation

## How to Use

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "lablup/gemma-2-2b-it-xaas-cpt"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Gemma 2 chat format.
# The prompt asks: "Please explain the procedure for opening a letter of credit (L/C)."
messages = [{"role": "user", "content": "신용장(L/C)의 개설 절차를 설명해주세요."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# The chat template already prepends <bos>, so skip the tokenizer's own
# special tokens to avoid a duplicated BOS token.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

# Greedy decoding; no gradients are needed for inference.
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens, not the echoed prompt.
prompt_length = inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))
```

## Downstream Models

| Model | Task |
|-------|------|
| [lablup/gemma-2-2b-it-xaas-qa](https://huggingface.co/lablup/gemma-2-2b-it-xaas-qa) | Korean trade QA (21,399 QA pairs) |
| [lablup/gemma-2-2b-it-xaas-kie](https://huggingface.co/lablup/gemma-2-2b-it-xaas-kie) | B2B email key-information extraction |
| [lablup/gemma-2-2b-it-xaas-sum-tag](https://huggingface.co/lablup/gemma-2-2b-it-xaas-sum-tag) | Email summarization + tagging |

## Limitations

- Adapted to the Korean trade domain; general-purpose performance may be degraded compared to the base Gemma 2 model
- General knowledge cutoff is inherited from `google/gemma-2-2b-it`; regulatory changes that postdate the training corpus are not covered
- The CPT corpus is domain-specific and does not cover all Korean language use cases

## License

This model is built on [Google Gemma 2](https://huggingface.co/google/gemma-2-2b-it) and is subject to the [Gemma Terms of Use](https://ai.google.dev/gemma/terms). Fine-tuned weights are released under the same terms.
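
## LoRA Adapter Size (Illustrative)

The LoRA settings in the Training Details table (rank 16, alpha 32, seven target modules) determine how many parameters were trained and how strongly the adapter update is scaled. The sketch below works through that arithmetic in plain Python. The rank, alpha, and module names come from the table; the layer dimensions in `ModelDims` are assumptions based on the published Gemma 2 2B configuration and are not stated in this card, and the helper names are hypothetical. It is a rough estimate, not a record of the actual training run.

```python
from dataclasses import dataclass

# LoRA hyperparameters taken from the Training Details table.
LORA_RANK = 16
LORA_ALPHA = 32


@dataclass(frozen=True)
class ModelDims:
    """Assumed Gemma 2 2B dimensions (not stated in this model card)."""
    hidden_size: int = 2304
    intermediate_size: int = 9216
    num_layers: int = 26
    num_attention_heads: int = 8
    num_key_value_heads: int = 4
    head_dim: int = 256


def projection_shapes(d: ModelDims) -> dict:
    """Return (in_features, out_features) for each targeted linear layer."""
    q_out = d.num_attention_heads * d.head_dim
    kv_out = d.num_key_value_heads * d.head_dim
    return {
        "q_proj": (d.hidden_size, q_out),
        "k_proj": (d.hidden_size, kv_out),
        "v_proj": (d.hidden_size, kv_out),
        "o_proj": (q_out, d.hidden_size),
        "gate_proj": (d.hidden_size, d.intermediate_size),
        "up_proj": (d.hidden_size, d.intermediate_size),
        "down_proj": (d.intermediate_size, d.hidden_size),
    }


def lora_params_per_layer(d: ModelDims, rank: int) -> int:
    """Each adapted layer adds A (rank x in) and B (out x rank) matrices."""
    return sum(rank * (n_in + n_out) for n_in, n_out in projection_shapes(d).values())


def lora_scaling(alpha: int, rank: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update B @ A."""
    return alpha / rank


dims = ModelDims()
per_layer = lora_params_per_layer(dims, LORA_RANK)
total = per_layer * dims.num_layers

print(f"LoRA scaling (alpha / r): {lora_scaling(LORA_ALPHA, LORA_RANK)}")
print(f"Trainable LoRA parameters per layer: {per_layer:,}")
print(f"Trainable LoRA parameters in total:  {total:,}")
```

Under these assumed dimensions the adapter is a small fraction of the roughly 2B base parameters, which is the usual reason for choosing LoRA over full-parameter continual pre-training on a model of this size.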