---
language:
  - ko
  - en
license: gemma
base_model: google/gemma-2-2b-it
tags:
  - gemma2
  - korean
  - trade
  - continual-pretraining
  - lora
  - peft
library_name: transformers
pipeline_tag: text-generation
---

# XaaS Gemma 2 2B — Stage 1: Continual Pre-Training (CPT)

**Stage 1 of 4** in the XaaS fine-tuning pipeline for Korean international trade.

This model adapts `google/gemma-2-2b-it` to the Korean trade domain through continual pre-training on a curated corpus of Korean customs, HS code classification, Incoterms, and international trade regulatory text. It serves as the foundation for all downstream XaaS task-specific fine-tunes.

## Pipeline Position

```
google/gemma-2-2b-it
    ↓  [this model]
lablup/gemma-2-2b-it-xaas-cpt  ← you are here
    ↓
lablup/gemma-2-2b-it-xaas-qa   (trade domain QA)
    ↓
lablup/gemma-2-2b-it-xaas-kie  (KIE from B2B emails)
lablup/gemma-2-2b-it-xaas-sum-tag  (email summarization + tagging)
```

## Training Details

| Parameter | Value |
|-----------|-------|
| Base model | `google/gemma-2-2b-it` |
| Method | Continual pre-training with LoRA |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Epochs | 1 |
| Learning rate | 4e-4 |
| Max sequence length | 2,500 tokens |
| Optimizer | AdamW |
| Precision | float32 |
| Framework | HuggingFace Transformers + PEFT + Accelerate |
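
The LoRA rows above map fairly directly onto a PEFT `LoraConfig`. The following is a minimal sketch, not the project's actual training script: the names `LORA_KWARGS`, `LORA_SCALING`, and `build_lora_config` are hypothetical, `task_type="CAUSAL_LM"` is an assumption not stated in the table, and every option not listed in the table is left at its library default.

```python
# Sketch only: mirrors the LoRA rows of the Training Details table.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    "task_type": "CAUSAL_LM",  # assumption: not stated in the table
}

# With PEFT's default (non-rsLoRA) scaling, the low-rank update is
# multiplied by lora_alpha / r.
LORA_SCALING = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]


def build_lora_config():
    """Build a peft.LoraConfig from the values above (requires `peft`)."""
    # Imported lazily so the plain dict stays usable without peft installed.
    from peft import LoraConfig

    return LoraConfig(**LORA_KWARGS)
```

With rank 16 and alpha 32 the update is scaled by a factor of 2, which is worth keeping in mind when comparing the 4e-4 learning rate against runs that use a different alpha-to-rank ratio.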

## Training Data

Internal Korean trade-domain text corpus (`XaaS/train_dataset/cpt_dataset/concatenated_dataset`) covering:
- Korean Customs Act (관세법) and trade regulations
- HS code classification explanatory notes (관세율표 해설서)
- Incoterms and international trade terminology
- Trade finance and letter-of-credit documentation
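
The dataset name suggests that documents were concatenated before training, although this card does not describe the preprocessing. A common way to prepare a CPT corpus is to join tokenized documents with a separator token and cut the resulting stream into fixed-length blocks. The sketch below illustrates that idea under stated assumptions: `pack_into_blocks`, the separator id, and the choice to drop the trailing partial block are hypothetical and not taken from the XaaS pipeline. Only the 2,500-token block size comes from the Training Details table.

```python
from typing import Iterable, List


def pack_into_blocks(
    docs: Iterable[List[int]], block_size: int = 2500, sep_id: int = 1
) -> List[List[int]]:
    """Concatenate tokenized documents and split them into equal-length blocks.

    `sep_id` is a placeholder separator token id (assumption); in practice it
    would be the tokenizer's EOS id. The trailing partial block is dropped.
    """
    stream: List[int] = []
    for ids in docs:
        stream.extend(ids)
        stream.append(sep_id)  # mark the document boundary
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


# Toy example with a tiny block size so the result is easy to inspect.
blocks = pack_into_blocks([[10, 11, 12], [20, 21], [30]], block_size=4, sep_id=0)
print(blocks)
```

Packing this way avoids padding, at the cost of letting a block span more than one document.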

## How to Use

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "lablup/gemma-2-2b-it-xaas-cpt"

tokenizer = AutoTokenizer.from_pretrained(model_id)
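model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # inference dtype; training itself used float32
    device_map="auto",  # requires the accelerate package
)

# Gemma 2 chat format. The chat template already prepends <bos>, so pass
# add_special_tokens=False below to avoid tokenizing a second BOS token.
# Prompt: "Please explain the procedure for opening a letter of credit (L/C)."
messages = [{"role": "user", "content": "신용장(L/C)의 개설 절차를 설명해주세요."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

# Greedy decoding; gradients are not needed for inference.
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))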
```
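
For reference, the prompt string built by `apply_chat_template` follows Gemma's turn-based format. The helper below is a simplified, hypothetical re-implementation (`render_gemma_prompt` is not part of Transformers) meant only to show what that text looks like. It leaves out the leading `<bos>` token and does not handle system messages, so the tokenizer's own template remains the safer choice in practice.

```python
def render_gemma_prompt(messages, add_generation_prompt=True):
    """Render chat messages in Gemma's turn format (illustrative sketch only)."""
    parts = []
    for message in messages:
        # Gemma names the assistant role "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        content = message["content"].strip()
        parts.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    if add_generation_prompt:
        # Open a model turn so generation continues as the assistant.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


example = render_gemma_prompt([{"role": "user", "content": "Explain Incoterms FOB."}])
print(example)
```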

## Downstream Models

| Model | Task |
|-------|------|
| [lablup/gemma-2-2b-it-xaas-qa](https://huggingface.co/lablup/gemma-2-2b-it-xaas-qa) | Korean trade QA (21,399 QA pairs) |
| [lablup/gemma-2-2b-it-xaas-kie](https://huggingface.co/lablup/gemma-2-2b-it-xaas-kie) | B2B email key-information extraction |
| [lablup/gemma-2-2b-it-xaas-sum-tag](https://huggingface.co/lablup/gemma-2-2b-it-xaas-sum-tag) | Email summarization + tagging |

## Limitations

- Adapted to the Korean trade domain through continual pre-training; general-purpose performance may be lower than that of the base `google/gemma-2-2b-it`
- Knowledge cutoff is inherited from `google/gemma-2-2b-it`; recent regulatory changes are not covered
- CPT corpus is domain-specific and does not cover all Korean language use cases

## License

This model is built on [Google Gemma 2](https://huggingface.co/google/gemma-2-2b-it) and is subject to the [Gemma Terms of Use](https://ai.google.dev/gemma/terms). Fine-tuned weights are released under the same terms.