---
license: apache-2.0
license_name: qwen
license_link: https://huggingface.co/Qwen/Qwen3-4B/blob/main/LICENSE
language:
- id
base_model: Qwen/Qwen3-4B-Base
library_name: transformers
pipeline_tag: text-generation
tags:
- qwen3
- continued-pretraining
- cpt
- indonesian
- bahasa-indonesia
- unsloth
---

# Model Card for Qwen3-4B-CPT-Base

Continued pre-trained (CPT) variant of Qwen3-4B-Base, adapted to Indonesian on ~200M tokens of domain text. This is a base model and is not instruction-tuned.

## Model Details

### Model Description

Qwen3-4B-CPT-Base extends `Qwen/Qwen3-4B-Base` with continued pre-training on a ~200M-token Indonesian corpus (news, Wikipedia, social media). The goal is Indonesian-domain adaptation as the foundation for downstream SFT. It is a base model: it performs raw text completion and is not tuned for instruction-following or chat. Part of the Model Narasi Isu pipeline (CPT -> SFT -> Deployment) for Indonesian public-issue monitoring and narrative analysis.

- **Developed by:** AITF UGM 2026
- **Model type:** Causal decoder-only LLM (continued pre-training)
- **Language(s) (NLP):** Indonesian (Bahasa Indonesia); English technical terms preserved
- **License:** Apache 2.0 (same as Qwen/Qwen3-4B-Base)
- **Finetuned from model:** Qwen/Qwen3-4B-Base

### Model Sources

- **Repository:** https://huggingface.co/aitf-ugm-2026

## Uses

### Direct Use

Indonesian-domain text completion. Perplexity benchmarking against vanilla Qwen3 baselines.

### Downstream Use

Foundation for supervised fine-tuning (SFT) on Indonesian tasks: summarization, issue narrative analysis (ABSA), dashboard previews, chatbot Q&A.

### Out-of-Scope Use

Not for chat or instruction-following before SFT. Not for high-stakes decisions without human review. Not a safety-aligned assistant.

## Bias, Risks, and Limitations

Not instruction-tuned: no reliable JSON, chat, or task behavior. Corpus is news-heavy (70%), so outputs may reflect media and social-media biases. Coverage skews to topics present in the corpus window.

### Recommendations

Make both direct and downstream users aware of the risks, biases, and limitations described above. Validate outputs, and apply SFT before deploying the model on any task.

## How to Get Started with the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "aitf-ugm-2026/Qwen3-4B-CPT-Base"
tok = AutoTokenizer.from_pretrained(model_id)
# Load in bf16 and let accelerate place the weights on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Base model: pass plain text to continue, not a chat-formatted prompt.
prompt = "Ibu kota Indonesia adalah"
ids = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=64)
# The decoded output contains the prompt followed by the continuation.
print(tok.decode(out[0], skip_special_tokens=True))
```

Serving with vLLM (use the completions endpoint, not the chat endpoint, because the model is not chat-tuned):

```bash
vllm serve aitf-ugm-2026/Qwen3-4B-CPT-Base \
  --gpu-memory-utilization 0.90 --max-model-len 8192
```
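
Once the server is running, a completion request could look like the sketch below. It assumes vLLM's OpenAI-compatible server is listening on its default `http://localhost:8000` and uses the `/v1/completions` route; `build_completion_request` and `complete` are illustrative helper names, not part of vLLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: default vLLM host and port
MODEL_ID = "aitf-ugm-2026/Qwen3-4B-CPT-Base"


def build_completion_request(prompt, max_tokens=64, temperature=0.0):
    """Build a plain-text completion payload (a 'prompt' field, no chat 'messages')."""
    return {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def complete(prompt, **kwargs):
    """POST the payload to the completions route and return the generated text."""
    payload = json.dumps(build_completion_request(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# Requires the `vllm serve` command above to be running:
# print(complete("Ibu kota Indonesia adalah"))
```

The request carries a raw `prompt` string, which matches how a base model is meant to be used: it continues text and does not interpret chat roles.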

## Training Details

### Training Data

~200M tokens, group-aware split (train/val/test = 0.99 / 0.005 / 0.005).

| Source | Share | Tokens |
|---|---|---|
| News | 70% | ~140M |
| Wikipedia (id) | 20% | ~40M |
| Social media | 10% | ~20M |
| Total | 100% | ~200M |

Train split: 325,860 records / ~198M tokens. Test: 1,655 records (news 1,098 / socmed 191 / wiki 366).

### Training Procedure

#### Preprocessing

Group-aware train/val/test split to avoid leakage between splits. Sequence packing enabled. Data was processed on the local Colab disk (`/content/`) before being copied to Google Drive.
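
One common way to implement a group-aware split is to hash a group key so that every record from the same group lands in the same split. The sketch below illustrates that idea only; the grouping key used for this corpus is not documented here, so the `group` field, the example keys, and the `assign_split` helper are all hypothetical.

```python
import hashlib


def assign_split(group_key, val_frac=0.005, test_frac=0.005):
    """Deterministically map a group key to 'train', 'val', or 'test'."""
    digest = hashlib.sha256(group_key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a uniform number in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


# Hypothetical records: two chunks of one article share a group key.
records = [
    {"group": "news:article-001", "text": "paragraf pertama"},
    {"group": "news:article-001", "text": "paragraf kedua"},
    {"group": "wiki:page-042", "text": "isi halaman"},
]

for rec in records:
    rec["split"] = assign_split(rec["group"])

# Records from the same group always share a split, so no group leaks across splits.
assert records[0]["split"] == records[1]["split"]
```

Because assignment happens per group, the realized fractions only approximate 0.99 / 0.005 / 0.005 at the record or token level.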

#### Training Hyperparameters

- **Training regime:** bf16 mixed precision
- Method: LoRA, RSLoRA enabled
- LoRA rank / alpha: 128 / 256
- Extra modules: `embed_tokens`, `lm_head` included
- LoRA dropout: 0.0
- Max seq length: 8192
- Packing: True; 4-bit load: False
- Epochs: 1
- Per-device batch: 12; grad accumulation: 16; effective batch: 192
- Learning rate: 1e-5; embedding LR: 5e-6
- Scheduler: cosine; warmup ratio: 0.03
- Optimizer: adamw_8bit; weight decay: 0.01
- Seed: 3407; early stopping enabled
- Save format: merged_16bit
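
The hyperparameters above imply a fairly short run. The following back-of-the-envelope sketch derives tokens per optimizer step, the approximate step count, and the LoRA scaling factor. It assumes a single GPU, fully packed 8192-token sequences, the ~198M-token train split, and no early stop; actual logged values may differ.

```python
import math

per_device_batch = 12
grad_accum = 16
num_devices = 1              # assumption: one A100
max_seq_len = 8192
train_tokens = 198_000_000   # ~198M tokens in the train split
warmup_ratio = 0.03
lora_r, lora_alpha = 128, 256

effective_batch = per_device_batch * grad_accum * num_devices
tokens_per_step = effective_batch * max_seq_len       # upper bound with packing
total_steps = math.ceil(train_tokens / tokens_per_step)
warmup_steps = math.ceil(warmup_ratio * total_steps)

# Standard LoRA scales the update by alpha / r; RSLoRA uses alpha / sqrt(r).
standard_scale = lora_alpha / lora_r
rslora_scale = lora_alpha / math.sqrt(lora_r)

print(f"effective batch : {effective_batch} sequences")
print(f"tokens per step : {tokens_per_step:,}")
print(f"steps (1 epoch) : ~{total_steps}, warmup ~{warmup_steps}")
print(f"LoRA scale      : {standard_scale:.2f} standard vs {rslora_scale:.2f} with RSLoRA")
```

With rank 128, RSLoRA applies a much larger scaling factor than standard LoRA would for the same alpha, which is worth keeping in mind when reading the low learning rate of 1e-5.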

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Held-out test set: 1,655 documents (news 1,098 / socmed 191 / wiki 366).

#### Factors

Disaggregated by source domain: news, social media, Wikipedia.

#### Metrics

Perplexity (lower is better). Eval: ~1M tokens, `max_length=4096`, `stride=1024`, bf16 / 4-bit.
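
The `max_length` and `stride` settings suggest a strided sliding-window evaluation. The sketch below shows the windowing logic of that common recipe in plain Python: each window sees up to `max_length` tokens of context, but only tokens not scored by an earlier window count toward the loss. It is an assumption that the evaluation here follows this exact recipe; `window_spans` and `perplexity` are illustrative names.

```python
import math


def window_spans(seq_len, max_length=4096, stride=1024):
    """Return (begin, end, n_scored) per window; n_scored are newly scored tokens."""
    spans = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == seq_len:
            break
    return spans


def perplexity(total_nll, n_tokens):
    """Perplexity is exp of the mean negative log-likelihood per scored token."""
    return math.exp(total_nll / n_tokens)


spans = window_spans(10_000)
# Every token is scored exactly once across all windows.
assert sum(n for _, _, n in spans) == 10_000
print(spans[0], spans[-1])
```

In a full evaluation, the model would be run on tokens `begin:end` of each window, with labels masked for all but the last `n_scored` positions, and the summed losses passed to `perplexity`.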

### Results

| Model | Full | News | Socmed | Wiki |
|---|---|---|---|---|
| Qwen3-4B-CPT-Base (this) | 4.561 | 4.108 | 4.418 | 6.492 |
| Qwen3-4B-Base (vanilla) | 5.930 | 5.389 | 6.438 | 7.757 |
| Improvement | ~23% | ~24% | ~31% | ~16% |

#### Summary

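CPT cuts perplexity by ~23% overall versus vanilla Qwen3-4B-Base, with the largest gain on social media (~31%) and the smallest on Wikipedia (~16%). The model is also reported to beat vanilla Qwen3-8B-Base on all four subsets (the 8B figures are not listed in the table above), which suggests that domain adaptation outweighs raw parameter count for this Indonesian domain.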
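
The improvement percentages can be re-derived from the perplexity values in the results table as the relative reduction `(base - cpt) / base`. A minimal check, with `relative_reduction` as an illustrative helper name:

```python
# (CPT perplexity, vanilla Qwen3-4B-Base perplexity) from the results table
ppl = {
    "Full": (4.561, 5.930),
    "News": (4.108, 5.389),
    "Socmed": (4.418, 6.438),
    "Wiki": (6.492, 7.757),
}


def relative_reduction(cpt, base):
    """Percentage drop in perplexity relative to the baseline."""
    return (base - cpt) / base * 100


for subset, (cpt, base) in ppl.items():
    print(f"{subset}: {relative_reduction(cpt, base):.1f}% lower perplexity")
```

Rounded to whole percentages, these values match the Improvement row (23, 24, 31, 16).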

## Technical Specifications

### Model Architecture and Objective

Qwen3 causal decoder-only transformer. Objective: continued causal language-model pre-training (next-token prediction).

### Compute Infrastructure

#### Hardware

NVIDIA A100 80GB (Google Colab Pro+).

#### Software

Unsloth, TRL, Hugging Face Transformers, PEFT, bitsandbytes. Monitoring: Weights & Biases (W&B).

## Citation

**BibTeX:**

```bibtex
@misc{qwen3_4b_cpt_base,
  title  = {Qwen3-4B-CPT-Base: Indonesian Continued Pre-Training},
  author = {AITF UGM 2026},
  year   = {2026},
  note   = {Model Narasi Isu pipeline}
}
```

**APA:**

AITF UGM 2026. (2026). Qwen3-4B-CPT-Base: Indonesian Continued Pre-Training. Model Narasi Isu pipeline.

## More Information

Model Narasi Isu: Indonesian public-issue monitoring and narrative analysis pipeline.

## Model Card Authors

AITF UGM 2026

## Model Card Contact

https://huggingface.co/aitf-ugm-2026