---
license: apache-2.0
license_name: qwen
license_link: https://huggingface.co/Qwen/Qwen3-4B/blob/main/LICENSE
language:
- id
base_model: Qwen/Qwen3-4B-Base
library_name: transformers
pipeline_tag: text-generation
tags:
- qwen3
- continued-pretraining
- cpt
- indonesian
- bahasa-indonesia
- unsloth
---

# Model Card for Qwen3-4B-CPT-Base

Continued pre-trained (CPT) variant of Qwen3-4B-Base, adapted to Indonesian on ~200M domain tokens. This is a base model; it is not instruction-tuned.

## Model Details

### Model Description

Qwen3-4B-CPT-Base extends `Qwen/Qwen3-4B-Base` with continued pre-training on a ~200M-token Indonesian corpus (news, Wikipedia, social media). The goal is Indonesian-domain adaptation as the foundation for downstream supervised fine-tuning (SFT). As a base model, it performs raw text completion and is not tuned for instruction-following or chat.

It is part of the Model Narasi Isu pipeline (CPT -> SFT -> Deployment) for Indonesian public-issue monitoring and narrative analysis.

- **Developed by:** AITF UGM 2026
- **Model type:** Causal decoder-only LLM (continued pre-training)
- **Language(s) (NLP):** Indonesian (Bahasa Indonesia); English technical terms preserved
- **License:** Apache 2.0 (see the linked Qwen3-4B LICENSE)
- **Finetuned from model:** Qwen/Qwen3-4B-Base

### Model Sources

- **Repository:** https://huggingface.co/aitf-ugm-2026

## Uses

### Direct Use

- Indonesian-domain text completion.
- Perplexity benchmarking against vanilla Qwen3 baselines.

### Downstream Use

Foundation for SFT on Indonesian tasks: summarization, issue narrative analysis (aspect-based sentiment analysis, ABSA), dashboard previews, and chatbot Q&A.

### Out-of-Scope Use

- Not for chat or instruction-following before SFT.
- Not for high-stakes decisions without human review.
- Not a safety-aligned assistant.

## Bias, Risks, and Limitations

The model is not instruction-tuned, so it has no reliable JSON, chat, or task behavior. The corpus is news-heavy (70%), so outputs may reflect media and social-media biases.
Coverage skews toward topics present in the corpus window.

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. Validate outputs, and apply SFT before task deployment.

## How to Get Started with the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "aitf-ugm-2026/Qwen3-4B-CPT-Base"

# Load the tokenizer and the model in bf16, placing weights automatically.
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Base model: give it a text prefix to complete, not a chat message.
prompt = "Ibu kota Indonesia adalah"  # "The capital of Indonesia is"
ids = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```

Serving with vLLM (use the completions endpoint, not the chat endpoint, because this is a base model):

```bash
vllm serve aitf-ugm-2026/Qwen3-4B-CPT-Base \
  --gpu-memory-utilization 0.90 --max-model-len 8192
```

## Training Details

### Training Data

~200M tokens, with a group-aware split (train/val/test = 0.99 / 0.005 / 0.005).

| Source | Share | Tokens |
|---|---|---|
| News | 70% | ~140M |
| Wikipedia (id) | 20% | ~40M |
| Social media | 10% | ~20M |
| Total | 100% | ~200M |

- Train split: 325,860 records / ~198M tokens.
- Test split: 1,655 records (news 1,098 / social media 191 / Wikipedia 366).

### Training Procedure

#### Preprocessing

The train/val/test split is group-aware to avoid leakage between splits. Sequence packing is enabled. Data is processed locally under `/content/` before being copied to Google Drive.
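The idea behind a group-aware split can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the card does not say how groups are defined or how records were assigned, so the `group_id` field, the hash-based assignment, and the function names `assign_split` and `group_aware_split` are all assumptions. The point it shows is that every record sharing a group key lands in the same split, so near-duplicate content cannot leak from train into val or test.

```python
import hashlib
from collections import defaultdict


def assign_split(group_key: str, val_frac: float = 0.005, test_frac: float = 0.005) -> str:
    """Deterministically map a group key to 'train', 'val', or 'test'.

    Hypothetical helper: the hash-based rule is an assumption, not the
    method used to build the released splits.
    """
    digest = hashlib.sha256(group_key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it to roughly [0, 1].
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


def group_aware_split(records, group_field="group_id"):
    """Bucket records by split, keeping each group inside a single split."""
    splits = defaultdict(list)
    for rec in records:
        splits[assign_split(str(rec[group_field]))].append(rec)
    return dict(splits)


if __name__ == "__main__":
    # Toy data: 500 records spread over 50 hypothetical groups.
    demo = [{"group_id": f"g{i % 50}", "text": f"doc {i}"} for i in range(500)]
    for name, recs in group_aware_split(demo).items():
        print(name, len(recs))
```

Because assignment happens per group, the realized fractions only approximate 0.99 / 0.005 / 0.005 at the record or token level, and the approximation improves as the number of groups grows.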
#### Training Hyperparameters

- **Training regime:** bf16 mixed precision
- Method: LoRA, RSLoRA enabled
- LoRA rank / alpha: 128 / 256
- Extra modules: `embed_tokens`, `lm_head` included
- LoRA dropout: 0.0
- Max seq length: 8192
- Packing: True; 4-bit load: False
- Epochs: 1
- Per-device batch: 12; grad accumulation: 16; effective batch: 192
- Learning rate: 1e-5; embedding LR: 5e-6
- Scheduler: cosine; warmup ratio: 0.03
- Optimizer: adamw_8bit; weight decay: 0.01
- Seed: 3407; early stopping enabled
- Save format: merged_16bit

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Held-out test set: 1,655 documents (news 1,098 / social media 191 / Wikipedia 366).

#### Factors

Results are disaggregated by source domain: news, social media, Wikipedia.

#### Metrics

Perplexity (lower is better). Evaluated on ~1M tokens with `max_length=4096`, `stride=1024`, bf16 / 4-bit.

### Results

| Model | Full | News | Socmed | Wiki |
|---|---|---|---|---|
| Qwen3-4B-CPT-Base (this) | 4.561 | 4.108 | 4.418 | 6.492 |
| Qwen3-4B-Base (vanilla) | 5.930 | 5.389 | 6.438 | 7.757 |
| Relative improvement | ~23% | ~24% | ~31% | ~16% |

#### Summary

CPT cuts perplexity by ~23% overall relative to vanilla Qwen3-4B-Base, with the largest gain on social media (~31%) and the smallest on Wikipedia (~16%). The model is also reported to beat vanilla Qwen3-8B-Base on all four subsets (the 8B figures are not listed in the table above), which suggests that domain adaptation outweighs raw parameter count for this Indonesian domain.

## Technical Specifications

### Model Architecture and Objective

Qwen3 causal decoder-only transformer. Objective: continued causal language-model pre-training (next-token prediction).

### Compute Infrastructure

#### Hardware

NVIDIA A100 80GB (Google Colab Pro+).

#### Software

Unsloth, TRL, HuggingFace Transformers, PEFT, bitsandbytes. Monitoring: WandB.

## Citation

**BibTeX:**

```bibtex
@misc{qwen3_4b_cpt_base,
  title  = {Qwen3-4B-CPT-Base: Indonesian Continued Pre-Training},
  author = {AITF UGM 2026},
  year   = {2026},
  note   = {Model Narasi Isu pipeline}
}
```

**APA:**

AITF UGM 2026. (2026). Qwen3-4B-CPT-Base: Indonesian Continued Pre-Training.
Model Narasi Isu pipeline.

## More Information

Model Narasi Isu is a pipeline for Indonesian public-issue monitoring and narrative analysis.

## Model Card Authors

AITF UGM 2026

## Model Card Contact

https://huggingface.co/aitf-ugm-2026