---
base_model: Qwen/Qwen3.5-4B-Base
library_name: transformers
language:
- id
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- image-captioning
- qwen3.5
- bahasa-indonesia
- lora
- lora-merged
- connector-tuned
- vlm
- multimodal
- json-output
datasets:
- custom
---

# Qwen3.5-4B ImCap: Indonesian Image Captioning (LoRA + Connector, merged)

A fine-tune of [`Qwen/Qwen3.5-4B-Base`](https://huggingface.co/Qwen/Qwen3.5-4B-Base) for **Indonesian-language image captioning**. Unlike the LoRA-only variant, this model **also trains the visual connector (`visual.merger`)** in addition to applying LoRA to the language layers; the vision backbone stays frozen. Everything has been **merged** into the base weights.

---

## Training Scheme

| Component | Status | Notes |
|---|---|---|
| Language layers | **LoRA** (r=32, alpha=64) | q/k/v/o/gate/up/down_proj |
| Connector (`visual.merger`) | **Fully trained** (`modules_to_save`) | Separate, lower LR (2e-5) |
| Vision backbone | **Frozen** | - |
| Adapter status | **Merged** into base | standalone |
| Output language | Indonesian | JSON format `{"caption": "..."}` |
| `enable_thinking` | `False` | |
| EOS token | `<\|im_end\|>` (+ `<\|endoftext\|>`) | |
| Precision | bfloat16 | |

---

## Usage (Inference)

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

REPO = "Adicandra/Qwen3.5-4B-ImCap-LoRA-Connector"

processor = AutoProcessor.from_pretrained(REPO)
model = AutoModelForImageTextToText.from_pretrained(
    REPO,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # use "sdpa" on pre-Ampere GPUs
)
model.eval()
```

### System Prompt (MUST match training)

```python
# Keep this string verbatim (in Indonesian): it must match the training prompt.
SYSTEM_PROMPT = (
    "Annotator dataset image captioning. Tulis caption Bahasa Indonesia yang deskriptif.\n\n"
    "Aturan:\n"
    "- Deskripsikan subjek utama, detail visual (warna, posisi, atribut), dan latar belakang.\n"
    "- Jika gambar mengandung teks penting (meme, infografis, berita, poster), sertakan isi teksnya dalam caption.\n"
    "- KHUSUS UNTUK GAMBAR MEME: Analisis dan jelaskan makna sarkasme, ironi, atau humor yang terkandung di dalamnya jika ada.\n"
    "- Panjang caption fleksibel: 2-3 kalimat untuk gambar biasa, lebih panjang jika ada teks/informasi penting atau sarkasme.\n"
    "- Hanya deskripsikan yang terlihat. Jangan tebak identitas/nama. Jangan awali dengan \"gambar ini menunjukkan\".\n"
    '- Output: Harus berupa JSON valid dengan format: {"caption": "isi caption disini"}'
)
```

### Render + Generate

```python
import json, re

# Load the image and downscale it in place so neither side exceeds 560 px.
img = Image.open("path/to/image.jpg").convert("RGB")
img.thumbnail((560, 560), Image.Resampling.LANCZOS)

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Buatkan caption deskriptif untuk gambar ini."},
    ]},
]

# Render the chat template to text with thinking disabled, as in training.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = processor(text=[text], images=[[img]], return_tensors="pt").to(model.device)
input_len = inputs.input_ids.shape[1]

# Stop on either end-of-turn token.
im_end_id = processor.tokenizer.convert_tokens_to_ids("<|im_end|>")
eot_id = processor.tokenizer.convert_tokens_to_ids("<|endoftext|>")
eos_ids = list({im_end_id, eot_id} - {-1})

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,  # greedy decoding
        use_cache=True,
        eos_token_id=eos_ids,
        pad_token_id=processor.tokenizer.pad_token_id,
    )

# Decode only the newly generated tokens, not the prompt.
raw = processor.tokenizer.decode(out[0, input_len:], skip_special_tokens=True)


def extract_caption(raw: str):
    """Return (caption, status), where status is "valid", "recovered", or "invalid"."""
    # Drop any <think>...</think> block the model may still emit.
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    # First attempt: the whole output is a JSON object.
    try:
        obj = json.loads(cleaned)
        if isinstance(obj, dict) and "caption" in obj:
            return obj["caption"], "valid"
    except Exception:
        pass
    # Second attempt: pull the outermost {...} span out of surrounding text.
    m = re.search(r"\{.*\}", cleaned, flags=re.DOTALL)
    if m:
        try:
            obj = json.loads(m.group(0))
            if isinstance(obj, dict) and "caption" in obj:
                return obj["caption"], "recovered"
        except Exception:
            pass
    # Fall back to the cleaned raw text.
    return cleaned, "invalid"


print(extract_caption(raw))
```

---

## Limitations

- Output is **Indonesian** only.
- Runs best on GPUs with FlashAttention-2 (Ampere+); on a T4, use `attn_implementation="sdpa"`.
- Do not guess the identity or names of people from images.

## License

Follows the base model: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0).
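
## Appendix: choosing `attn_implementation`

The FlashAttention-2 note under Limitations can be turned into a small selection helper. The following is a minimal sketch, not part of the original card: `pick_attn_implementation` and its parameters are hypothetical names, and it assumes that "Ampere+" corresponds to a CUDA compute capability of 8.0 or higher (a T4 reports 7.5) and that the `flash_attn` package must also be installed.

```python
from typing import Optional, Tuple


def pick_attn_implementation(
    capability: Optional[Tuple[int, int]],
    has_flash_attn: bool,
) -> str:
    """Return a value for the `attn_implementation` argument (hypothetical helper).

    capability: (major, minor) CUDA compute capability, or None when no GPU is present.
    has_flash_attn: whether the `flash_attn` package can be imported.
    """
    # Assumption: Ampere and newer GPUs report compute capability >= (8, 0);
    # tuples compare element by element, so (7, 5) for a T4 fails this check.
    if capability is not None and capability >= (8, 0) and has_flash_attn:
        return "flash_attention_2"
    # Otherwise use PyTorch's scaled dot-product attention.
    return "sdpa"
```

With PyTorch available, `capability` could come from `torch.cuda.get_device_capability()` when `torch.cuda.is_available()` is true, and `has_flash_attn` from `importlib.util.find_spec("flash_attn") is not None`; the returned string would then be passed to `from_pretrained` in place of the hard-coded value.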