InternVL3.5-8B — W4A16 INT4 (BF16 Vision)

RTN W4A16 INT4 quantization of OpenGVLab/InternVL3_5-8B targeting Ampere GPUs (RTX 3000/4000 series, A100, etc.).

Language model: INT4 pack-quantized (weight_packed int32 + BF16 group scales, group_size=128)
Vision tower (vision_model.*): BF16 — untouched
Vision projector (mlp1.*): BF16 — untouched
Format: compressed-tensors pack-quantized — loaded natively by vllm

Verified results

Test	Result
sqrt(144) + closest planet to Sun	✓ 12, Mercury
Golden Gate Bridge image (Wikimedia)	✓ Correct ID, no hallucination

Usage — vllm (recommended)

vllm serve useful-quants/InternVL3-5-8B-W4A16-INT4 \
    --dtype bfloat16 \
    --max-model-len 8192 \
    --max-num-seqs 16 \
    --gpu-memory-utilization 0.90

Usage — transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "useful-quants/InternVL3-5-8B-W4A16-INT4",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "useful-quants/InternVL3-5-8B-W4A16-INT4",
    trust_remote_code=True,
)

# Text-only
response = model.chat(tokenizer, pixel_values=None,
                      question="Hello!", generation_config=dict(max_new_tokens=256))

# With image
from PIL import Image
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode

MEAN, STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)
tf = T.Compose([
    T.Resize((448, 448), interpolation=InterpolationMode.BICUBIC),
    T.ToTensor(), T.Normalize(MEAN, STD),
])
pixel_values = tf(Image.open("image.jpg").convert("RGB")).unsqueeze(0)
pixel_values = pixel_values.to(model.device, dtype=torch.bfloat16)

response = model.chat(tokenizer, pixel_values=pixel_values,
                      question="<image>\nDescribe this image.",
                      generation_config=dict(max_new_tokens=256))

Notes

Quantization tool: llmcompressor with QuantizationModifier(scheme="W4A16")
trust_remote_code=True required for both loading and inference
The tokenizer regex warning (fix_mistral_regex) is cosmetic and does not affect output quality