JoyFox-PawScope-VL-AWQ

JoyFox-PawScope-VL-AWQ is the 4-bit AWQ release of joyfox/JoyFox-PawScope-VL, a pet-focused vision-language model for cat and dog breed understanding. It keeps the original model's response style: it first describes visible pet traits, then gives a natural-language breed judgement and a concise visual rationale.

The model is designed for image-based pet breed demos, pet-care assistants, data annotation workflows, and product prototypes where lower memory usage is preferred. It is not a veterinary diagnostic system and should not be used as the sole source of truth for breed certification.

What The Model Does

Given a pet image and an instruction, the model produces a Chinese response covering:

  • visible appearance traits such as coat color, coat length, face shape, ears, eyes, muzzle, body proportion, and posture,
  • age-stage cues such as adult cat/dog, kitten, or puppy when visually inferable,
  • the most likely cat or dog breed, with a concise reason grounded in the image.

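A typical response follows this structure, with an opening observation ("From the image, ..."), a judgement line (判断结果, "Judgement: this cat/dog is more likely a ..."), and a rationale line (理由, "Rationale: ..."):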

从图片看,...

判断结果:这只猫/狗更可能是...。

理由:...
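
Because the response follows a fixed three-part layout, downstream code can split it on the two markers. The following is a minimal sketch and is not part of the released repository: the function name `split_response` is hypothetical, and it assumes that 判断结果 and 理由 each appear once and are followed by either an ASCII or a full-width colon.

```python
import re
from typing import Dict

# Section markers of the response format. Both colon styles are accepted
# because the exact punctuation may differ between outputs (assumption).
_JUDGEMENT = re.compile(r"判断结果\s*[::]")
_RATIONALE = re.compile(r"理由\s*[::]")


def split_response(answer: str) -> Dict[str, str]:
    """Split a model answer into description, judgement, and rationale.

    Sections that cannot be found are returned as empty strings.
    """
    text = answer.strip()
    j = _JUDGEMENT.search(text)
    if j is None:
        # No judgement marker: treat the whole answer as description.
        return {"description": text, "judgement": "", "rationale": ""}

    description = text[: j.start()].strip()
    # Search for the rationale marker only after the judgement marker.
    r = _RATIONALE.search(text, j.end())
    if r is None:
        return {"description": description,
                "judgement": text[j.end():].strip(),
                "rationale": ""}

    return {
        "description": description,
        "judgement": text[j.end(): r.start()].strip(),
        "rationale": text[r.end():].strip(),
    }
```

An answer without the 判断结果 marker comes back with an empty judgement, which a caller could treat as a signal to route the image to manual review.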

Model Details

| Item | Description |
|---|---|
| Model name | JoyFox-PawScope-VL-AWQ |
| Source model | joyfox/JoyFox-PawScope-VL |
| Foundation model | openbmb/MiniCPM-V-4_6 |
| Model family | MiniCPM-V multimodal model |
| Released format | AWQ 4-bit checkpoint, Safetensors |
| Quantization | AWQ, 4-bit weights, group size 128, zero point enabled, GEMM backend |
| Primary modality | Image + text instruction |
| Main task | Cat and dog breed image understanding |
| Primary output language | Chinese |
| Recommended image detail mode | downsample_mode="4x", max_slice_nums=36 |
| Remote code | Required: trust_remote_code=True |
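
The quantization settings imply slightly more than 4 bits of storage per weight, because every group of 128 weights also carries a scale and a zero point. The sketch below is a rough estimate only. It assumes 16-bit scales and 4-bit packed zero points, which is a common AWQ GEMM layout but is not stated in this model card, and the function names are hypothetical.

```python
def awq_bits_per_weight(w_bit: int = 4, group_size: int = 128,
                        scale_bits: int = 16, zero_bits: int = 4) -> float:
    """Effective storage bits per quantized weight, including group metadata.

    scale_bits and zero_bits are assumptions about the packing layout.
    """
    return w_bit + (scale_bits + zero_bits) / group_size


def quantized_weight_bytes(num_quantized_params: int, **kwargs) -> float:
    """Approximate bytes used by the quantized linear weights only.

    Modules left unquantized are not covered by this estimate.
    """
    return num_quantized_params * awq_bits_per_weight(**kwargs) / 8
```

With the defaults this gives 4.15625 bits per weight. Any modules left unquantized, such as linear_attn.in_proj_a and linear_attn.in_proj_b, add to the total at their original precision.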

Highlights

  • Lower-memory deployment: AWQ 4-bit weights reduce model size while aiming to preserve the original pet-focused behavior; small differences from the source checkpoint are possible.
  • Pet-focused visual intelligence: specialized for cat and dog image understanding rather than generic image captioning.
  • Natural judgement format: describes visible traits first, then outputs 判断结果 (judgement) and 理由 (rationale) in a stable Chinese style.
  • Fine-grained breed grounding: supports detailed cat and dog breed judgement from visible features.
  • Age-stage awareness: can mention puppy, kitten, or adult cues when they are visually inferable.
  • Practical inference script: the included infer_pet_vision_awq.py uses AutoAWQ loading for direct image inference.

Intended Use

JoyFox-PawScope-VL-AWQ is intended for applications such as:

  • cat and dog breed-recognition demos,
  • pet-care assistants that need image-aware breed explanations,
  • pet image dataset annotation and review workflows,
  • structured labeling of cat/dog image collections,
  • educational tools for comparing common pet breed traits,
  • lower-memory deployment experiments based on the JoyFox-PawScope-VL model family.

The model should be used as an assistive interpretation layer. It can summarize likely visual cues and suggest a likely breed, but it should not replace pedigree documents, professional breed assessment, veterinary care, or direct owner knowledge.
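
In an annotation workflow, one way to keep the model in this assistive role is to store its answer as a suggestion that stays unconfirmed until a person reviews it. This is a minimal sketch under that assumption; the `BreedSuggestion` class and its field names are hypothetical and are not part of the released files.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class BreedSuggestion:
    image_path: str
    model_answer: str                      # raw model text, kept verbatim
    reviewed: bool = False                 # set to True only by a human reviewer
    confirmed_breed: Optional[str] = None  # filled in by the reviewer, not the model


def to_jsonl_line(record: BreedSuggestion) -> str:
    """Serialize one record as a JSON Lines row, keeping Chinese text readable."""
    return json.dumps(asdict(record), ensure_ascii=False)
```

Keeping the confirmed label in a separate field means a downstream consumer can filter on `reviewed` instead of trusting the model text directly.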

Quick Start With AutoAWQ

This AWQ release is intended to be loaded with an AWQ-compatible runtime. The included example uses AutoAWQForCausalLM.from_quantized() together with the MiniCPM-V processor.

import torch
from awq import AutoAWQForCausalLM
from transformers import AutoProcessor

model_path = "joyfox/JoyFox-PawScope-VL-AWQ"
image_path = "your_pet_image.jpg"

prompt = """请观察图片中的宠物,先自然说明可见外观特征,再判断它最可能的具体品种,并给出理由。

回答格式:
从图片看,...

判断结果:这只猫/狗更可能是...。

理由:..."""

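# Load the processor and the AWQ checkpoint; both rely on the model's remote code.
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
awq_model = AutoAWQForCausalLM.from_quantized(
    model_path,
    trust_remote_code=True,
    fuse_layers=False,
)
# AutoAWQ wraps the underlying Transformers model; generate() is called on the inner model.
model = awq_model.model
model.eval()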

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": image_path},
        {"type": "text", "text": prompt},
    ],
}]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    downsample_mode="4x",
    max_slice_nums=36,
    enable_thinking=False,
)
# Move the inputs to the same device as the model weights.
inputs = inputs.to(next(model.parameters()).device)

with torch.inference_mode():
    output_ids = model.generate(
        **inputs,
        downsample_mode="4x",
        max_new_tokens=512,
        do_sample=False,
    )

# Drop the prompt tokens so that only the newly generated tokens are decoded.
output_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
answer = processor.batch_decode(
    output_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(answer.strip())
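
To run the same prompt over a folder of images, the message construction can be factored out of the example above. This is a sketch under stated assumptions: the helper names and the set of accepted file suffixes are hypothetical, and the message shape simply mirrors the one used in the quick start.

```python
from pathlib import Path
from typing import Dict, Iterator, List, Tuple

# Assumed set of image suffixes; adjust to whatever your decoder supports.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def build_messages(image_path: str, prompt: str) -> List[Dict]:
    """Build a single-turn image + text message in the quick-start shape."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "url": image_path},
            {"type": "text", "text": prompt},
        ],
    }]


def iter_image_messages(folder: str, prompt: str) -> Iterator[Tuple[str, List[Dict]]]:
    """Yield (path, messages) for each image file in a folder, in sorted order."""
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            yield str(path), build_messages(str(path), prompt)
```

Each yielded `messages` value can be passed to `processor.apply_chat_template(...)` in place of the single `messages` list in the quick start.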

Using infer_pet_vision_awq.py

The included inference script is designed for direct file-based testing with this AWQ checkpoint. Edit the configuration block at the top of infer_pet_vision_awq.py:

MODEL_PATH = str(MODEL_DIR)
IMAGE_PATH = str(MODEL_DIR / "assets" / "dog_pug_08238.png")
MAX_NEW_TOKENS = 512
DOWNSAMPLE_MODE = "4x"
MAX_SLICE_NUMS = 36
DO_SAMPLE = False
FUSE_LAYERS = False

Then run:

python infer_pet_vision_awq.py

The script resolves local image paths, converts file-based images to a standard temporary JPEG for robust decoding, loads the checkpoint with AutoAWQ, builds a MiniCPM-V image-text message, runs deterministic generation, and prints the decoded answer.
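
The path resolution and JPEG conversion steps might look roughly like the following. This is an illustrative sketch, not the code of infer_pet_vision_awq.py, whose internals are not shown here. The function names are hypothetical, and the conversion step assumes Pillow is installed.

```python
import tempfile
from pathlib import Path


def resolve_image_path(image_path: str, base_dir: str) -> Path:
    """Resolve a possibly relative image path against a base directory."""
    path = Path(image_path).expanduser()
    if not path.is_absolute():
        path = Path(base_dir) / path
    path = path.resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Image not found: {path}")
    return path


def to_temp_jpeg(image_path: Path, quality: int = 95) -> Path:
    """Re-encode an image as an RGB JPEG in a temporary file (requires Pillow)."""
    from PIL import Image  # imported lazily so the resolver works without Pillow

    with Image.open(image_path) as img:
        # JPEG cannot store alpha or palette modes, so convert to RGB first.
        rgb = img.convert("RGB")
    tmp = tempfile.NamedTemporaryFile(suffix=".jpg", delete=False)
    tmp.close()
    rgb.save(tmp.name, format="JPEG", quality=quality)
    return Path(tmp.name)
```

The temporary file is created with `delete=False`, so a caller following this sketch would be responsible for removing it after inference.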

Quantization Notes

This checkpoint uses the MiniCPM-V 4.6 AWQ path with 4-bit GEMM quantization. The small GatedDeltaNet projection modules linear_attn.in_proj_a and linear_attn.in_proj_b are kept unquantized: they participate in scale compensation but are not suitable for AWQ GEMM weight packing. This is reflected in the model's quantization_config.

Depending on the Transformers and quantization backend versions in a given environment, direct AutoModelForImageTextToText.from_pretrained() loading may require an AWQ backend such as gptqmodel. When in doubt, use the AutoAWQ loading path shown above or the included infer_pet_vision_awq.py script.
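
Before choosing a loading path, it can help to inspect the quantization settings stored with the checkpoint. The sketch below reads them from a parsed config.json. The key names follow the layout commonly written by Transformers for AWQ checkpoints and are assumptions here; this checkpoint's config may nest or name them differently, and the helper names are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict


def summarize_quant_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Pull AWQ-related fields out of a parsed config.json dictionary.

    Missing keys are returned as None (or an empty list) instead of raising.
    """
    q = config.get("quantization_config") or {}
    return {
        "method": q.get("quant_method"),
        "bits": q.get("bits"),
        "group_size": q.get("group_size"),
        "zero_point": q.get("zero_point"),
        "version": q.get("version"),
        "skipped_modules": q.get("modules_to_not_convert") or [],
    }


def load_and_summarize(model_dir: str) -> Dict[str, Any]:
    """Read config.json from a local checkpoint directory and summarize it."""
    with open(Path(model_dir) / "config.json", encoding="utf-8") as f:
        return summarize_quant_config(json.load(f))
```

If the skipped-module list is empty or the fields come back as None, the assumed key names probably do not match this checkpoint and the raw config.json should be checked by hand.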

Qualitative Examples

The following examples illustrate the preferred response style: the model first describes visible pet traits, then gives a breed judgement and a short visual rationale. These examples are intended to demonstrate output format and qualitative behavior, not to serve as a benchmark.

Pug

Image: dog_pug_08238.png

Example output

这只狗拥有典型的短毛外观,毛色呈现出浅黄褐色,面部有着明显的黑色区域,覆盖了眼睛周围及口鼻部。它的眼睛大而圆,眼神深邃,面部皮肤有明显的褶皱,这是该品种非常显著的特征。从其面部骨骼结构和成熟的眼神来看,这应该是一只成年犬。

判断结果:这只狗更可能是巴哥犬。

理由:该犬具备巴哥犬最核心的品种特征,包括标志性的短毛、浅黄褐色的被毛颜色、深色的面部斑纹以及深陷的褶皱。其大而圆的眼睛和紧凑的口鼻部结构与巴哥犬的品种标准高度吻合,这些特征共同支持了这一判断。

Border Collie

Image: dog_border_collie_07425.png

Example output

这只狗拥有黑白相间的中长毛,面部中央有明显的白色条纹,眼神专注,耳朵半立,整体身体线条匀称且显得灵活。从头部比例、毛色分布和神态来看,它呈现出成年犬的外观特征。

判断结果:这只狗更可能是边境牧羊犬。

理由:黑白双色被毛、额头到鼻梁的白色纹路、半立耳和专注的表情都是边境牧羊犬常见的视觉特征,这些线索共同支持该判断。

Sphynx Cat

Image: cat_sphynx_01647.png

Example output

这只猫几乎没有明显被毛,皮肤有清晰褶皱,耳朵大而直立,头部轮廓偏楔形,眼睛呈杏仁状,整体体态较成熟,正在柔软的毯子上休息。

判断结果:这只猫更可能是斯芬克斯猫。

理由:近乎无毛的皮肤、大耳、楔形头部和明显皮肤褶皱都是斯芬克斯猫非常突出的品种特征,因此该判断更符合图中可见外观。

Notes On Warnings

You may see warnings about the MiniCPM-V fast path not being available, AutoAWQ deprecation notices, or processor keyword messages. These warnings generally do not block image inference.

For detailed breed recognition, downsample_mode="4x" and max_slice_nums=36 preserve more visual detail. For faster but less detailed inference, use downsample_mode="16x".
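
The two settings can be kept as named presets so that callers switch between them in one place. This is a small sketch; the preset names and the helper are hypothetical, and the "fast" preset lists only downsample_mode because no max_slice_nums value is stated for it above.

```python
from typing import Any, Dict

# Values taken from the text above. The "fast" preset omits max_slice_nums
# because none is specified for the 16x mode.
IMAGE_DETAIL_PRESETS: Dict[str, Dict[str, Any]] = {
    "detail": {"downsample_mode": "4x", "max_slice_nums": 36},
    "fast": {"downsample_mode": "16x"},
}


def image_kwargs(preset: str = "detail") -> Dict[str, Any]:
    """Return a copy of the image keyword arguments for a named preset."""
    if preset not in IMAGE_DETAIL_PRESETS:
        raise ValueError(
            f"Unknown preset {preset!r}; choose from {sorted(IMAGE_DETAIL_PRESETS)}"
        )
    return dict(IMAGE_DETAIL_PRESETS[preset])
```

The returned dictionary could be unpacked into `processor.apply_chat_template(..., **image_kwargs("fast"))`, with the same downsample_mode value passed to `model.generate(...)` as in the quick start.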

Limitations

  • The model is focused on cat and dog images. Other animals, mixed scenes, toys, drawings, or non-pet images may produce unreliable outputs.
  • Breed identification from a single image is inherently uncertain, especially for mixed-breed pets and visually similar breeds.
  • Quantization may introduce small wording or judgement differences compared with the source checkpoint.
  • The model may overstate confidence when the image lacks clear breed-specific features.
  • Lighting, occlusion, grooming style, age, camera angle, and partial body visibility can reduce reliability.
  • The model primarily follows a Chinese response format. English output may require a separate prompt and has not been the main tuning target.
  • The model should not be used for veterinary diagnosis, legal breed certification, insurance decisions, shelter intake decisions, or safety-critical judgement.

License

This model follows the license terms of the source model and the released model metadata. Please also review the license and usage terms of joyfox/JoyFox-PawScope-VL and openbmb/MiniCPM-V-4_6 before redistribution or commercial use.

Acknowledgements

JoyFox-PawScope-VL-AWQ is based on JoyFox-PawScope-VL and OpenBMB's MiniCPM-V 4.6 multimodal model (openbmb/MiniCPM-V-4_6).
