---
license: apache-2.0
library_name: transformers
pipeline_tag: zero-shot-image-classification
tags:
- multimodal
- image-text-retrieval
- bilingual
- chinese
- english
- vision-language
- custom-code
---

# M2-Encoder-0.4B Hugging Face Export

This folder was generated from `Ant-Multi-Modal-Framework/prj/M2_Encoder` and is structured for direct upload to the Hugging Face Hub.

## What This Repo Supports

- `AutoConfig.from_pretrained(..., trust_remote_code=True)`
- `AutoProcessor.from_pretrained(..., trust_remote_code=True)`
- `AutoModel.from_pretrained(..., trust_remote_code=True)`
- Zero-shot image-text retrieval and zero-shot image classification

## Included Weight File

The model weights are stored in the repo root under this exact filename:

`m2_encoder_0.4B.safetensors`

Large files should be tracked with Git LFS; the included `.gitattributes` file configures this.

## Usage

### ModelScope-equivalent scoring

The original ModelScope sample computes probabilities by applying a softmax directly to the dot products of the normalized embeddings, without the learned `logit_scale`:

```python
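import torch
from transformers import AutoModel, AutoProcessor

repo_id = "malusama/M2-Encoder-0.4B"

# trust_remote_code=True is required because the model code ships with the repo.
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)

# Candidate labels: Squirtle, Bulbasaur, Charmander, Pikachu.
text_inputs = processor(
    text=["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"],
    return_tensors="pt",
)
image_inputs = processor(images="pokemon.jpeg", return_tensors="pt")

# Inference only, so skip gradient tracking.
with torch.no_grad():
    text_outputs = model(**text_inputs)
    image_outputs = model(**image_inputs)

# Softmax over raw dot products (no logit_scale): one row per image, one column per text.
probs = (image_outputs.image_embeds @ text_outputs.text_embeds.t()).softmax(dim=-1)
print(probs)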
```

### CLIP-style logits

`model(**inputs)` also returns `logits_per_image` and `logits_per_text`, which use the model's learned `logit_scale`.
Those logits are useful, but they are not the same computation as the raw dot product in the original ModelScope demo.
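
To illustrate the difference, the sketch below compares a softmax over raw dot products with a softmax over scaled logits. It assumes the CLIP convention, in which the logits are the similarities multiplied by `exp(logit_scale)`; whether M2-Encoder applies its scale in exactly this way is an assumption. The embeddings and the scale value are made-up numbers, not model outputs.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


# Toy embeddings (illustrative values only): 1 image, 3 candidate texts.
image_embeds = l2_normalize(np.array([[1.0, 0.2, 0.0]]))
text_embeds = l2_normalize(
    np.array([[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
)

# Assumed CLIP-style value; the real one is stored in the model's logit_scale.
logit_scale = np.log(100.0)

scores = image_embeds @ text_embeds.T  # raw dot products
raw_probs = softmax(scores)  # ModelScope-style scoring
scaled_probs = softmax(np.exp(logit_scale) * scores)  # CLIP-style scoring

print(raw_probs)
print(scaled_probs)
```

Both variants rank the candidates identically, since multiplying by a positive constant preserves order; the scaled version only produces a sharper distribution.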

### ONNXRuntime

This repo also includes two ONNX exports:

- `onnx/text_encoder.onnx`
- `onnx/image_encoder.onnx`

Example:

```python
import importlib
import json
import os
import sys

import onnxruntime as ort
from huggingface_hub import snapshot_download
from PIL import Image

repo_id = "malusama/M2-Encoder-0.4B"
model_dir = snapshot_download(repo_id=repo_id)
# Make the repo's custom modules (tokenizer, image processor) importable.
sys.path.insert(0, model_dir)

# Use a context manager so the config file handle is closed promptly.
with open(os.path.join(model_dir, "tokenizer_config.json"), "r", encoding="utf-8") as f:
    tokenizer_config = json.load(f)
GLMChineseTokenizer = importlib.import_module("tokenization_glm").GLMChineseTokenizer
M2EncoderImageProcessor = importlib.import_module("image_processing_m2_encoder").M2EncoderImageProcessor

tokenizer = GLMChineseTokenizer(
    vocab_file=os.path.join(model_dir, "sp.model"),
    eos_token=tokenizer_config.get("eos_token"),
    pad_token=tokenizer_config.get("pad_token"),
    cls_token=tokenizer_config.get("cls_token"),
    mask_token=tokenizer_config.get("mask_token"),
    unk_token=tokenizer_config.get("unk_token"),
)
image_processor = M2EncoderImageProcessor.from_pretrained(model_dir)

text_inputs = tokenizer(
    ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"],
    padding="max_length",
    truncation=True,
    max_length=52,
    return_special_tokens_mask=True,
    return_tensors="np",
)
image_inputs = image_processor(Image.open("pokemon.jpeg").convert("RGB"), return_tensors="np")

text_session = ort.InferenceSession(
    os.path.join(model_dir, "onnx", "text_encoder.onnx"),
    providers=["CPUExecutionProvider"],
)
image_session = ort.InferenceSession(
    os.path.join(model_dir, "onnx", "image_encoder.onnx"),
    providers=["CPUExecutionProvider"],
)

text_embeds = text_session.run(
    None,
    {
        "input_ids": text_inputs["input_ids"],
        "attention_mask": text_inputs["attention_mask"],
    },
)[0]
image_embeds = image_session.run(
    None,
    {"pixel_values": image_inputs["pixel_values"]},
)[0]
```
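
The ONNX example stops at the embeddings. A minimal sketch of turning such embeddings into ModelScope-style probabilities with NumPy follows. The helper name `score_embeddings` is hypothetical, and it re-normalizes its inputs because this README does not state whether the ONNX graphs already return L2-normalized embeddings. The demo arrays are stand-ins with made-up values.

```python
import numpy as np


def score_embeddings(image_embeds, text_embeds):
    """Softmax over raw dot products (no logit_scale), one row per image."""
    # Re-normalize defensively; this is a no-op for already normalized inputs.
    image_embeds = image_embeds / np.linalg.norm(image_embeds, axis=-1, keepdims=True)
    text_embeds = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    scores = image_embeds @ text_embeds.T
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)


# Stand-in arrays; in practice pass the image_embeds and text_embeds from ONNX Runtime.
demo_image_embeds = np.array([[0.5, 0.5, 0.0, 0.0]], dtype=np.float32)
demo_text_embeds = np.eye(4, dtype=np.float32)

demo_probs = score_embeddings(demo_image_embeds, demo_text_embeds)
print(demo_probs)
```

Each row of the result sums to 1, and the column with the highest value is the best-matching text for that image.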

## Upload

Option 1 (upload script):

```bash
python upload_to_hub.py --repo-id malusama/M2-Encoder-0.4B
```

Option 2 (Git with Git LFS):

```bash
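# Log in to the Hugging Face Hub.
huggingface-cli login
git init
git lfs install
git remote add origin https://huggingface.co/malusama/M2-Encoder-0.4B
git add .
git commit -m "Upload M2-Encoder HF export"
# Make sure the local branch is named main before pushing.
git branch -M main
git push origin main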
```

## Inference Endpoints

This repo also includes a `handler.py` for custom deployments on Hugging Face Inference Endpoints.

Example request body:

```json
{
  "inputs": {
    "text": ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"],
    "image": "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
  },
  "parameters": {
    "return_probs": true,
    "return_logits": false
  }
}
```

Example response fields:

- `text_embedding`
- `image_embedding`
- `scores`
- `probs`
- `logits_per_image` when `return_logits=true`
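
A request with this body could be sent from Python roughly as follows. The endpoint URL and token are placeholders, and `build_payload` and `query_endpoint` are hypothetical helper names; only the payload fields come from the example request. The sketch uses the standard library and assumes the endpoint accepts a JSON POST with a bearer token.

```python
import json
import urllib.request


def build_payload(texts, image_url, return_probs=True, return_logits=False):
    """Build a request body with the same fields as the example request."""
    return {
        "inputs": {"text": list(texts), "image": image_url},
        "parameters": {"return_probs": return_probs, "return_logits": return_logits},
    }


def query_endpoint(endpoint_url, token, payload):
    """POST the payload as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_payload(
    ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"],
    "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg",
)
print(json.dumps(payload, ensure_ascii=False, indent=2))

# Placeholder values; substitute your own endpoint URL and access token.
# result = query_endpoint("https://<your-endpoint-url>", "<your-token>", payload)
```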

## Notes

- This is a Hugging Face remote-code adapter, not a native `transformers` implementation.
- The underlying model code still comes from the official M2-Encoder repo.
- Loading through the `Auto*` classes requires `trust_remote_code=True`.
- The `.safetensors` weight file is already included in this Hub repo.