---
license: apache-2.0
library_name: transformers
pipeline_tag: zero-shot-image-classification
tags:
- multimodal
- image-text-retrieval
- bilingual
- chinese
- english
- vision-language
- custom-code
---

# M2-Encoder-0.4B Hugging Face Export

This folder is generated from `Ant-Multi-Modal-Framework/prj/M2_Encoder` and is structured for direct upload to Hugging Face Hub.

## What This Repo Supports

- `AutoConfig.from_pretrained(..., trust_remote_code=True)`
- `AutoProcessor.from_pretrained(..., trust_remote_code=True)`
- `AutoModel.from_pretrained(..., trust_remote_code=True)`
- Zero-shot image-text retrieval and zero-shot image classification

## Included Weight File

This repo includes the model weight file in the repo root with this exact filename:

`m2_encoder_0.4B.safetensors`

Large files should be tracked by Git LFS. A `.gitattributes` file is included for that.

## Usage

### ModelScope-equivalent scoring

The original ModelScope sample computes probabilities from the raw normalized embeddings:

```python
from transformers import AutoModel, AutoProcessor

repo_id = "malusama/M2-Encoder-0.4B"
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)

text_inputs = processor(
    text=["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"],
    return_tensors="pt",
)
image_inputs = processor(images="pokemon.jpeg", return_tensors="pt")

text_outputs = model(**text_inputs)
image_outputs = model(**image_inputs)

probs = (image_outputs.image_embeds @ text_outputs.text_embeds.t()).softmax(dim=-1)
print(probs)
```

### CLIP-style logits

`model(**inputs)` also returns `logits_per_image` and `logits_per_text`, which use the model's learned `logit_scale`. Those logits are useful, but they are not the same computation as the raw dot product in the original ModelScope demo.

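
The difference between the two scoring paths can be illustrated without loading the model. The sketch below is a toy, not the model's code: the vectors are made up, `scale = 100.0` is a hypothetical stand-in for the learned `logit_scale` (whose actual value is not stated here), and it assumes the usual CLIP convention of multiplying the cosine similarities by a positive factor before the softmax.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy embeddings, made up for illustration (not real model outputs).
image_embed = normalize([1.0, 0.2, 0.0])
text_embeds = [normalize(v) for v in ([0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0])]

# Raw cosine similarities, as in the ModelScope-style scoring.
scores = [dot(image_embed, t) for t in text_embeds]
raw_probs = softmax(scores)

# CLIP-style scoring: multiply similarities by a scale before the softmax.
scale = 100.0  # hypothetical value, stands in for the learned logit_scale
scaled_probs = softmax([scale * s for s in scores])

print("raw:   ", raw_probs)
print("scaled:", scaled_probs)
```

Under these assumptions both variants rank the candidates identically, because a positive scale preserves the ordering of the similarities. Only the sharpness of the distribution changes, so the two sets of probabilities should not be compared against the same threshold.
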
### ONNXRuntime

This repo also includes two ONNX exports:

- `onnx/text_encoder.onnx`
- `onnx/image_encoder.onnx`

Example:

```python
import importlib
import json
import os
import sys

import onnxruntime as ort
from huggingface_hub import snapshot_download
from PIL import Image

repo_id = "malusama/M2-Encoder-0.4B"
model_dir = snapshot_download(repo_id=repo_id)
sys.path.insert(0, model_dir)

with open(os.path.join(model_dir, "tokenizer_config.json"), "r", encoding="utf-8") as f:
    tokenizer_config = json.load(f)

GLMChineseTokenizer = importlib.import_module("tokenization_glm").GLMChineseTokenizer
M2EncoderImageProcessor = importlib.import_module("image_processing_m2_encoder").M2EncoderImageProcessor

tokenizer = GLMChineseTokenizer(
    vocab_file=os.path.join(model_dir, "sp.model"),
    eos_token=tokenizer_config.get("eos_token"),
    pad_token=tokenizer_config.get("pad_token"),
    cls_token=tokenizer_config.get("cls_token"),
    mask_token=tokenizer_config.get("mask_token"),
    unk_token=tokenizer_config.get("unk_token"),
)
image_processor = M2EncoderImageProcessor.from_pretrained(model_dir)

text_inputs = tokenizer(
    ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"],
    padding="max_length",
    truncation=True,
    max_length=52,
    return_special_tokens_mask=True,
    return_tensors="np",
)
image_inputs = image_processor(Image.open("pokemon.jpeg").convert("RGB"), return_tensors="np")

text_session = ort.InferenceSession(
    os.path.join(model_dir, "onnx", "text_encoder.onnx"),
    providers=["CPUExecutionProvider"],
)
image_session = ort.InferenceSession(
    os.path.join(model_dir, "onnx", "image_encoder.onnx"),
    providers=["CPUExecutionProvider"],
)

text_embeds = text_session.run(
    None,
    {
        "input_ids": text_inputs["input_ids"],
        "attention_mask": text_inputs["attention_mask"],
    },
)[0]
image_embeds = image_session.run(
    None,
    {"pixel_values": image_inputs["pixel_values"]},
)[0]
```

## Upload

Option 1:

```bash
python upload_to_hub.py --repo-id malusama/M2-Encoder-0.4B
```

Option 2:

```bash
huggingface-cli login
git init
git lfs install
git remote add origin https://huggingface.co/malusama/M2-Encoder-0.4B
git add .
git commit -m "Upload M2-Encoder HF export"
git push origin main
```

## Inference Endpoints

This repo also includes a `handler.py` for Hugging Face Inference Endpoints custom deployments.

Example request body:

```json
{
  "inputs": {
    "text": ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"],
    "image": "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
  },
  "parameters": {
    "return_probs": true,
    "return_logits": false
  }
}
```

Example response fields:

- `text_embedding`
- `image_embedding`
- `scores`
- `probs`
- `logits_per_image` when `return_logits=true`

## Notes

- This is a Hugging Face remote-code adapter, not a native `transformers` implementation.
- The underlying model code still comes from the official M2-Encoder repo.
- You need `trust_remote_code=True`.
- The `.safetensors` weight file is already included in this Hub repo.
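
For the Inference Endpoints request body shown earlier, a client could be sketched as follows using only the standard library. This is an illustration under assumptions, not part of the repo: `ENDPOINT_URL` and `HF_TOKEN` are hypothetical placeholders, `build_request` and `send` are helper names invented here, and bearer-token authentication is assumed for a protected endpoint. The payload mirrors the documented example; the exact layout of the response is not assumed beyond being JSON.

```python
import json
import urllib.request

# Placeholders: replace with your own deployed endpoint URL and access token.
ENDPOINT_URL = "https://example.invalid/your-endpoint"  # hypothetical
HF_TOKEN = "hf_xxx"  # hypothetical


def build_request(endpoint_url, token, texts, image_url,
                  return_probs=True, return_logits=False):
    """Build (but do not send) a POST request carrying the documented body."""
    payload = {
        "inputs": {"text": texts, "image": image_url},
        "parameters": {
            "return_probs": return_probs,
            "return_logits": return_logits,
        },
    }
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request(
    ENDPOINT_URL,
    HF_TOKEN,
    ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"],
    "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg",
)
print(request.get_method(), request.full_url)
# result = send(request)  # uncomment once ENDPOINT_URL and HF_TOKEN are real
```

Building the request separately from sending it keeps the payload easy to inspect or log before any network call is made.
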