`mmarco-mMiniLMv2-L12-H384-v1` ONNX CPU Dynamic INT8

This repo publishes a plain ONNX Runtime dynamic-int8 quantization of cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 for CPU reranking.

It is intended as an easy-to-download multilingual reranker artifact for English and Irish-language search workloads.

What This Is

Base model: cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
Format: ONNX
Quantization: ONNX Runtime dynamic weight quantization
Weight type: qint8
Quantized ops: MatMul, Gemm, Attention
Primary artifact: model.onnx

This is a derivative artifact of the upstream Apache-2.0 model. Please review the upstream model card for training/background details:

Upstream model: https://huggingface.co/cross-encoder/mmarco-mMiniLMv2-L12-H384-v1

Public Proxy Results

Measured on a bilingual public proxy reranking suite used for Irish/English screening:

200 queries total
100 English + 100 Irish
20 candidates per query
batch_size=64
max_length=256
threads=32

Quality:

Overall MRR@10: 0.97125
Irish MRR@10: 0.9475
English MRR@10: 0.9950
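
For readers unfamiliar with the metric, MRR@10 averages the reciprocal rank of the first relevant candidate within the top 10 results, scoring a query as 0 when no relevant candidate appears there. The sketch below is a minimal illustration only, not the evaluation script behind the numbers above; the function name `mrr_at_k` and the toy relevance flags are hypothetical.

```python
from typing import Sequence


def mrr_at_k(ranked_relevance: Sequence[Sequence[bool]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant candidate within the top k.

    ranked_relevance[i][j] is True when the j-th ranked candidate for
    query i is relevant. Queries with no relevant hit in the top k score 0.
    """
    if not ranked_relevance:
        raise ValueError("need at least one query")
    total = 0.0
    for flags in ranked_relevance:
        for rank, is_relevant in enumerate(flags[:k], start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)


# Toy example (hypothetical data): first hit at rank 1, at rank 2, and no hit.
toy = [[True, False], [False, True], [False, False]]
toy_mrr = mrr_at_k(toy)  # (1 + 0.5 + 0) / 3

# With equal-sized subsets (100 English + 100 Irish), the overall score
# is the plain average of the two per-language scores reported above.
overall = (100 * 0.9950 + 100 * 0.9475) / 200
```

Because the two language subsets are the same size, the reported overall figure is consistent with the per-language figures.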

Runtime on 100 queries:

p50 query latency: 168.9 ms
p95 query latency: 215.8 ms
p99 query latency: 244.3 ms
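
As a rough guide to how percentile latencies like these can be derived, the sketch below times a scoring callable once per query and summarizes the results. It assumes the nearest-rank percentile definition, which may differ from the method used for the figures above; `timed_latencies_ms`, `percentile`, and `score_fn` are hypothetical names, with `score_fn` standing in for whatever scores the candidates of one query.

```python
import math
import time
from typing import Callable, Dict, Iterable, List, Sequence


def timed_latencies_ms(score_fn: Callable[[object], object],
                       queries: Iterable[object]) -> List[float]:
    """Run score_fn once per query and record wall-clock time in milliseconds."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        score_fn(query)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile (an assumed definition, see lead-in)."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize per-query latencies as p50, p95, and p99."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

Tail percentiles such as p99 are sensitive to sample size, so they are best read alongside the number of queries timed.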

Important caveat:

These are public proxy numbers, not final in-domain gov.ie relevance judgments.

Files

model.onnx: dynamic-int8 ONNX reranker
config.json: model config
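tokenizer.json: serialized fast tokenizer
tokenizer_config.json: tokenizer settings
special_tokens_map.json: special token mapping
sentencepiece.bpe.model: SentencePiece model used by the tokenizer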
artifact_info.json: provenance and quantization details
benchmark_summary.json: machine-readable public benchmark summary

Usage

from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer
import numpy as np
import onnxruntime as ort

repo_id = "temsa/mmarco-mMiniLMv2-L12-H384-v1-onnx-cpu-qint8"

model_path = hf_hub_download(repo_id=repo_id, filename="model.onnx")
tokenizer = AutoTokenizer.from_pretrained(repo_id)
session = ort.InferenceSession(
    model_path,
    providers=["CPUExecutionProvider"],
)

pairs = [
    ("how to renew a passport", "Renew your passport online or at a passport office."),
    ("conas pas a athnuachan", "Is féidir do phas a athnuachan ar líne."),
]

encoded = tokenizer(
    [q for q, _ in pairs],
    [d for _, d in pairs],
    padding=True,
    truncation=True,
    max_length=256,
    return_tensors="np",
)

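# Cast inputs to int64 and keep only the inputs the ONNX graph declares.
input_names = {i.name for i in session.get_inputs()}
feed = {k: v.astype(np.int64, copy=False) for k, v in encoded.items() if k in input_names}
# One raw score per (query, document) pair.
scores = session.run(None, feed)[0].reshape(-1)
print(scores.tolist())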
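
In a reranking workload, the scores for all candidates of one query are typically sorted to produce the final order. The helper below is a minimal sketch of that step, assuming higher scores indicate higher relevance (the usual convention for MS MARCO cross-encoders); `rank_candidates` and the example values are hypothetical.

```python
from typing import List, Sequence, Tuple


def rank_candidates(candidates: Sequence[str],
                    scores: Sequence[float],
                    top_k: int = 10) -> List[Tuple[str, float]]:
    """Return the top_k (candidate, score) pairs, best score first."""
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Made-up scores for three candidate passages of a single query.
ranked = rank_candidates(["doc a", "doc b", "doc c"], [0.1, 2.5, -1.0], top_k=2)
```

To use it with the snippet above, pass the document texts for one query together with `scores.tolist()`.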

Provenance

This artifact was produced from the published fp32 ONNX export of the upstream model using ONNX Runtime dynamic quantization, with no retraining or calibration.
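
For anyone wanting to reproduce a comparable artifact, the sketch below shows how the settings listed under "What This Is" map onto ONNX Runtime's `quantize_dynamic` call. It is an approximation, not the author's actual script: the file paths are hypothetical, and every option not named in this card (for example `per_channel`) is left at its library default as an assumption.

```python
from pathlib import Path

# Op types listed in this card as quantized.
QUANTIZED_OPS = ["MatMul", "Gemm", "Attention"]


def quantize_to_int8(fp32_path: str, int8_path: str) -> Path:
    """Dynamically quantize an fp32 ONNX model's weights to signed int8.

    Dynamic quantization needs no calibration data: weights are quantized
    ahead of time and activation scales are computed at inference time.
    """
    # Imported here so the module can be loaded without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input=fp32_path,
        model_output=int8_path,
        op_types_to_quantize=QUANTIZED_OPS,
        weight_type=QuantType.QInt8,
    )
    return Path(int8_path)


# Example call (hypothetical paths):
# quantize_to_int8("model_fp32.onnx", "model.onnx")
```

The authoritative record of what was actually run is `artifact_info.json` in this repo.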
