How to use from the sentence-transformers library
from sentence_transformers import CrossEncoder

model = CrossEncoder("temsa/mmarco-mMiniLMv2-L12-H384-v1-onnx-cpu-qint8")

query = "Which planet is known as the Red Planet?"
passages = [
	"Venus is often called Earth's twin because of its similar size and proximity.",
	"Mars, known for its reddish appearance, is often referred to as the Red Planet.",
	"Jupiter, the largest planet in our solar system, has a prominent red spot.",
	"Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]

scores = model.predict([(query, passage) for passage in passages])
print(scores)

mmarco-mMiniLMv2-L12-H384-v1 ONNX CPU Dynamic INT8

This repo publishes a plain ONNX Runtime dynamic-int8 quantization of cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 for CPU reranking.

It is intended as an easy-to-download multilingual reranker artifact for English and Irish-language search workloads.

What This Is

  • Base model: cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
  • Format: ONNX
  • Quantization: ONNX Runtime dynamic weight quantization
  • Weight type: qint8
  • Quantized ops: MatMul, Gemm, Attention
  • Primary artifact: model.onnx

This is a derivative artifact of the upstream Apache-2.0 model. For training and background details, see the upstream model card for cross-encoder/mmarco-mMiniLMv2-L12-H384-v1.

Public Proxy Results

Measured on a bilingual public proxy reranking suite used for Irish/English screening:

  • 200 queries total
  • 100 English + 100 Irish
  • 20 candidates per query
  • batch_size=64
  • max_length=256
  • threads=32
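One way these settings could map onto an ONNX Runtime setup is sketched below. It assumes that threads=32 refers to intra-op threads, which the benchmark notes do not state. build_session and batched are hypothetical helper names, not part of this repo.

```python
from typing import Iterator, List, Sequence, Tuple

BATCH_SIZE = 64   # batch_size from the benchmark settings
MAX_LENGTH = 256  # max_length from the benchmark settings (passed to the tokenizer)
THREADS = 32      # threads from the benchmark settings


def build_session(model_path: str, threads: int = THREADS):
    """Create a CPU session. Assumption: 'threads' means intra-op threads."""
    import onnxruntime as ort  # imported lazily so batched() works without it

    options = ort.SessionOptions()
    options.intra_op_num_threads = threads
    return ort.InferenceSession(
        model_path,
        sess_options=options,
        providers=["CPUExecutionProvider"],
    )


def batched(
    pairs: Sequence[Tuple[str, str]], size: int = BATCH_SIZE
) -> Iterator[List[Tuple[str, str]]]:
    """Yield consecutive slices of at most `size` (query, passage) pairs."""
    for start in range(0, len(pairs), size):
        yield list(pairs[start:start + size])
```

With 20 candidates per query, a batch of 64 pairs spans a little over three queries, so per-query latency depends on how pairs are grouped into batches.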

Quality:

  • Overall MRR@10: 0.97125
  • Irish MRR@10: 0.9475
  • English MRR@10: 0.9950
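For readers unfamiliar with the metric, here is a minimal sketch of how MRR@10 is conventionally computed. The function names and toy data are illustrative and are not taken from the benchmark code. The last lines check that the overall figure is the mean of the two per-language figures, which holds because the splits are equal in size.

```python
from typing import Sequence


def reciprocal_rank_at_k(relevance: Sequence[int], k: int = 10) -> float:
    """relevance[i] is 1 if the candidate at rank i+1 is relevant, else 0."""
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0  # no relevant candidate within the top k


def mrr_at_k(rankings: Sequence[Sequence[int]], k: int = 10) -> float:
    """Mean of the per-query reciprocal ranks."""
    return sum(reciprocal_rank_at_k(r, k) for r in rankings) / len(rankings)


# Toy data, not taken from the benchmark: three queries with ranked candidates.
toy = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
print(mrr_at_k(toy))  # (1 + 0.5 + 0) / 3 = 0.5

# Equal-sized splits (100 English + 100 Irish): overall is the mean of the two.
overall = (0.9950 + 0.9475) / 2
print(round(overall, 5))  # 0.97125
```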

Runtime on 100 queries:

  • p50 query latency: 168.9 ms
  • p95 query latency: 215.8 ms
  • p99 query latency: 244.3 ms
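Percentiles like these can be reproduced on your own hardware with a small timing harness. The sketch below is an assumption about how such numbers might be collected, not the procedure used for this card. run_query is a hypothetical callable that scores all candidates for one query.

```python
import time
from typing import Callable, Dict, List, Sequence

import numpy as np


def time_queries(run_query: Callable[[str], object], queries: Sequence[str]) -> List[float]:
    """Return the wall-clock latency in milliseconds for each query."""
    latencies_ms = []
    for query in queries:
        start = time.perf_counter()
        run_query(query)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def summarize(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Compute p50/p95/p99 with NumPy's default linear interpolation."""
    p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
    return {"p50": float(p50), "p95": float(p95), "p99": float(p99)}
```

With only 100 timed queries, p99 is determined by the one or two slowest runs, so it tends to vary more between runs than p50.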

Important caveat:

  • These are public proxy numbers, not final in-domain gov.ie relevance judgments.

Files

  • model.onnx: dynamic-int8 ONNX reranker
  • config.json: model config
  • tokenizer.json
  • tokenizer_config.json
  • special_tokens_map.json
  • sentencepiece.bpe.model
  • artifact_info.json: provenance and quantization details
  • benchmark_summary.json: machine-readable public benchmark summary

Usage

from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer
import numpy as np
import onnxruntime as ort

repo_id = "temsa/mmarco-mMiniLMv2-L12-H384-v1-onnx-cpu-qint8"

model_path = hf_hub_download(repo_id=repo_id, filename="model.onnx")
tokenizer = AutoTokenizer.from_pretrained(repo_id)
session = ort.InferenceSession(
    model_path,
    providers=["CPUExecutionProvider"],
)

pairs = [
    ("how to renew a passport", "Renew your passport online or at a passport office."),
    ("conas pas a athnuachan", "Is féidir do phas a athnuachan ar líne."),
]

encoded = tokenizer(
    [q for q, _ in pairs],
    [d for _, d in pairs],
    padding=True,
    truncation=True,
    max_length=256,
    return_tensors="np",
)

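# Keep only the inputs the ONNX graph declares, cast to int64.
input_names = {i.name for i in session.get_inputs()}
feed = {k: v.astype(np.int64) for k, v in encoded.items() if k in input_names}
# Flatten the first output to one score per (query, document) pair.
scores = session.run(None, feed)[0].reshape(-1)
print(scores.tolist())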
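In a reranking pipeline, the scores for one query's candidates are then sorted. The sketch below assumes the outputs are unnormalized relevance scores where higher means more relevant, which matches typical cross-encoder behaviour but is not stated on this card. rank_by_score and sigmoid are hypothetical helpers, and the example scores are made up.

```python
import numpy as np


def rank_by_score(passages, scores, top_k=None):
    """Return (passage, score) tuples sorted from most to least relevant."""
    order = np.argsort(-np.asarray(scores, dtype=np.float64))
    if top_k is not None:
        order = order[:top_k]
    return [(passages[i], float(scores[i])) for i in order]


def sigmoid(scores):
    """Optionally squash raw scores into (0, 1); the ordering is unchanged."""
    return 1.0 / (1.0 + np.exp(-np.asarray(scores, dtype=np.float64)))


# Illustrative scores only, not real model output.
passages = ["passage A", "passage B", "passage C"]
example_scores = [-2.1, 4.3, 0.7]
for passage, score in rank_by_score(passages, example_scores):
    print(f"{score:6.2f}  {passage}")
```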

Provenance

This artifact was produced from the published fp32 ONNX export of the upstream model using ONNX Runtime dynamic quantization, with no retraining or calibration.
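A comparable artifact could be produced roughly as follows with ONNX Runtime's quantize_dynamic. This is a sketch under assumptions: the file names are placeholders, and the card does not state options beyond the weight type and op list, so everything else is left at library defaults. artifact_info.json is the place to check the actual settings.

```python
QUANTIZED_OPS = ["MatMul", "Gemm", "Attention"]  # op list from this model card


def quantize_to_int8(fp32_path: str, int8_path: str) -> None:
    """Apply ONNX Runtime dynamic weight quantization (no calibration data needed)."""
    # Imported lazily so the module can be inspected without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input=fp32_path,
        model_output=int8_path,
        op_types_to_quantize=QUANTIZED_OPS,
        weight_type=QuantType.QInt8,
    )


# Example call (file names are placeholders, not paths from this repo):
# quantize_to_int8("model_fp32.onnx", "model.onnx")
```

Dynamic quantization stores weights as int8 ahead of time and computes activation scales at inference time, which is why no calibration dataset is involved.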
