How to use temsa/mmarco-mMiniLMv2-L12-H384-v1-onnx-cpu-qint8 with sentence-transformers:
from sentence_transformers import CrossEncoder

model = CrossEncoder("temsa/mmarco-mMiniLMv2-L12-H384-v1-onnx-cpu-qint8")
query = "Which planet is known as the Red Planet?"
passages = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet.",
]
scores = model.predict([(query, passage) for passage in passages])
print(scores)
mmarco-mMiniLMv2-L12-H384-v1 ONNX CPU Dynamic INT8
This repo publishes a plain ONNX Runtime dynamic-int8 quantization of cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 for CPU reranking.
It is intended as an easy-to-download multilingual reranker artifact for English and Irish-language search workloads.
What This Is
- Base model: cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
- Format: ONNX
- Quantization: ONNX Runtime dynamic weight quantization
- Weight type: qint8
- Quantized ops: MatMul, Gemm, Attention
- Primary artifact: model.onnx
This is a derivative artifact of the upstream Apache-2.0 model. Please review the upstream model card (cross-encoder/mmarco-mMiniLMv2-L12-H384-v1) for training and background details.
Public Proxy Results
Measured on a bilingual public proxy reranking suite used for Irish/English screening:
- 200 queries total
- 100 English + 100 Irish
- 20 candidates per query
- batch_size=64
- max_length=256
- threads=32
Quality:
- Overall MRR@10: 0.97125
- Irish MRR@10: 0.9475
- English MRR@10: 0.9950
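For readers unfamiliar with the metric, MRR@10 averages, over all queries, the reciprocal of the rank at which the first relevant candidate appears within the top 10 (0 if none appears). The following is a minimal sketch of that definition, not the benchmark's actual evaluation code; the function names and toy labels are hypothetical. It also checks that the overall figure above equals the query-weighted mean of the two per-language figures.

```python
from typing import Sequence


def reciprocal_rank_at_k(ranked_labels: Sequence[int], k: int = 10) -> float:
    """Return 1/rank of the first relevant item in the top k, or 0.0 if none."""
    for rank, label in enumerate(ranked_labels[:k], start=1):
        if label > 0:
            return 1.0 / rank
    return 0.0


def mrr_at_k(queries: Sequence[Sequence[int]], k: int = 10) -> float:
    """Average the reciprocal rank over all queries."""
    if not queries:
        raise ValueError("need at least one query")
    return sum(reciprocal_rank_at_k(q, k) for q in queries) / len(queries)


# Hypothetical toy data: each row holds relevance labels for one query,
# already sorted by descending reranker score.
toy_queries = [
    [1, 0, 0],  # first relevant hit at rank 1
    [0, 1, 0],  # first relevant hit at rank 2
    [0, 0, 0],  # no relevant hit
]
toy_mrr = mrr_at_k(toy_queries)

# Consistency check on the reported numbers: 100 Irish + 100 English queries.
overall = (100 * 0.9475 + 100 * 0.9950) / 200
print(toy_mrr, round(overall, 5))
```

Because both languages contribute the same number of queries, the overall score is simply the midpoint of the Irish and English scores.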
Runtime on 100 queries:
- p50 query latency: 168.9 ms
- p95 query latency: 215.8 ms
- p99 query latency: 244.3 ms
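To collect comparable latency figures on your own hardware, one option is to time each query end to end and summarize with percentiles. The sketch below assumes a nearest-rank percentile definition and uses hypothetical helper names; the card does not state how the published percentiles were computed, so small differences are expected.

```python
import math
import time
from typing import Callable, List, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def time_queries(run_query: Callable[[], object], n_queries: int) -> List[float]:
    """Call run_query n_queries times and return per-call latencies in milliseconds."""
    latencies_ms = []
    for _ in range(n_queries):
        start = time.perf_counter()
        run_query()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


# Synthetic latencies (1.0 .. 100.0 ms) stand in for real measurements.
sample_ms = [float(v) for v in range(1, 101)]
summary = {f"p{p}": percentile(sample_ms, p) for p in (50, 95, 99)}
print(summary)
```

In practice `run_query` would wrap tokenization plus `session.run` for all candidates of one query, so that the measured latency matches what a caller experiences.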
Important caveat:
- These are public proxy numbers, not final in-domain gov.ie relevance judgments.
Files
- model.onnx: dynamic-int8 ONNX reranker
- config.json: model config
- tokenizer.json
- tokenizer_config.json
- special_tokens_map.json
- sentencepiece.bpe.model
- artifact_info.json: provenance and quantization details
- benchmark_summary.json: machine-readable public benchmark summary
Usage
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer
import numpy as np
import onnxruntime as ort

repo_id = "temsa/mmarco-mMiniLMv2-L12-H384-v1-onnx-cpu-qint8"

# Download the quantized model and load the matching tokenizer from the same repo.
model_path = hf_hub_download(repo_id=repo_id, filename="model.onnx")
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Run on CPU only; this artifact is quantized for CPU inference.
session = ort.InferenceSession(
    model_path,
    providers=["CPUExecutionProvider"],
)

# Each pair is (query, candidate passage).
pairs = [
    ("how to renew a passport", "Renew your passport online or at a passport office."),
    ("conas pas a athnuachan", "Is féidir do phas a athnuachan ar líne."),
]

# Tokenize queries and passages together as sentence pairs.
encoded = tokenizer(
    [q for q, _ in pairs],
    [d for _, d in pairs],
    padding=True,
    truncation=True,
    max_length=256,
    return_tensors="np",
)

# Cast every tokenizer output to int64 before feeding it to ONNX Runtime.
feed = {k: (v.astype(np.int64) if v.dtype != np.int64 else v) for k, v in encoded.items()}

# The first output holds the scores; flatten to one value per pair.
scores = session.run(None, feed)[0].reshape(-1)
print(scores.tolist())
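Once scores are available, reranking amounts to sorting candidates by score in descending order. The sketch below uses hypothetical helper names and made-up scores. It assumes the model emits one raw relevance logit per pair, which this card does not state explicitly; under that assumption a sigmoid maps each logit to a value in (0, 1) without changing the ordering.

```python
import math
from typing import List, Optional, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def rerank(
    passages: Sequence[str],
    scores: Sequence[float],
    top_k: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Return (passage, score) pairs sorted from most to least relevant."""
    if len(passages) != len(scores):
        raise ValueError("passages and scores must have the same length")
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


# Made-up scores for illustration; real values come from session.run(...).
example_passages = ["Passage A", "Passage B", "Passage C"]
example_scores = [-1.2, 3.4, 0.1]

top = rerank(example_passages, example_scores, top_k=2)
for passage, score in top:
    print(f"{passage}: logit={score:.2f}, sigmoid={sigmoid(score):.3f}")
```

Sorting on the raw scores is sufficient for ranking; the sigmoid is only useful if a bounded score is needed, for example for thresholding.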
Provenance
This artifact was produced from the published fp32 ONNX export of the upstream model using ONNX Runtime dynamic quantization, with no retraining or calibration.
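A comparable artifact could be produced roughly as follows. This is a sketch under assumptions, not the exact script used for this repo: it calls ONNX Runtime's `quantize_dynamic` with the weight type and op types listed in "What This Is", the function name and file names are hypothetical, and any other options (for example per-channel quantization) are left at their defaults because the card does not specify them. `artifact_info.json` remains the authoritative record.

```python
from pathlib import Path

# Op types listed under "What This Is" in this card.
OP_TYPES_TO_QUANTIZE = ["MatMul", "Gemm", "Attention"]


def quantize_reranker(fp32_path: str, int8_path: str) -> Path:
    """Dynamically quantize an fp32 ONNX model's weights to int8 (no calibration data)."""
    # Imported inside the function so the sketch can be loaded
    # even where onnxruntime is not installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input=fp32_path,
        model_output=int8_path,
        op_types_to_quantize=OP_TYPES_TO_QUANTIZE,
        weight_type=QuantType.QInt8,
    )
    return Path(int8_path)
```

A call such as `quantize_reranker("model_fp32.onnx", "model.onnx")` would then write the quantized file, where `model_fp32.onnx` is a placeholder name for the upstream fp32 export.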