mmarco-mMiniLMv2-L12-H384-v1 — ONNX for transformers.js
ONNX export of cross-encoder/mmarco-mMiniLMv2-L12-H384-v1, a multilingual cross-encoder reranker trained on mMARCO (the multilingual version of MS MARCO, 14 languages), ready to run in Node.js or the browser via 🤗 transformers.js — no Python required.
We published this conversion because no working ONNX export of this model existed, and small multilingual cross-encoders are exactly the class of model that fits CPU latency budgets for local-first reranking.
Files
| File | Precision | Size |
|---|---|---|
| onnx/model.onnx | fp32 | ~470 MB |
| onnx/model_quantized.onnx | int8 (dynamic) | ~118 MB |
Conversion: optimum-cli export onnx + onnxruntime quantize_dynamic (QInt8, weights-only).
Quality & latency (measured)
- int8 ≈ fp32 ranking quality on our retrieval benchmarks, at ~1.9× the speed and ¼ the size.
- CPU latency (Node.js, onnxruntime, int8): ~30 ms / 10 pairs · ~65 ms / 24 pairs (Ryzen-class desktop CPU).
- As a second-stage reranker over multilingual-e5-large top-30 candidates, on the LoCoMo retrieval benchmark (n=1,536 questions): MRR 0.503 → 0.597, recall@10 0.626 → 0.705. The gain comes from the cross-encoder reading the full candidate text together with the query — text a bi-encoder never embedded.
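For readers who want to reproduce this kind of comparison on their own data, here is a minimal sketch of how MRR and recall@k can be computed from ranked candidate lists. The helper names (`reciprocalRank`, `recallAtK`, `mean`) and the toy data are hypothetical, and the recall definition used here (fraction of relevant items found in the top k) is an assumption; the exact definition used for the LoCoMo numbers above is not stated.

```javascript
// Hypothetical helpers, not part of any library.
// rankedIds: candidate ids ordered best-first; relevantIds: ground-truth ids.

// Reciprocal rank of the first relevant candidate (0 if none was retrieved).
function reciprocalRank(rankedIds, relevantIds) {
  const relevant = new Set(relevantIds);
  const idx = rankedIds.findIndex((id) => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// Fraction of relevant ids that appear in the top k (assumed definition).
function recallAtK(rankedIds, relevantIds, k) {
  if (relevantIds.length === 0) return 0;
  const topK = new Set(rankedIds.slice(0, k));
  const hits = relevantIds.filter((id) => topK.has(id)).length;
  return hits / relevantIds.length;
}

function mean(values) {
  return values.length === 0
    ? 0
    : values.reduce((a, b) => a + b, 0) / values.length;
}

// Toy data, made up for illustration only.
const questions = [
  { ranked: ['d3', 'd1', 'd7'], relevant: ['d1'] },
  { ranked: ['d2', 'd9', 'd4'], relevant: ['d2', 'd4'] },
];

const mrr = mean(questions.map((q) => reciprocalRank(q.ranked, q.relevant)));
const recallAt2 = mean(questions.map((q) => recallAtK(q.ranked, q.relevant, 2)));
console.log({ mrr, recallAt2 });
```

Running the same functions once on the first-stage ordering and once on the reranked ordering gives a before/after pair comparable in shape to the figures above.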
Usage (transformers.js)
    import { AutoTokenizer, AutoModelForSequenceClassification } from '@huggingface/transformers';

    const model_id = 'SugoLabs/mmarco-mMiniLMv2-L12-H384-v1';
    const tokenizer = await AutoTokenizer.from_pretrained(model_id);
    const model = await AutoModelForSequenceClassification.from_pretrained(model_id, {
      dtype: 'q8', // or 'fp32' for onnx/model.onnx
      device: 'cpu',
    });

    const query = '¿cómo configuro el hook de cierre de sesión?';
    const docs = [
      'El hook de Stop se cablea en settings.json y recuerda marcar las memorias usadas.',
      'La paella valenciana tradicional lleva pollo, conejo y garrofó.',
    ];

    const inputs = tokenizer(new Array(docs.length).fill(query), {
      text_pair: docs, padding: true, truncation: true, max_length: 512,
    });
    const { logits } = await model(inputs);

    // One logit per (query, doc) pair; sigmoid → relevance score in [0, 1].
    const scores = (await logits.sigmoid().tolist()).map(row => row[0]);
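The snippet above stops at per-pair scores. As a rough sketch of the remaining reranking step, the scores can be attached to their documents and sorted in descending order. The helper name `rankByRelevance`, its `topK` parameter, and the example logits are hypothetical and only for illustration; it takes plain arrays, so it runs without loading the model.

```javascript
// Hypothetical helper, not part of transformers.js.
function sigmoid(x) {
  return 1 / (1 + Math.exp(-x));
}

// docs: array of candidate texts; logits: one raw logit per doc, same order.
// Returns the topK candidates, most relevant first, keeping the original index.
function rankByRelevance(docs, logits, topK = docs.length) {
  if (docs.length !== logits.length) {
    throw new Error('docs and logits must have the same length');
  }
  return docs
    .map((text, index) => ({ index, text, score: sigmoid(logits[index]) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

// Made-up logits for illustration; real values come from the model.
const exampleDocs = ['doc about hooks', 'doc about paella', 'doc about settings'];
const exampleLogits = [2.0, -3.0, 0.5];
const ranked = rankByRelevance(exampleDocs, exampleLogits, 2);
console.log(ranked.map((r) => r.index)); // [0, 2]
```

Because the sigmoid is monotonic, sorting by raw logits gives the same order; the sigmoid only matters if the score is later compared against a fixed threshold. With the usage snippet, the raw logits would be `logits.tolist().map(row => row[0])`.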
Credits
- Original model: cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 (Apache-2.0), built on mMiniLMv2 (Microsoft) and trained on mMARCO (UNICAMP).
- ONNX conversion and benchmarks: SUGO Labs — published as part of Miura Recall, our local-first associative memory (MCP server) with zero-LLM retrieval hot path.