mmarco-mMiniLMv2-L12-H384-v1 — ONNX for transformers.js

ONNX export of cross-encoder/mmarco-mMiniLMv2-L12-H384-v1, a multilingual cross-encoder reranker trained on mMARCO (the multilingual version of MS MARCO, 14 languages), ready to run in Node.js or the browser via 🤗 transformers.js — no Python required.

We published this conversion because no working ONNX export of this model existed, and small multilingual cross-encoders are exactly the class of model that fits CPU latency budgets for local-first reranking.

Files

File                        Precision       Size
onnx/model.onnx             fp32            ~470 MB
onnx/model_quantized.onnx   int8 (dynamic)  ~118 MB

Conversion: exported with optimum-cli export onnx, then quantized with onnxruntime's quantize_dynamic (QInt8, weights only).

Quality & latency (measured)

  • int8 ≈ fp32 ranking quality on our retrieval benchmarks, at ~1.9× the speed and ¼ the size.
  • CPU latency (Node.js, onnxruntime, int8): ~30 ms / 10 pairs · ~65 ms / 24 pairs (Ryzen-class desktop CPU).
  • As a second-stage reranker over multilingual-e5-large top-30 candidates, on the LoCoMo retrieval benchmark (n=1,536 questions): MRR 0.503 → 0.597, recall@10 0.626 → 0.705. The gain comes from the cross-encoder reading the full candidate text together with the query — text a bi-encoder never embedded.
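For readers who want to reproduce this kind of comparison on their own data, the two metrics quoted above can be sketched as follows. This is a minimal illustration, not the evaluation script used for the numbers in this card: the helper names (reciprocalRank, recallAtK, evaluate) and the input shape are hypothetical, and recall@k is assumed to mean the fraction of relevant ids found in the top k, which may differ from the definition used for the LoCoMo run.

```javascript
// Hypothetical evaluation helpers (not part of this repository).
// rankedIds:   candidate ids in ranked order, best first, assumed unique.
// relevantIds: Set of ids judged relevant for the question.

// Reciprocal rank: 1 / (1-based position of the first relevant hit), 0 if none.
function reciprocalRank(rankedIds, relevantIds) {
    const pos = rankedIds.findIndex(id => relevantIds.has(id));
    return pos === -1 ? 0 : 1 / (pos + 1);
}

// Recall@k, assumed here to be |relevant ∩ top-k| / |relevant|.
function recallAtK(rankedIds, relevantIds, k) {
    if (relevantIds.size === 0) return 0;
    const hits = rankedIds.slice(0, k).filter(id => relevantIds.has(id)).length;
    return hits / relevantIds.size;
}

// Average both metrics over a non-empty list of { rankedIds, relevantIds } runs.
function evaluate(runs, k = 10) {
    if (runs.length === 0) throw new Error('evaluate() needs at least one run');
    let mrr = 0;
    let recall = 0;
    for (const { rankedIds, relevantIds } of runs) {
        mrr += reciprocalRank(rankedIds, relevantIds);
        recall += recallAtK(rankedIds, relevantIds, k);
    }
    return { mrr: mrr / runs.length, recall: recall / runs.length };
}
```

Running evaluate once on the bi-encoder ordering and once on the reranked ordering of the same candidates gives a before/after pair comparable in form to the one reported above.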

Usage (transformers.js)

import { AutoTokenizer, AutoModelForSequenceClassification } from '@huggingface/transformers';

const model_id = 'SugoLabs/mmarco-mMiniLMv2-L12-H384-v1';
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const model = await AutoModelForSequenceClassification.from_pretrained(model_id, {
    dtype: 'q8',      // or 'fp32' for onnx/model.onnx
    device: 'cpu',
});

const query = '¿cómo configuro el hook de cierre de sesión?';
const docs = [
    'El hook de Stop se cablea en settings.json y recuerda marcar las memorias usadas.',
    'La paella valenciana tradicional lleva pollo, conejo y garrofó.',
];

const inputs = tokenizer(new Array(docs.length).fill(query), {
    text_pair: docs, padding: true, truncation: true, max_length: 512,
});
const { logits } = await model(inputs);
// One logit per (query, doc) pair; sigmoid → relevance score in [0, 1].
const scores = (await logits.sigmoid().tolist()).map(row => row[0]);
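
To turn the scores into a reranked list, pair each document with its score and sort in descending order. The helper below is a minimal sketch; rankByScore and its topK parameter are hypothetical names introduced for illustration, not part of transformers.js. Because sigmoid is monotonic, sorting by raw logits would produce the same order.

```javascript
// Hypothetical helper: pair each document with its score and sort best-first.
// Keeps the original index so results can be mapped back to the candidate list.
function rankByScore(docs, scores, topK = docs.length) {
    if (docs.length !== scores.length) {
        throw new Error(`expected ${docs.length} scores, got ${scores.length}`);
    }
    return docs
        .map((doc, index) => ({ index, doc, score: scores[index] }))
        .sort((a, b) => b.score - a.score) // stable: ties keep input order
        .slice(0, topK);
}
```

With the docs and scores variables from the snippet above, rankByScore(docs, scores) returns every candidate best-first, and rankByScore(docs, scores, 10) keeps only the ten highest-scoring ones.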

Credits
