Aptivra-Base-110M

A compact (110M-parameter) sentence-embedding model for skill routing and semantic retrieval, fine-tuned from intfloat/e5-base-v2. It encodes a user request and a catalog of skill/tool documents into 768-dim vectors so an agent, router, or MCP server can retrieve the most relevant candidates.

Aptivra-Base-110M is an embedding model, not a chat model.

✅ Use it for	❌ Do not use it for
query / document embeddings	chat completion
skill routing	instruction following
semantic retrieval	text generation
vector search	(it produces vectors, not text)
candidate ranking

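Shipped for multiple runtimes: PyTorch (safetensors), ONNX, OpenVINO, and GGUF (in the companion repo raghunath1/Aptivra-Base-110M-GGUF). All produce 768-dim embeddings; quantized variants closely approximate the fp32 reference rather than matching it exactly (see the fidelity tables below).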

⚠️ Experimental — validate before production use

This is a research preview, not a validated or production-certified system.

What is proven: the base router beats the intfloat/e5-base-v2 baseline at the embedding level on the clean routing eval (Recall@1 0.820 → 0.958, Δ +0.138, 95% CI [0.102, 0.175]).

What is NOT validated: per-pack routing quality (domain packs are experimental and unvalidated), high-stakes packs (medical/legal/finance/… are research-only), and the governance layer (not in the served path).

A wrong route can cause downstream harm even though this model does not produce the final answer. Validate behavior and evidence for your use case, and apply downstream validation, policy, safety, and permission checks before acting on a route. The full disclaimer, evidence, and gate status are in the source repository.

Release: v91soup40 (2026-06-08)

This release promotes the boundary-MNRL model soup: v91soup40 = 0.40·boundary-hard-negative MNRL fine-tune + 0.60·prior checkpoint. The soup weight was selected on a validation split, and results are reported on an untouched test split. Measured deltas vs the prior served checkpoint:

Production (cosine + structural rerank): human-routing R@1 0.9875 → 0.9900, adversarial 0.9743 → 0.9814.
Near-twin discrimination (embedding-only, held-out boundary set): R@1 0.42 → 0.70, R@5 0.59 → 0.92; per-pack hard-negative positive-top-1 0.79 → 0.84.
Abstention separation preserved; no regression on any held-out eval.
Corpus hygiene shipped alongside: eval gold-label repair, genuine-twin registry, and 55 id↔content scrambles fixed without id renames (stable keys kept).
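
The 0.40 / 0.60 soup described above amounts to a linear interpolation of two checkpoints' parameters. A minimal sketch of that idea, assuming both checkpoints share identical parameter names and shapes. The function name `soup_state_dicts` and the toy arrays are illustrative only and are not taken from the source repository:

```python
import numpy as np

def soup_state_dicts(finetuned, prior, alpha=0.40):
    """Interpolate two checkpoints: alpha * finetuned + (1 - alpha) * prior."""
    if finetuned.keys() != prior.keys():
        raise ValueError("checkpoints must have identical parameter names")
    return {
        name: alpha * finetuned[name] + (1.0 - alpha) * prior[name]
        for name in finetuned
    }

# Toy stand-ins for real state dicts (hypothetical parameter names).
finetuned = {"layer.weight": np.array([1.0, 2.0]), "layer.bias": np.array([0.0])}
prior = {"layer.weight": np.array([0.0, 0.0]), "layer.bias": np.array([1.0])}

souped = soup_state_dicts(finetuned, prior, alpha=0.40)
print(souped["layer.weight"], souped["layer.bias"])
```

In practice `alpha` would be swept on a validation split and the chosen value scored once on a held-out test split, as the release notes describe.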

Published formats this release (all v91soup40, parity-validated): safetensors (PyTorch), ONNX (fp32, O3-optimized, int8-dynamic), OpenVINO (fp32). Backend fidelity vs the fp32 torch reference (mean cosine over the 400-query routing eval):

Backend	Precision	Fidelity vs fp32
safetensors (PyTorch)	fp32	1.00000 (reference)
ONNX `model.onnx`	fp32	1.00000
ONNX `model_O3.onnx`	fp32 (O3)	0.99999
ONNX `model_qint8_avx512_vnni.onnx`	int8 dynamic	0.99422
OpenVINO `openvino_model`	fp32	0.99985

Not published: OpenVINO int8 (static-activation PTQ collapses this embedding model — fidelity ≈ 0; weight-only regen pending) and GGUF (convert tooling requires the full llama.cpp source tree, unavailable in the build env). Derive these from safetensors if needed.

Input format (important)

Unlike vanilla e5, this fine-tune was trained and evaluated on plain text — no query: / passage: prefix. Feed the raw request and the raw skill/document text. The model applies mean pooling and returns L2-normalized vectors; compare them with cosine similarity (a dot product, since they are normalized). Adding e5-style prefixes will not reproduce the validated numbers below.
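
To illustrate why a plain dot product is enough once vectors are L2-normalized, here is a small NumPy sketch. It uses made-up 3-dim vectors rather than real 768-dim model outputs, and `l2_normalize` is a helper name introduced here for illustration:

```python
import numpy as np

def l2_normalize(x):
    """Scale each row of x to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Toy un-normalized vectors standing in for pooled model outputs.
raw_q = np.array([[3.0, 4.0, 0.0]])
raw_s = np.array([[4.0, 3.0, 0.0], [0.0, 0.0, 5.0]])

# Dot product of normalized vectors.
dot_scores = l2_normalize(raw_q) @ l2_normalize(raw_s).T

# Textbook cosine similarity computed from the raw vectors.
norms = np.linalg.norm(raw_q, axis=1)[:, None] * np.linalg.norm(raw_s, axis=1)[None, :]
cosine = (raw_q @ raw_s.T) / norms

print(dot_scores, cosine)
```

Both computations give the same scores, which is why the usage examples below compare embeddings with `q @ s.T`.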

Available formats

This repo (PyTorch + ONNX + OpenVINO):

Path	Backend	Precision	Size	Best for
`model.safetensors`	PyTorch / sentence-transformers	fp32	438 MB	reference; training; GPU
`onnx/model.onnx`	ONNX Runtime	fp32	436 MB	portable CPU/GPU inference
`onnx/model_O3.onnx`	ONNX Runtime	fp32 (graph-opt O3)	436 MB	fastest fp32 on CPU
`onnx/model_qint8_avx2.onnx`	ONNX Runtime	int8	110 MB	x86 CPU (AVX2)
`onnx/model_qint8_avx512.onnx`	ONNX Runtime	int8	110 MB	x86 CPU (AVX-512)
`onnx/model_qint8_avx512_vnni.onnx`	ONNX Runtime	int8	110 MB	x86 CPU (AVX-512 VNNI, e.g. Ice Lake+)
`onnx/model_qint8_arm64.onnx`	ONNX Runtime	int8	110 MB	ARM CPU (Apple Silicon, AWS Graviton)
`openvino/openvino_model.xml` (+`.bin`)	OpenVINO	fp32	436 MB	Intel CPU/iGPU/NPU
`openvino/openvino_model_qint8.xml` (+`.bin`)	OpenVINO	int8 (weight-only)	110 MB	Intel CPU, smaller

GGUF (llama.cpp) lives in the companion repo raghunath1/Aptivra-Base-110M-GGUF: F16, Q8_0, Q4_K_M.

Backend parity (Recall@1) — PRIOR checkpoint only

⚠️ The table below was measured on a prior checkpoint (~5,897-skill corpus), and the binaries it describes are not the v91soup40 release. For v91soup40, safetensors, ONNX (fp32, O3, int8-dynamic), and OpenVINO (fp32) are published; OpenVINO int8 and GGUF regeneration is pending (see "Release: v91soup40" above).

Identical 400-query routing eval, ~5,897-skill corpus, plain text, top-1 retrieval. Every backend receives the same (512-token-truncated) input, so deltas are pure backend/quantization effect. safetensors is the reference.

Backend	Precision	Fidelity vs fp32 (mean cosine)	Recall@1
safetensors (PyTorch)	fp32	1.00000 (reference)	0.958
ONNX (`model.onnx`, `model_O3.onnx`)	fp32	1.00000	≡ reference
ONNX int8 (`_qint8_`)	int8	0.98957 ¹	≈ reference
OpenVINO (`openvino_model`)	fp32	0.99986	≡ reference
OpenVINO int8 (`openvino_model_qint8`)	int8 (weight-only)	0.99938	≡ reference
GGUF F16	F16	0.99999	≡ reference
GGUF Q8_0	Q8_0	0.99984	≡ reference
GGUF Q4_K_M	Q4_K_M	0.98618	≈ reference

Method: each backend encodes the identical plain-text routing eval; fidelity = mean cosine of its embeddings to the fp32 reference. When fidelity ≈ 1.0, Recall@1 equals the reference by construction; the int8 / Q4 rows perturb embeddings ~1–1.4% (trading a little accuracy for size/speed).

¹ Measured on the arm64 int8 build (Apple Silicon). The avx2 / avx512 / avx512_vnni files use the same dynamic-int8 recipe for the respective x86 instruction sets.
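
The fidelity metric in the Method note can be reproduced for any backend you export yourself. A minimal sketch, assuming you already hold two embedding matrices for the same inputs in the same order; `mean_cosine_fidelity` is a name chosen here, not a function from the source repository:

```python
import numpy as np

def mean_cosine_fidelity(reference, candidate):
    """Mean row-wise cosine similarity between two (n, dim) embedding matrices."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    if reference.shape != candidate.shape:
        raise ValueError("both backends must embed the same inputs")
    num = (reference * candidate).sum(axis=1)
    den = np.linalg.norm(reference, axis=1) * np.linalg.norm(candidate, axis=1)
    return float((num / den).mean())

# Toy check: identical embeddings score 1.0; one orthogonal row halves the mean.
ref = np.array([[1.0, 0.0], [0.0, 1.0]])
same = mean_cosine_fidelity(ref, ref)
half = mean_cosine_fidelity(ref, np.array([[1.0, 0.0], [1.0, 0.0]]))
print(same, half)
```

A high mean can hide a few badly perturbed rows, so it may also be worth inspecting the minimum per-row cosine before trusting a quantized build.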

OpenVINO int8 is weight-only quantization (weights int8, activations fp) — chosen because static activation PTQ collapses this embedding model. Weight-only preserves fidelity (0.99938) at the int8 size (110 MB).
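
As a rough illustration of what "weights int8, activations fp" means, the sketch below applies symmetric per-row int8 quantization to a toy weight matrix and keeps the activations in floating point. It is a generic NumPy example, not the OpenVINO quantization implementation, and the function names are invented for this note:

```python
import numpy as np

def quantize_weights_int8(w):
    """Symmetric per-row quantization: returns (int8 weights, float scales)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)   # avoid dividing by zero on all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def linear_weight_only(x, q, scale):
    """Dequantize the stored weights on the fly; activations x are never quantized."""
    return x @ (q.astype(np.float32) * scale).T

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)   # toy layer weights
x = rng.normal(size=(2, 8)).astype(np.float32)   # toy activations

q, scale = quantize_weights_int8(w)
max_err = float(np.abs(linear_weight_only(x, q, scale) - x @ w.T).max())
print(q.dtype, max_err)
```

Because only the stored weights are rounded, no activation ranges have to be calibrated, which is the step the text above reports as failing for this model.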

int8 / Q4 are size-and-speed tradeoffs. The table is the honest record of what each precision costs in retrieval accuracy — pick the smallest one whose Recall@1 you can live with.

Usage

Works the same on Windows, Linux, and macOS unless noted. Examples assume Python 3.9+.

1. Python — sentence-transformers (safetensors, recommended reference)

pip install sentence-transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("raghunath1/Aptivra-Base-110M")          # PyTorch
queries = ["set up a browser automation task"]
skills  = ["Automate a web browser to click, type, and navigate pages."]

q = model.encode(queries, normalize_embeddings=True)
s = model.encode(skills,  normalize_embeddings=True)
print(q @ s.T)   # cosine similarity (vectors are already L2-normalized)

2. ONNX Runtime

Via sentence-transformers (picks the right CPU kernel automatically):

pip install "sentence-transformers[onnx]"        # CPU
# pip install "sentence-transformers[onnx-gpu]"  # NVIDIA GPU

from sentence_transformers import SentenceTransformer

# fp32:
model = SentenceTransformer("raghunath1/Aptivra-Base-110M", backend="onnx")

# int8 — choose the file matching your CPU:
#   x86 with AVX-512 VNNI (e.g. Ice Lake+): onnx/model_qint8_avx512_vnni.onnx  (otherwise _avx512, or _avx2 on older CPUs)
#   ARM (Apple Silicon, Graviton): onnx/model_qint8_arm64.onnx
model = SentenceTransformer(
    "raghunath1/Aptivra-Base-110M", backend="onnx",
    model_kwargs={"file_name": "onnx/model_qint8_arm64.onnx",
                  "provider": "CPUExecutionProvider"},   # on macOS, pin CPU to avoid CoreML EP
)
emb = model.encode(["semantic search query"], normalize_embeddings=True)

Raw onnxruntime (no sentence-transformers), with manual mean-pool + L2-norm:

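import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

repo = "raghunath1/Aptivra-Base-110M"
tok = AutoTokenizer.from_pretrained(repo)
# Fetch the fp32 graph from the Hub (or pass the path of a local copy instead).
sess = ort.InferenceSession(hf_hub_download(repo, "onnx/model.onnx"), providers=["CPUExecutionProvider"])
enc = tok(["semantic search query"], padding=True, truncation=True, max_length=512, return_tensors="np")
# Feed only the inputs this graph declares (token_type_ids may or may not be among them).
wanted = {i.name for i in sess.get_inputs()}
out = sess.run(None, {k: v for k, v in enc.items() if k in wanted})[0]   # per-token embeddings
mask = enc["attention_mask"][..., None].astype(np.float32)
emb = (out * mask).sum(1) / np.clip(mask.sum(1), 1e-9, None)   # mean pool over real tokens
emb /= np.linalg.norm(emb, axis=1, keepdims=True)             # L2 normalize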

Windows: onnxruntime ships prebuilt wheels; the avx512_vnni int8 file is fastest on recent Intel.
Linux: same; on AWS Graviton / ARM servers use the arm64 int8 file.
macOS (Apple Silicon): use the arm64 int8 file and pin CPUExecutionProvider (the CoreML EP can fail to build this graph).
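
The platform notes above can be folded into a small selection helper. This is a sketch under stated assumptions: `pick_onnx_int8_file` and `read_linux_cpu_flags` are names invented here, feature detection reads `/proc/cpuinfo` and therefore only works on Linux, and the AVX2 fallback when no flags are available is a conservative choice of this example, not a recommendation from the source repository:

```python
import platform

def pick_onnx_int8_file(machine, cpu_flags=frozenset()):
    """Map a CPU architecture (plus optional x86 feature flags) to an int8 ONNX file."""
    machine = machine.lower()
    if machine in ("arm64", "aarch64"):
        return "onnx/model_qint8_arm64.onnx"
    if machine in ("x86_64", "amd64"):
        if "avx512_vnni" in cpu_flags:
            return "onnx/model_qint8_avx512_vnni.onnx"
        if "avx512f" in cpu_flags:
            return "onnx/model_qint8_avx512.onnx"
        return "onnx/model_qint8_avx2.onnx"   # assumed fallback when features are unknown
    return "onnx/model.onnx"                  # unknown architecture: use fp32

def read_linux_cpu_flags(path="/proc/cpuinfo"):
    """Return the x86 feature flags listed by the Linux kernel, or an empty set."""
    try:
        with open(path) as fh:
            for line in fh:
                if line.startswith("flags"):
                    return frozenset(line.split(":", 1)[1].split())
    except OSError:
        pass   # not Linux, or the file is unreadable
    return frozenset()

print(pick_onnx_int8_file(platform.machine(), read_linux_cpu_flags()))
```

The returned path can be passed as `file_name` in `model_kwargs`, as in the sentence-transformers example above.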

3. Transformers.js (browser / Node.js)

import { pipeline } from '@huggingface/transformers';
const extractor = await pipeline('feature-extraction', 'raghunath1/Aptivra-Base-110M',
  { dtype: 'q8' });                       // uses the ONNX int8 weights
const emb = await extractor(['semantic search query'],
  { pooling: 'mean', normalize: true });

Runs in-browser (WebAssembly/WebGPU) and in Node on Windows/Linux/macOS — same code.

4. OpenVINO (Intel CPU / iGPU / NPU)

pip install "sentence-transformers[openvino]"

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("raghunath1/Aptivra-Base-110M", backend="openvino")
# int8: model_kwargs={"file_name": "openvino_model_qint8.xml"}
emb = model.encode(["semantic search query"], normalize_embeddings=True)

Best on Intel hardware (Core/Xeon, Arc, NPU). The Python API is identical on Windows/Linux/macOS; the OpenVINO runtime wheel is installed automatically.

5. llama.cpp / GGUF

GGUF builds are in raghunath1/Aptivra-Base-110M-GGUF. Use embedding mode with mean pooling + L2 normalize:

llama-embedding -m Aptivra-Base-110M-Q8_0.gguf -p "semantic search query" \
  --pooling mean --embd-normalize 2

LM Studio / Ollama caveat: these tools are built around chat/completion models. This is an embedding model — use it only through an embeddings endpoint (e.g. llama-server → POST /v1/embeddings, or Ollama's /api/embeddings), not the chat UI. It will not generate text.
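
For the embeddings-endpoint route mentioned above, a client can be sketched with the standard library alone. The assumptions here are that a llama-server instance is already running in embedding mode at `http://localhost:8080`, and that it accepts and returns the OpenAI-style embeddings schema (`input` in the request, `data[i].embedding` in the response). The `model` value and both function names are placeholders for this example:

```python
import json
import urllib.request

def build_embeddings_request(base_url, texts, model="Aptivra-Base-110M"):
    """Build a POST request for an OpenAI-style /v1/embeddings endpoint."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def embed(base_url, texts):
    """Send the request and return one embedding (list of floats) per input text."""
    with urllib.request.urlopen(build_embeddings_request(base_url, texts)) as resp:
        body = json.load(resp)
    return [item["embedding"] for item in body["data"]]

# Building the request needs no running server; embed(...) does.
req = build_embeddings_request("http://localhost:8080", ["semantic search query"])
print(req.full_url, req.get_method())
```

Check that the server applies mean pooling and L2 normalization as in the `llama-embedding` command above, or normalize the returned vectors yourself before comparing them.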

Evaluation

The proven claim (embedding-level lift over baseline). Measured with the reranker OFF — isolating the embedding's own contribution — against intfloat/e5-base-v2, paired bootstrap, on the current corpus:

Eval set (rerank OFF)	baseline	Aptivra	Δ Recall@1	95% CI
Human routing	0.820	0.958	+0.138	[0.102, 0.175]
Weak-pair (noisy)	0.654	0.701	+0.047	[0.040, 0.054]
Adversarial	0.784	0.817	+0.033	[-0.006, 0.071] (not significant)
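
To re-derive this kind of table on your own corpus, Recall@1 and a paired bootstrap interval can be sketched as below. The per-query hit lists are toy data, and the percentile interval with 2,000 resamples is an assumption of this example; the source states only that a paired bootstrap was used:

```python
import random

def recall_at_1(hits):
    """Fraction of queries whose top-1 result is the gold skill (hits are 0/1)."""
    return sum(hits) / len(hits)

def paired_bootstrap_ci(baseline_hits, model_hits, n_resamples=2000, seed=0):
    """Return (delta, low, high): observed Recall@1 gain and a 95% percentile CI."""
    if len(baseline_hits) != len(model_hits):
        raise ValueError("both systems must be scored on the same queries")
    rng = random.Random(seed)
    n = len(model_hits)
    # Pairing: resample per-query differences so both systems see the same queries.
    diffs = [m - b for m, b in zip(model_hits, baseline_hits)]
    deltas = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        deltas.append(sum(sample) / n)
    deltas.sort()
    low = deltas[int(0.025 * n_resamples)]
    high = deltas[int(0.975 * n_resamples) - 1]
    return sum(diffs) / n, low, high

# Toy per-query outcomes (1 = correct top-1 route).
baseline_hits = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1]
model_hits = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]

delta, low, high = paired_bootstrap_ci(baseline_hits, model_hits)
print(recall_at_1(baseline_hits), recall_at_1(model_hits), delta, low, high)
```

An interval that includes zero, as in the adversarial row above, means the measured gain cannot be distinguished from no gain at this sample size.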

Full-pipeline numbers (rerank ON) — higher, but the adversarial gains are carried by a lexical reranker that is NOT part of this download:

Evaluation (rerank ON)	Metric
Human routing Recall@1	0.985
Adversarial Recall@1	0.973
Hard-negative pairwise accuracy	0.994

⚠️ These artifacts are the embedding model only. The reranker that produces the rerank-ON numbers lives in the source repository and is not included here. With this model alone you get the rerank-OFF (embedding-only) results — e.g. adversarial Recall@1 ≈ 0.82, not 0.97. The rerank-ON numbers are measured on the routing eval fixtures, and the reranker's structural rules are tuned to those fixture patterns — treat them as in-distribution diagnostics, not a guarantee of open-world generalization.

Metrics are pinned to the current ~5,897-skill corpus snapshot and must be re-derived after corpus changes. Per-pack/domain routing quality is not included here (experimental, unvalidated). Evidence reports (docs/reports/phase-1/) are in the source repository.

How these artifacts were produced

ONNX / OpenVINO: exported from the canonical safetensors with sentence-transformers (export_optimized_onnx_model for O3; export_dynamic_quantized_onnx_model for the int8 variants; backend="openvino" for the OpenVINO export, with weight-only int8 quantization for the OpenVINO int8 variant, since static activation PTQ collapses this model).
GGUF: converted with llama.cpp convert_hf_to_gguf.py (F16), then llama-quantize for Q8_0 / Q4_K_M.
Every derived/quantized artifact is gated by the backend parity table above before release.

Training Data

Tuned on curated skill-routing data derived from a local skill corpus — positive skill-query pairs, hard negatives, human-style routing queries, and adversarial routing examples.

Limitations

The model is only as good as the skill corpus and routing data used with it. Use it as a retrieval/ranking component, not as the final authority for tool execution. Production systems should keep top-k candidates, apply reranking, and enforce policy, safety, and permission checks downstream. Medical, legal, financial, or safety-critical use is research-only — validate on your own cases first.
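
The "keep top-k candidates" advice can be sketched as a small routing helper that returns several scored candidates and abstains when nothing clears a threshold. The skill ids, the toy 2-dim vectors, and the `min_score` value are hypothetical; the source does not publish a recommended threshold, so any cutoff should be calibrated on your own data:

```python
import numpy as np

def route_top_k(query_vec, skill_vecs, skill_ids, k=5, min_score=0.0):
    """Return up to k (skill_id, cosine) pairs, best first; empty list means abstain.

    Assumes query_vec and every row of skill_vecs are already L2-normalized.
    """
    scores = skill_vecs @ query_vec
    order = np.argsort(-scores)[:k]
    return [(skill_ids[i], float(scores[i])) for i in order if scores[i] >= min_score]

# Toy normalized embeddings standing in for model.encode(..., normalize_embeddings=True).
skill_ids = ["browser-automation", "file-search", "calendar"]
skill_vecs = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
query_vec = np.array([1.0, 0.0])

candidates = route_top_k(query_vec, skill_vecs, skill_ids, k=2, min_score=0.5)
abstained = route_top_k(query_vec, skill_vecs, skill_ids, k=2, min_score=1.1)
print(candidates, abstained)
```

The returned candidates are inputs to reranking and to policy, safety, and permission checks; they are not a decision to execute a tool.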

License

MIT. Fine-tuned from intfloat/e5-base-v2 (MIT).
