---
license: cc-by-nc-4.0
language:
- en
tags:
- embeddings
- dense-retrieval
- matryoshka
- rag
- agents
- mteb
- sentence-similarity
- semantic-search
- text-embeddings
- text-embedding
- vector-search
- document-retrieval
- similarity-search
- classification
- clustering
- edge-ai
- on-device
- local-inference
- efficient-ai
- rag-retrieval
library_name: ogma
metrics:
- mteb
model-index:
- name: axiotic/ogma-mini
  results:
  - task:
      type: sts
    dataset:
      name: MTEB STSBenchmark
      type: mteb/stsbenchmark-sts
      split: test
      revision: b0fddb56ed78048fa8b90373c8a3cfc37b684831
    metrics:
    - type: cosine_spearman
      value: 77.71
  - task:
      type: classification
    dataset:
      name: MTEB AmazonPolarityClassification
      type: mteb/amazon_polarity
      split: test
    metrics:
    - type: accuracy
      value: 61.8
  - task:
      type: clustering
    dataset:
      name: MTEB RedditClustering
      type: mteb/reddit-clustering
      split: test
    metrics:
    - type: v_measure
      value: 37.38
  - task:
      type: pair-classification
    dataset:
      name: MTEB TwitterSemEval2015
      type: mteb/twittersemeval2015-pairclassification
      split: test
    metrics:
    - type: cos_sim_ap
      value: 79.66
  - task:
      type: reranking
    dataset:
      name: MTEB MindSmallReranking
      type: mteb/mind_small
      split: validation
    metrics:
    - type: map
      value: 47.39
  - task:
      type: retrieval
    dataset:
      name: MTEB MSMARCO
      type: mteb/msmarco
      split: dev
    metrics:
    - type: ndcg_at_10
      value: 36.21
  - task:
      type: summarization
    dataset:
      name: MTEB SummEval
      type: mteb/summeval
      split: test
    metrics:
    - type: cos_sim_spearman
      value: 31.33
pipeline_tag: sentence-similarity
---

# ogma-mini · 3.5M efficient text embedding model · MTEB 53.07

> Small English text embedding model for semantic search, RAG, vector search, clustering, classification, and agent memory: MTEB 53.07, 3.5M parameters, 1024-token context

**Ogma Mini** is built for edge and resource-constrained deployment. At 3.5M parameters and 14 MB it scores **53.07 MTEB**, 1.41 points above the 32M-parameter Potion-32M (51.66), while fitting in a fraction of the memory.
Ideal for mobile, IoT, browser, and serverless embedding workloads.

## Why the name Ogma?

The model is named after **Ogma** (also written Oghma), the Irish god associated with eloquence and credited in myth with inventing **Ogham**, an early alphabet for encoding language into symbols. That is the core job of an embedding model: turn language into compact vectors that machines can search, compare, cluster, and reason over.

---

## Use cases

ogma-mini is a compact embedding model for **on-device AI**, **edge retrieval**, **local RAG**, **agent memory**, **semantic search**, **classification**, **clustering**, and resource-constrained applications that still need contextual text representations.

Good fits:

- **Mobile, desktop, and embedded applications** that need a small local embedding model.
- **Private search and local RAG** over user files, app data, transcripts, tickets, or knowledge-base snippets.
- **Serverless inference** where cold-start size and memory ceilings are real constraints.
- **Agent memory stores** where embeddings are generated frequently and cost needs to stay low.
- **Efficient retrieval pipelines** that need 1024-token context without a large transformer footprint.

Choose ogma-mini when you want a stronger model than ogma-micro while staying tiny enough for edge and on-device deployments.
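
A minimal sketch of the private-search and agent-memory use cases above, assuming the embeddings have already been computed and are unit-normalised, as ogma-mini's outputs are. Random placeholder vectors stand in for real embeddings, and `l2_normalise`, `top_k`, and `store` are hypothetical names used for illustration only; they are not part of the Ogma package. It uses only the standard library to stay self-contained; a real pipeline would more likely keep the vectors in a tensor or array.

```python
import math
import random


def l2_normalise(vec):
    """Scale a vector to unit length (ogma-mini outputs are already unit-length)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query, store, k=3):
    """Return the k (score, doc_id) pairs with the highest dot product.

    For unit-length vectors the dot product equals cosine similarity.
    """
    scored = [
        (sum(q * d for q, d in zip(query, vec)), doc_id)
        for doc_id, vec in store.items()
    ]
    return sorted(scored, reverse=True)[:k]


# Placeholder data (assumption): random vectors instead of real embeddings.
random.seed(0)
DIM = 256  # ogma-mini output dimension
store = {
    f"note-{i}": l2_normalise([random.gauss(0, 1) for _ in range(DIM)])
    for i in range(100)
}

# Query with a stored vector so the best match is known in advance.
query = store["note-42"]
results = top_k(query, store, k=3)
for score, doc_id in results:
    print(f"{doc_id}: {score:.3f}")
```

With real data, `store` would hold document embeddings and `query` a query embedding produced by the model; a brute-force scan like this is usually adequate for small local stores before a vector index becomes necessary.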

---

## Highlights

- 🏆 **MTEB avg 53.07**: beats Potion-32M (51.66) at **3.5M parameters** (9× fewer params)
- 📦 **14 MB**: fits in browser and mobile memory budgets
- 📏 **1024-token context**: 4× longer than all-MiniLM-L6-v2 (256 tokens)
- 🔀 **Asymmetric encoding** via task tokens: `[QRY]`, `[DOC]`, `[SYM]`
- 📐 **Matryoshka dims**: [256, 128, 64, 32]; compress to 32d for ultra-low latency

---

## Performance

### MTEB English: 54/54 tasks (category-averaged)

Benchmarked with [MTEB](https://github.com/embeddings-benchmark/mteb) v2.10.7 on the standard 54-task English benchmark using category averaging (same methodology as the MTEB leaderboard).

| Category | ogma-mini | all-MiniLM-L6-v2 | Δ vs MiniLM |
|---|---|---|---|
| Classification | **61.80** | 62.62 | -0.82 |
| Clustering | **37.38** | 41.94 | -4.56 |
| PairClassification | **79.66** | 82.37 | -2.71 |
| Reranking | **47.39** | 58.04 | -10.65 |
| Retrieval | **36.21** | 41.95 | -5.74 |
| STS | **77.71** | 78.90 | -1.19 |
| Summarization | **31.33** | 30.81 | +0.52 |
| **Overall** | **53.07** | *56.09* | **-3.02** |

> MiniLM and Potion reference scores from the [Model2Vec results page](https://github.com/MinishLab/model2vec/blob/main/results/README.md).

### Why choose Ogma Mini?

ogma-mini is the right choice when parameter count and memory are hard constraints. It outperforms Potion-32M despite being 9× smaller. Use **ogma-small** when you can afford 8.6M parameters; use **ogma-micro** when you need to go below 3M.

### CPU Inference Benchmark

Benchmarked on AMD Ryzen Threadripper PRO 3955WX (16-core/32-thread), PyTorch 2.10, batch of 100 mixed-length documents.

| Model | Params | 1T·bs1 (docs/s) | 1T·bs1 latency | 1T·bs32 (docs/s) | 16T·bs32 (docs/s) |
|---|---|---|---|---|---|
| potion-base-8M | 7.6M | 6,892 | 0.14 ms | 18,021 | 17,040 |
| potion-base-32M | 32.3M | 6,826 | 0.15 ms | 17,984 | 17,328 |
| **ogma-small** | **8.6M** | **92.9** | **10.8 ms** | **60.9** | **255.6** |
| all-MiniLM-L6-v2 | 22.7M | 53.1 | 18.8 ms | 40.5 | 227.9 |
| **ogma-base** | **13.3M** | **48.3** | **20.7 ms** | **28.9** | **121.6** |
| bge-small-en-v1.5 | 33.4M | 26.8 | 37.3 ms | 19.8 | 115.3 |
| ogma-mini | 3.5M | ~200 | ~5 ms | ~130 | ~450 |
| bge-base-en-v1.5 | 109.5M | 7.6 | 131.7 ms | 4.8 | 30.2 |

Column key: `1T` and `16T` are 1 and 16 CPU threads; `bs1` and `bs32` are batch sizes 1 and 32.

> Potion models are static (lookup-based); their near-zero inference cost comes at the price of no contextual understanding and a fixed 256-token context. Transformer models like Ogma and MiniLM understand context. **ogma-small is 1.75× faster** than MiniLM single-threaded at batch size 1 and **1.12× faster** batched at 16 threads (16T·bs32).

### Safety: Toxicity & Prompt Injection Detection

Evaluated on the Ogma transformer architecture (same family). Embeddings are extracted and then fed to a logistic regression (LR) or MLP classifier head; the embedding model itself is not fine-tuned. `all-MiniLM-L6-v2` is the baseline.

#### 1. Jigsaw Toxic Comment Classification

**Dataset:** `Arsive/toxicity_classification_jigsaw` (binary toxicity classification)

**Train:** 25,960 · **Test:** 6,490

| Model | Classifier | Accuracy | F1 | Precision | Recall | AUC-ROC |
|---|---|---|---|---|---|---|
| **Ogma** | LogReg | 89.12% | **88.26%** | 89.09% | 87.44% | 95.74% |
| **Ogma** | MLP | 88.91% | 87.98% | 89.14% | 86.85% | 95.92% |
| MiniLM | LogReg | 87.32% | 86.25% | 87.46% | 85.07% | 94.96% |
| MiniLM | MLP | 91.71% | 91.24% | 90.13% | 92.39% | **97.16%** |

Ogma (LR) leads MiniLM (LR) by **+2.01% F1**. MiniLM (MLP) leads overall on this dataset: the larger training set (25,960 samples) allows the MLP to compensate for MiniLM's slightly weaker base representations.

#### 2. Prompt Injection Detection: deepset/prompt-injections

**Dataset:** `deepset/prompt-injections` (binary injection detection)

**Train:** 546 · **Test:** 116 (low-data regime)

| Model | Classifier | Accuracy | F1 | Precision | Recall | AUC-ROC |
|---|---|---|---|---|---|---|
| **Ogma** | LogReg | 86.21% | 84.62% | **100.0%** | 73.33% | 97.77% |
| **Ogma** | MLP | **90.52%** | **90.27%** | 96.23% | 85.0% | **98.1%** |
| MiniLM | LogReg | 82.76% | 80.39% | 97.62% | 68.33% | 94.52% |
| MiniLM | MLP | 87.07% | 86.24% | 95.92% | 78.33% | 93.96% |

Ogma leads across both classifiers: **+4.03% F1 (MLP)**, **+4.23% F1 (LogReg)**. Ogma's representations are better separated in the low-data regime: with LogReg it achieves 100% precision, meaning zero false positives on the 116-example test set.

#### 3. Prompt Injection Detection: neuralchemy/Prompt-injection-dataset

**Dataset:** `neuralchemy/Prompt-injection-dataset` (binary injection detection)

**Train:** 4,391 · **Test:** 942

| Model | Classifier | Accuracy | F1 | Precision | Recall | AUC-ROC |
|---|---|---|---|---|---|---|
| **Ogma** | LogReg | 95.22% | 95.93% | 95.84% | 96.01% | 99.30% |
| **Ogma** | MLP | **95.44%** | **96.16%** | 94.89% | **97.46%** | **99.37%** |
| MiniLM | LogReg | 94.59% | 95.38% | 95.46% | 95.29% | 98.92% |
| MiniLM | MLP | 93.95% | 94.85% | 94.59% | 95.11% | 98.92% |

Ogma leads on every metric for both classifier heads: **+1.31% F1 (MLP)**, **+0.55% F1 (LR)**. Both models perform well at this scale; Ogma maintains its edge and achieves higher AUC-ROC (99.37% vs 98.92%).

#### Summary

| Task | Ogma best F1 | MiniLM best F1 | Δ |
|---|---|---|---|
| Jigsaw Toxicity | 88.26% (LR) | 91.24% (MLP) | −2.98% |
| deepset Injection | **90.27% (MLP)** | 86.24% (MLP) | **+4.03%** |
| neuralchemy Injection | **96.16% (MLP)** | 95.38% (LR) | **+0.78%** |

Ogma is a stronger feature extractor for **prompt injection detection**, the safety-critical task for agent pipelines.
MiniLM edges ahead on toxicity when given sufficient labelled data and a more powerful classifier head. For agentic use cases where detecting adversarial instructions is the priority, Ogma representations are the better choice.

---

## Architecture

| Property | Value |
|---|---|
| Architecture | Custom Transformer |
| Internal dim (`d_model`) | 256 |
| Output dim (`d_output`) | 256 |
| Transformer layers | 2 |
| Attention heads | 4 |
| Vocabulary | 30,000 (SentencePiece / AlbertTokenizer) |
| Max sequence length | **1,024 tokens** |
| Pooling | Mean pooling |
| Task tokens | `[QRY]` (query), `[DOC]` (document), `[SYM]` (symmetric) |
| Matryoshka dims | [32, 64, 128, 256] |
| Output normalisation | L2 (unit sphere) |
| Parameters | 3.5M |
| Model file | `model.safetensors` (14 MB) |

**Key design choices:**

- **Task token prepend:** A learnable task token (`[QRY]`, `[DOC]`, or `[SYM]`) is prepended to the input sequence before the transformer. This enables true asymmetric encoding in a single model with a single forward pass.
- **Matryoshka training:** The model is trained with Matryoshka Representation Learning, meaning embeddings truncated to any supported sub-dimension remain well-calibrated without retraining.
- **Mean pooling:** The average of all token outputs (excluding padding) produces the sentence embedding, which consistently outperforms CLS-token pooling in the Ogma architecture family.
- **L2 normalisation:** All outputs are unit-normalised, so cosine similarity equals the dot product and squared Euclidean distance equals 2 − 2 × cosine similarity. All three give the same ranking, which simplifies downstream usage.

---

## Usage

### Installation

```bash
pip install torch tokenizers huggingface_hub pyyaml
```

### Basic Encoding

```python
from huggingface_hub import snapshot_download
from tokenizers import Tokenizer
import sys, torch

# 1. Download model files
model_path = snapshot_download("axiotic/ogma-mini")

# 2. Load model (bundled source code)
sys.path.insert(0, model_path)
from ogma_model import OgmaModel
model = OgmaModel.from_checkpoint(model_path, device="cpu")
model.eval()

# 3. Tokenizer
N_SPECIAL = 7
_tok = Tokenizer.from_file(f"{model_path}/tokenizer.json")

def encode(texts: list, max_length: int = 1024):
    all_ids = []
    for text in texts:
        enc = _tok.encode(text)
        ids, toks = enc.ids, enc.tokens
        # Strip CLS/SEP added by tokenizer
        if toks and toks[0] in ("[CLS]", ""):
            ids, toks = ids[1:], toks[1:]
        if toks and toks[-1] in ("[SEP]", ""):
            ids = ids[:-1]
        # Shift into Ogma's vocabulary space and add BOS/EOS
        ogma_ids = [2] + [rid + N_SPECIAL for rid in ids] + [3]
        all_ids.append(ogma_ids[:max_length])
    # Pad every sequence to the longest one in the batch
    ml = max(len(ids) for ids in all_ids)
    token_ids = torch.zeros(len(texts), ml, dtype=torch.long)
    attn_mask = torch.zeros(len(texts), ml, dtype=torch.long)
    for i, ids in enumerate(all_ids):
        token_ids[i, :len(ids)] = torch.tensor(ids)
        attn_mask[i, :len(ids)] = 1
    return token_ids, attn_mask

# 4. Encode (symmetric mode: good for clustering, classification, STS)
from config import TaskToken

sentences = [
    "The quick brown fox jumps over the lazy dog",
    "A fast auburn vulpine leaps over an idle canine",
]

with torch.no_grad():
    token_ids, attn_mask = encode(sentences)
    embeddings = model.encode(token_ids, attn_mask, task=TaskToken.SYM)

print(embeddings.shape)  # torch.Size([2, 256]): one 256-d vector per sentence

sim = (embeddings[0] @ embeddings[1]).item()
print(f"Cosine similarity: {sim:.4f}")  # L2-normalised, dot product = cosine
```

### Asymmetric Retrieval (Query / Document)

Use `TaskToken.QRY` for query embeddings and `TaskToken.DOC` for document embeddings in retrieval pipelines. This asymmetric encoding is a first-class feature of the Ogma architecture.

```python
# Asymmetric retrieval: encode queries with QRY, passages with DOC
from config import TaskToken

queries = [
    "What is knowledge distillation?",
    "How does retrieval-augmented generation work?",
]
documents = [
    "Knowledge distillation trains a smaller student model to mimic a larger teacher...",
    "Retrieval-Augmented Generation (RAG) combines a dense retriever with a language model...",
]

with torch.no_grad():
    q_ids, q_mask = encode(queries)
    d_ids, d_mask = encode(documents)
    q_emb = model.encode(q_ids, q_mask, task=TaskToken.QRY)  # (N, 256)
    d_emb = model.encode(d_ids, d_mask, task=TaskToken.DOC)  # (M, 256)

# Dot product == cosine similarity (embeddings are L2-normalised)
scores = q_emb @ d_emb.T  # (N, M)
print(scores)
```

### Matryoshka: Flexible Dimensionality

Ogma supports Matryoshka Representation Learning. Truncate and re-normalise to any supported sub-dimension for faster indexing or lower memory usage, with no retraining required.

```python
import torch.nn.functional as F

with torch.no_grad():
    token_ids, attn_mask = encode(sentences)
    emb_full = model.encode(token_ids, attn_mask)  # full 256-d embeddings

# Truncate to any supported sub-dimension and re-normalise; no retraining needed
# Supported dims: [32, 64, 128, 256]
emb_32 = F.normalize(emb_full[:, :32], dim=-1)
emb_64 = F.normalize(emb_full[:, :64], dim=-1)
emb_128 = F.normalize(emb_full[:, :128], dim=-1)
```

### LangChain Integration

```python
# LangChain integration (custom embeddings class)
from langchain_core.embeddings import Embeddings
from huggingface_hub import snapshot_download
from tokenizers import Tokenizer
import sys, torch

class OgmaEmbeddings(Embeddings):
    def __init__(self, model_name: str = "axiotic/ogma-mini", device: str = "cpu"):
        model_path = snapshot_download(model_name)
        # ogma_model and config are bundled with the model files, so they can
        # only be imported once the snapshot directory is on sys.path.
        sys.path.insert(0, model_path)
        from ogma_model import OgmaModel
        from config import TaskToken
        self.model = OgmaModel.from_checkpoint(model_path, device=device)
        self.model.eval()
        self._tok = Tokenizer.from_file(f"{model_path}/tokenizer.json")
        self._task = TaskToken
        self._device = device

    def _encode(self, texts, task):
        # (encode function from Basic Encoding above)
        from your_module import encode  # or inline the encode function
        with torch.no_grad():
            ids, mask = encode(texts)
            return self.model.encode(ids.to(self._device), mask.to(self._device), task=task)

    def embed_documents(self, texts):
        return self._encode(texts, task=self._task.DOC).cpu().numpy().tolist()

    def embed_query(self, text):
        return self._encode([text], task=self._task.QRY).cpu().numpy()[0].tolist()

embeddings = OgmaEmbeddings()
```

---

## Model Family

| Model | Params | Size | MTEB Avg | Class | Clust | PairClass | Rerank | Ret | STS | Summ | d_out | Context |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **[ogma-large](https://huggingface.co/axiotic/ogma-large)** | 32.4M | 124 MB | **57.41** | 68.6 | 41.6 | 84.0 | 53.1 | 43.7 | 83.7 | 30.9 | 256 | 1024 |
| **[ogma-base](https://huggingface.co/axiotic/ogma-base)** | 13.3M | 51 MB | **57.04** | 67.89 | 41.49 | 83.73 | 51.25 | 42.36 | 82.84 | 29.73 | 256 | 1024 |
| **[ogma-small](https://huggingface.co/axiotic/ogma-small)** | 8.6M | 33 MB | 56.34 | 66.67 | 40.69 | 82.91 | 50.51 | 42.05 | 82.00 | 29.59 | 256 | 1024 |
| **[ogma-mini](https://huggingface.co/axiotic/ogma-mini)** | 3.5M | 14 MB | 53.07 | 61.80 | 37.38 | 79.66 | 47.39 | 36.21 | 77.71 | 31.33 | 256 | 1024 |
| **[ogma-micro](https://huggingface.co/axiotic/ogma-micro)** | 2.3M | 8.9 MB | 52.19 | 59.57 | 36.88 | 78.62 | 49.74 | 33.09 | 75.63 | 31.77 | 128 | 1024 |
| *all-MiniLM-L6-v2* | 22.7M | 87 MB | *56.09* | 62.62 | 41.94 | 82.37 | 58.04 | 41.95 | 78.90 | 30.81 | 384 | 256 |
| *potion-base-32M* | 32.3M | 123 MB | *51.66* | 65.97 | 35.29 | 78.17 | 50.92 | 33.52 | 74.22 | 29.78 | 256 | inf |
| *potion-base-8M* | 7.6M | 29 MB | *50.03* | 64.44 | 32.93 | 76.62 | 49.73 | 31.71 | 73.24 | 29.28 | 256 | inf |

All Ogma: MTEB 2.10.7, 54-task standard
English set, category-averaged. MiniLM/Potion: published scores from the [Model2Vec results page](https://github.com/MinishLab/model2vec/blob/main/results/README.md).

---

## Training Details

| Property | Value |
|---|---|
| Teacher model | `jinaai/jina-embeddings-v5-text-small` (CC-BY-NC-4.0) |
| Training paradigm | Knowledge distillation from cached teacher embeddings |
| Training data | ~7M curated English sentence pairs |
| Tokenizer | AlbertTokenizer (SentencePiece, vocab=30,000) |
| Embedding initialisation | PCA of teacher embeddings (128d) projected to d_model |
| Loss | Distillation + contrastive (balanced schedule) |
| Evaluation framework | MTEB 2.10.7 |

---

## Limitations

- **No text generation.** Ogma is an encoder-only embedding model.
- **English only.** Training data and evaluation are English-only.
- **Slower than static models.** Transformer inference is 40-100× slower than static models (Potion, Model2Vec) on CPU. The trade-off: contextual understanding and 4× longer sequences.
- **Non-commercial licence.** Due to distillation from a CC-BY-NC-4.0 teacher, Ogma inherits the NonCommercial restriction. Commercial use requires a separate Jina AI licence or retraining with a permissive teacher (Apache 2.0 compatible models like BGE or E5 can substitute at the cost of a full retraining run).
- **Reranking gap.** ogma-mini lags behind MiniLM-L6-v2 on reranking tasks (category average delta: -10.65). This is an architectural characteristic: the model optimises for semantic similarity and classification over pairwise ranking.

---

## Licence & Attribution

This model is released under **[CC-BY-NC-4.0](https://creativecommons.org/licenses/by-nc/4.0/)** (Creative Commons Attribution-NonCommercial 4.0 International).

**Required attribution (must be included in all uses):**

> This model was trained via knowledge distillation from
> `jina-embeddings-v5-text-small` (https://huggingface.co/jinaai/jina-embeddings-v5-text-small)
> by Jina AI, licensed under CC-BY-NC-4.0.

---

## Citation

```bibtex
@misc{ogma2026,
  title  = {Ogma: Efficient Dense Retrieval via Structured Embeddings},
  author = {Axiotic AI},
  year   = {2026},
  url    = {https://huggingface.co/axiotic/ogma-mini},
}
```