Octen-Embedding-0.6B ONNX

ONNX export of Octen/Octen-Embedding-0.6B for inference with ONNX Runtime (Python, Web/WASM, etc.).

  • Base model: Octen/Octen-Embedding-0.6B
  • Pooling: last token, L2-normalized
  • Max sequence length: 512
  • Dynamic batch: True
  • Hidden size: 1024
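
To make the pooling entry concrete, here is a minimal NumPy sketch of last-token pooling followed by L2 normalization. It is illustrative only: the usage example below suggests the exported graph already returns pooled (batch, 1024) vectors, so you should not need to run this yourself. The function name last_token_pool and the toy shapes are assumptions, not part of this repo.

```python
import numpy as np

def last_token_pool(hidden_states, attention_mask):
    """Select each sequence's last non-padding hidden state and L2-normalize it.

    hidden_states: (batch, seq_len, hidden) float array
    attention_mask: (batch, seq_len) array of 0/1
    """
    hidden_states = np.asarray(hidden_states)
    attention_mask = np.asarray(attention_mask)
    batch, seq_len = attention_mask.shape
    # Index of the last position where the mask is 1; scanning the reversed
    # mask makes this work for both left- and right-padded batches.
    last_idx = seq_len - 1 - np.argmax(attention_mask[:, ::-1], axis=1)
    pooled = hidden_states[np.arange(batch), last_idx]
    # L2-normalize, guarding against division by zero.
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)
```

With unit-length vectors, a plain dot product between two embeddings equals their cosine similarity.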

Files

File                             Description
model.fp16.onnx (+ .onnx.data)   FP16 weights, ~1.1 GB
model.int8.onnx                  INT8 quantized, ~560 MB
tokenizer/                       Hugging Face tokenizer (same as base model)
conversion-metadata.json         Export config

Usage

Python (ONNX Runtime)

import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

# Download the INT8 model from this repo.
# (model.fp16.onnx also needs its external .onnx.data file in the same directory.)
repo_id = "geoffsee/octen-embedding-0.6b-onnx"
model_path = hf_hub_download(repo_id=repo_id, filename="model.int8.onnx")

# The tokenizer files live in the tokenizer/ subfolder of the repo.
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="tokenizer")
session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

# Tokenize to NumPy arrays and cast to int64 for the ONNX inputs.
encoded = tokenizer("Your text here", return_tensors="np", padding=True, truncation=True, max_length=512)
outputs = session.run(
    None,
    {
        "input_ids": encoded["input_ids"].astype(np.int64),
        "attention_mask": encoded["attention_mask"].astype(np.int64),
    },
)
embeddings = outputs[0]  # (batch, 1024)
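
Because the embeddings are L2-normalized, ranking documents against a query reduces to a dot product. The following is a rough sketch, assuming you already have a query vector and a matrix of document vectors shaped like the embeddings array above; the helper name rank_by_similarity is hypothetical and not part of this repo.

```python
import numpy as np

def rank_by_similarity(query_vec, doc_vecs):
    """Return (indices, scores) of documents sorted by descending cosine similarity.

    query_vec: (hidden,) array
    doc_vecs: (num_docs, hidden) array
    """
    query_vec = np.asarray(query_vec, dtype=np.float32)
    doc_vecs = np.asarray(doc_vecs, dtype=np.float32)
    # Re-normalize defensively in case the inputs are not exactly unit length.
    query_vec = query_vec / max(float(np.linalg.norm(query_vec)), 1e-12)
    doc_norms = np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    doc_vecs = doc_vecs / np.clip(doc_norms, 1e-12, None)
    # For unit vectors, the dot product is the cosine similarity.
    scores = doc_vecs @ query_vec
    order = np.argsort(-scores)
    return order, scores[order]
```

For example, after embedding a query and several passages in one batch, rank_by_similarity(embeddings[0], embeddings[1:]) would order the passages by relevance to the query.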

JavaScript / ONNX Runtime Web

Use model.fp16.onnx or model.int8.onnx with onnxruntime-web. Load the tokenizer from tokenizer/ (e.g. with a compatible JS tokenizer).
