mimelens-001-small-bpe-16k-s2

A 14.16M-backbone-parameter BERT-style encoder for position-agnostic file-content-type detection from binary data. It reads a byte window taken from any offset in a file (the first ~1,022 tokens of whatever you pass) and produces a 384-dimensional embedding that classifiers map to one of libmagic's 125 MIME labels. Designed for inputs where you only have a chunk: a forensic-carved fragment, a random disk-block read, a streaming HTTP upload, a single network packet payload.


What MimeLens does

MimeLens classifies file content type from a byte window taken at any offset, not just the header of a complete file.

Existing tools assume whole-file access at a known offset:

  • libmagic and Apache Tika match handcrafted magic-byte signatures, almost always anchored at the file head.
  • Magika (Google) is a small (~1 M-parameter) feedforward network over three 512-byte windows (head, middle, tail) of a known-bounded file.
  • TrID and the PRONOM-based tools (Siegfried, DROID) similarly require a complete file.

These break down on a fragment. MimeLens is pretrained MLM-only on 1024-token windows sampled uniformly at random across files and 64 KB fragments, with no privileged head-of-file position. One checkpoint handles streaming, partial-arrival, mid-file, packet-payload, and forensic-carved inputs uniformly. The trade-off is CPU latency (roughly two orders of magnitude slower than Magika at the medium size; hardware-dependent) in exchange for libmagic's 125-class taxonomy plus position arbitrariness.
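To try the model on mid-file input, you can cut a window at a random offset yourself. The sketch below is a minimal illustration, not the training sampler: `random_window` is a hypothetical helper, and it samples in byte space, whereas the pretraining described above samples 1024-token windows.

```python
import random
from typing import Optional


def random_window(data: bytes, size: int = 4096,
                  rng: Optional[random.Random] = None) -> bytes:
    """Return `size` bytes starting at a uniformly random offset.

    Hypothetical helper: inputs shorter than `size` are returned whole.
    """
    rng = rng or random.Random()
    if len(data) <= size:
        return data
    # Every valid start offset is equally likely; offset 0 is not privileged.
    start = rng.randrange(len(data) - size + 1)
    return data[start:start + size]


if __name__ == "__main__":
    blob = bytes(range(256)) * 64            # 16 KB of stand-in "file" content
    window = random_window(blob, 4096, random.Random(0))
    print(len(window))
```

Passing a seeded `random.Random` keeps the chosen offsets reproducible across evaluation runs.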

The family ships 28 parent cells (3 sizes × 4 vocabs × 2-3 seeds at seq_len=1024) plus an 8-cell short-sequence extension (medium tier × 4 vocabs × 2 seeds at seq_len=256). This README documents one of them.

Short-sequence sibling available. If your inputs are sub-KB (DNS payloads, sub-MTU packets, small forensic fragments), use mjbommar/mimelens-001-small-bpe-16k-s2-seq256 instead. Same architecture, 4× shorter context, ~5× lower CPU latency, BPE-cell accuracy ties or beats this cell on the magic-files probe-fit. See paper Appendix B.5.


Overview

  • This cell: small tier, bpe-16k input pipeline, seed 2
  • Backbone: 14.16M parameters (8 layers, hidden 384, 6 attention heads, head dim 64, RoPE, RMSNorm, no biases, no dropout)
  • Input vocabulary: bpe-16k. 16,384-entry binary BPE tokenizer (binary-tokenizer-001-16k), ~1.73 bytes/token. Reads ~1,765 bytes of the 4 KB buffer.
  • Output: 384-dim mean-pooled body-token embedding
  • Label space: libmagic 125-class MIME taxonomy (full list in paper Appendix)
  • Pretraining: MLM-only, 30% mask ratio, 33 GB stratified multi-source binary corpus, 22,888 gradient updates, single RTX 4060 Ti, ~10.7 h wall-clock
  • License: MIT
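The "~1,765 bytes of the 4 KB buffer" figure follows from the token budget and the tokenizer's compression ratio. A quick back-of-the-envelope check, using only numbers from this card (the ~1,022-token budget and the approximate 1.73 bytes/token):

```python
TOKEN_BUDGET = 1022      # tokens the model reads per input (from this card)
BYTES_PER_TOKEN = 1.73   # approximate bpe-16k ratio (from this card)
BUFFER_BYTES = 4096      # the 4 KB read used in the quick start

bytes_read = TOKEN_BUDGET * BYTES_PER_TOKEN
fraction = bytes_read / BUFFER_BYTES

print(f"~{bytes_read:.0f} bytes read, {fraction:.0%} of a 4 KB buffer")
```

Because the bytes/token ratio is approximate and varies with content, the result lands near, not exactly on, the figure quoted above. In practice well under half of a 4 KB read reaches the model.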

Headline benchmarks (this cell)

| Benchmark | Value |
| --- | --- |
| MIME-125 top-1 (magic-frags, 4 KB head, n=4,096) | 0.756 |
| MIME-125 macro-F1 (magic-frags, 4 KB head) | 0.639 |
| kNN R@1 (magic-frags, 3,147-file gallery / 949 queries) | 0.688 |

Full evaluation (within-cube bootstrap CIs, adversarial sweep, calibration, real-network curves, disk-block matrix, baselines against libmagic 5.46 and TrID 2.24) is in the paper.
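For readers comparing the top-1 and macro-F1 rows: macro-F1 averages per-class F1 with equal weight, so errors on rare classes pull it down more than they pull down top-1 accuracy. The sketch below shows the standard definitions on a toy example. It assumes the average runs over every label seen in either the truth or the predictions; the paper's exact evaluation code may differ.

```python
from collections import Counter


def top1_and_macro_f1(y_true, y_pred):
    """Return (top-1 accuracy, macro-averaged F1) for two label sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted class gains a false positive
            fn[t] += 1   # true class gains a false negative
    labels = set(y_true) | set(y_pred)
    f1s = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1s.append(2 * tp[c] / denom if denom else 0.0)
    top1 = sum(tp.values()) / len(y_true)
    return top1, sum(f1s) / len(f1s)


if __name__ == "__main__":
    # Toy data: a common class predicted well, a rare class missed half the time.
    y_true = ["image/png"] * 8 + ["application/x-executable"] * 2
    y_pred = ["image/png"] * 9 + ["application/x-executable"]
    top1, macro = top1_and_macro_f1(y_true, y_pred)
    print(f"top-1={top1:.3f} macro-F1={macro:.3f}")
```

In the toy data a single miss on the two-sample class costs ten points of top-1 but considerably more macro-F1.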


Quick start

This cell publishes the encoder only (no classifier head baked in). Use it to extract embeddings, then fit a probe, run kNN over a labelled gallery, or fine-tune a head:

import torch
from transformers import AutoModel, AutoTokenizer

repo  = "mjbommar/mimelens-001-small-bpe-16k-s2"
model = AutoModel.from_pretrained(repo, trust_remote_code=True).eval()
tok   = AutoTokenizer.from_pretrained(repo)

# Read up to 4 KB; the context manager closes the file handle afterwards.
with open("path/to/file", "rb") as f:
    window = f.read(4096)

# latin-1 maps every byte value 0-255 to exactly one code point, so no byte is lost.
inputs = tok(window.decode("latin-1"), max_length=1024, truncation=True,
             padding="max_length", return_tensors="pt")
with torch.no_grad():
    embedding = model(**inputs).pooler_output         # (1, 384)

The pre-fit LR probe weights for this cell are not bundled here. The deployed cells and per-size winners (e.g. mimelens-001-medium-bpe-16k-s1) ship a baked classifier head for a one-line pipeline() path.
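Of the options listed above, kNN over a labelled gallery needs no training at all. The following is a minimal pure-Python sketch with toy 3-dimensional vectors standing in for the 384-dimensional embeddings. Cosine similarity is an assumption here (the card does not state the metric), and `knn_label` and `recall_at_1` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def knn_label(query, gallery, gallery_labels):
    """Label of the single nearest gallery embedding (k = 1)."""
    best = max(range(len(gallery)), key=lambda i: cosine(query, gallery[i]))
    return gallery_labels[best]


def recall_at_1(queries, query_labels, gallery, gallery_labels):
    """Fraction of queries whose nearest gallery neighbour has the right label."""
    hits = sum(knn_label(q, gallery, gallery_labels) == y
               for q, y in zip(queries, query_labels))
    return hits / len(queries)


if __name__ == "__main__":
    # Toy stand-ins for embeddings produced by the encoder.
    gallery = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    gallery_labels = ["image/png", "application/x-executable"]
    print(knn_label([0.9, 0.1, 0.0], gallery, gallery_labels))
```

For a gallery of a few thousand files a linear scan like this is adequate; larger galleries would typically use a vectorised or approximate nearest-neighbour index instead.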


Choosing a window

The model reads only the first ~1,022 tokens of whatever you pass: a prefix of the buffer, not the whole window. For this BPE cell that prefix is whatever tokenizes to ~1,022 tokens, typically the first ~1.5 to 2.5 KB.

  • Magic-byte / compressed types (PNG, ZIP, GZIP, JPEG): a short head window (256 B to 1 KB) classifies better than 4 KB. A long high-entropy body dilutes the header signal within the fixed token budget, and the model returns application/octet-stream on a mostly-opaque window. That is correct behaviour for genuinely high-entropy input, not a bug.
  • Fragments / packets: you cannot choose the offset, so pass what you have. This is the regime MimeLens is built for.
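To see how opaque a given window is before blaming the model for an application/octet-stream result, you can measure its byte-level Shannon entropy. This is a generic diagnostic, not part of MimeLens; `byte_entropy` is a hypothetical helper, and no particular threshold is implied by this card.

```python
import math
from collections import Counter


def byte_entropy(window: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (0.0 to 8.0)."""
    if not window:
        return 0.0
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


if __name__ == "__main__":
    print(byte_entropy(b"\x00" * 1024))       # constant data: no uncertainty
    print(byte_entropy(bytes(range(256))))    # uniform bytes: the 8-bit maximum
```

Compressed or encrypted bodies sit close to 8 bits per byte, which is the situation described above where a long body dilutes a short header.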

Recommended deployment regimes

See the family hub README (mjbommar/mimelens-001) for the regime decision tree.


Training

This cell is one point of the size × vocab × seed factorial cube described in the paper (3 sizes × 4 vocabs × 2-3 seeds, 28 cells).

  • Corpus (33 GB, stratified multi-source): binary-30k (assorted ELF/PE/Mach-O), magic-frags (random 64 KB chunks across libmagic's full corpus), assorted packed/raw binaries, a glaurung-sourced binary corpus, Windows drivers.
  • Position-arbitrary windowing: 1024-token windows sampled uniformly at random across files and 64 KB fragments. No privileged "head of file" position. This is the design choice that makes MimeLens work on streaming / partial / random-offset inputs.
  • Objective: MLM with 30% mask ratio (BERT replacement schedule: 80% [MASK], 10% random, 10% original); tied input/output embeddings.
  • Pooling: mean-pool over body tokens for downstream tasks. The BERT-style cls_pool linear projection is not used: under MLM-only training it receives no gradient and remains byte-identical to its random initialisation across all 28 cube cells (paper §3.4 verifies this; left in the saved weights for architectural completeness only).
  • Optimisation: AdamW + cosine LR (peak 5e-4, 2,000-step warmup, 10% floor), bf16 mixed precision, gradient clipping at $\|g\|_2 \leq 1$, effective batch 128 at sequence length 1024, 22,888 gradient updates.
  • Hardware: single RTX 4060 Ti (16 GB), ~10.7 h wall-clock for this cell.
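The masking objective and learning-rate schedule listed above can be sketched in a few lines. This is an illustration under stated assumptions, not the training code: per-token Bernoulli selection at the 30% ratio, `-100` as the "no loss" label (the PyTorch ignore-index convention), linear warmup, and a floor equal to 10% of the peak are all assumptions. `mask_tokens`, `lr_at`, and the `mask_id` value are hypothetical.

```python
import math
import random


def mask_tokens(ids, vocab_size, mask_id, mask_ratio=0.30, rng=None):
    """Apply the 80/10/10 BERT replacement schedule to a token-id list.

    Returns (corrupted_ids, labels); labels are -100 where no loss applies.
    """
    rng = rng or random.Random()
    corrupted, labels = list(ids), [-100] * len(ids)
    for i, tok in enumerate(ids):
        if rng.random() >= mask_ratio:
            continue                      # position not selected for prediction
        labels[i] = tok                   # the model must recover the original
        r = rng.random()
        if r < 0.8:
            corrupted[i] = mask_id        # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: leave the original token in place
    return corrupted, labels


def lr_at(step, peak=5e-4, warmup=2000, total=22888, floor_frac=0.10):
    """Linear warmup (assumed), then cosine decay to floor_frac * peak."""
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    floor = peak * floor_frac
    return floor + (peak - floor) * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    ids = list(range(5, 1005))
    corrupted, labels = mask_tokens(ids, 16384, mask_id=4, rng=random.Random(0))
    print(sum(l != -100 for l in labels), "positions selected")
    print(lr_at(0), lr_at(2000), lr_at(22888))
```

A production sampler would also exclude special tokens from selection and from random replacement; that detail is omitted here for brevity.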

Caveats

  • This is one cell of a 28-cell parent cube (36 released cells including the 8-cell short-sequence extension). Within-cube comparisons in the paper carry bootstrap CIs at n=2 seeds; some marginal orderings (byte vs bpe-16k at the largest size) are within seed noise and should be read as ties.
  • The training corpus is one 33 GB stratified multi-source binary sample. Results may not transfer to substantially different corpora.
  • All numbers are computed on data labelled by a single pipeline (libmagic-pinned). The absence of cross-validation against PRONOM, Siegfried, DROID, or IANA reference files is a documented limitation.
  • CPU latency at the medium size is ~155× slower than Magika v1.1 on a desktop CPU (hardware-dependent). For sub-millisecond whole-file triage on broad categories, Magika is purpose-built and is the right tool. MimeLens occupies a different point on the deployment surface (position-arbitrary inputs + libmagic's 125-class taxonomy); it is not a drop-in replacement.
  • End-to-end fine-tuning on the production label distribution may shift these numbers and should be evaluated before deployment. The frozen-probe numbers above are not claimed as a lower bound on fine-tuned performance.

Citation

@misc{bommarito2026mimelens,
  title  = {MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments},
  author = {Bommarito II, Michael J.},
  year   = {2026},
  note   = {https://github.com/mjbommar/mimelens-training},
}