MimeLens-001

Pretrained encoders for position-agnostic file-content-type detection — from a byte window taken at any offset.

A family of 36 small (3.15–37.8 M backbone parameters) BERT-style encoders pretrained MLM-only on 33 GB of heterogeneous binary content (binary-30k, magic-corpus extracts, packed binaries, a glaurung-sourced binary corpus, and Windows drivers) for classification under libmagic's 125-class MIME taxonomy. The family comprises 28 parent-cube cells at seq_len=1024 (4 KB byte windows) plus an 8-cell short-sequence extension at seq_len=256 (1 KB byte windows) sized for sub-MTU packets, DNS payloads, and small forensic fragments.

Training samples 1024-token windows uniformly at random across files and 64 KB fragments, with no privileged "head-of-file" position. A single checkpoint classifies a byte window taken at any offset: a streaming HTTP body before upload completes, a forensic-carved fragment with no recoverable header, a random seek into a multi-gigabyte container, or a packet payload inspected mid-stream.
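
A minimal sketch of the position-agnostic input described above: cut a fixed-size byte window at a uniformly random offset, then map it to byte-level token ids. The helper names (`random_window`, `byte_token_ids`) are hypothetical and not part of the released code; the actual training sampler, including how 64 KB fragments and any special tokens are handled, may differ.

```python
import random
from typing import List, Optional

WINDOW_BYTES = 4096  # 4 KB byte window, matching the parent-cube input size


def random_window(data: bytes, size: int = WINDOW_BYTES,
                  rng: Optional[random.Random] = None) -> bytes:
    """Return `size` bytes starting at a uniformly random offset in `data`."""
    rng = rng or random.Random()
    if len(data) <= size:
        return data  # inputs shorter than the window are returned whole
    start = rng.randrange(len(data) - size + 1)  # every valid offset is equally likely
    return data[start:start + size]


def byte_token_ids(window: bytes) -> List[int]:
    """Byte-level tokenization where each token id equals the byte value.

    Special tokens are omitted here; that simplification is an assumption.
    """
    return list(window)
```

Because no offset is privileged, the same function serves a streaming body, a carved fragment, or a seek into a large container: only `data` changes.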

Existing tools, whether learned classifiers (Magika) or signature-based detectors (libmagic, TrID, Siegfried/DROID), are designed for whole-file access at a known offset. MimeLens fills a different point on the deployment surface: libmagic's 125-class taxonomy plus position-arbitrary inputs, at the cost of CPU latency (roughly two orders of magnitude slower than Magika at the medium size; hardware-dependent).

📄 Paper: MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments (Bommarito 2026).
💻 Training code: model, MLM pretraining loop, and data layer (reference implementation).

Which cell should you use?

What's your input?

├── A complete file and sub-millisecond CPU latency matters
│   └─→ Magika is purpose-built for this. Reach for MimeLens only if
│       libmagic's 125-class taxonomy is required.
│
├── A partial / streaming / packet-payload / random-offset chunk (≥ 1 KB)
│   └─→ mimelens-001-medium-byte-s1     (saturates from a single 1.4 KB packet)
│
├── A sub-KB chunk (sub-MTU packet, DNS payload, small fragment, ≤ 1 KB total)
│   ├─→ mimelens-001-medium-bpe-64k-s1-seq256  (best short-seq accuracy: 0.987 / 0.983 at 4 KB / 256 B head; ONNX fp32 + int8 bundled)
│   ├─→ mimelens-001-medium-bpe-16k-s1-seq256  (smaller weights; ONNX fp32 + int8 bundled; ~10× faster end-to-end than parent cube)
│   └─→ mimelens-001-medium-byte-s1-seq256     (ONNX fp32 + int8 bundled; natural pick for cleartext UDP / streaming where a single packet ~= 256 B of body)
│
├── A clean 4 KB head and you want libmagic-style 125-class MIME labels
│   ├─→ mimelens-001-medium-bpe-16k-s1  (recommended default; balances accuracy + adversarial robustness)
│   └─→ mimelens-001-medium-byte-s1     (essentially tied under clean conditions)
│
├── Header-corruption-prone inputs (packed binaries, truncated streams)
│   └─→ mimelens-001-medium-bpe-64k-s1  (most robust under header-perturbation attack)
│
└── Latency-bounded large-scale indexing (offline, batched)
    └─→ mimelens-001-medium-bpe-64k-s1  (~2.09× throughput over the byte cell at same hardware)
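
The tree above can be sketched as a small selection helper. `pick_cell` and its parameters are hypothetical, and the precedence between branches (for example, input size checked before header corruption) is an assumption, since the tree does not rank its branches. The Magika branch is left out because it does not select a MimeLens cell, and the sub-KB branch returns only the highest-accuracy option of the three listed.

```python
def pick_cell(n_bytes: int, *, clean_head: bool = False,
              header_corruption: bool = False,
              batch_indexing: bool = False) -> str:
    """Map an input description to one of the repo names in the tree."""
    prefix = "mimelens-001-medium-"
    if n_bytes < 1024:
        # Sub-KB chunk: short-sequence extension, best short-seq accuracy.
        return prefix + "bpe-64k-s1-seq256"
    if header_corruption or batch_indexing:
        # Most robust under header perturbation, and highest throughput.
        return prefix + "bpe-64k-s1"
    if clean_head and n_bytes >= 4096:
        # Clean 4 KB head: the recommended default.
        return prefix + "bpe-16k-s1"
    # Partial, streaming, or random-offset chunk of at least 1 KB.
    return prefix + "byte-s1"
```

For example, `pick_cell(1448)` selects the byte cell for a single 1448 B UDP payload, while `pick_cell(4096, clean_head=True)` selects the recommended default.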

Three deployed cells (the headline)

These are the three cells the paper presents as the deployable system. Each is one configuration of the medium tier (37.76 M backbone parameters, 12 layers, hidden 512, 8 heads). They differ only in input pipeline.

Cell	Vocab	Bytes covered (of 4 KB)	Repo
`medium/byte/s1`	256-byte (id == byte value)	≈ 1.0 KB (1,022 B)	`mjbommar/mimelens-001-medium-byte-s1`
`medium/bpe-16k/s1`	binary-BPE 16,384	≈ 1.8 KB (1,765 B)	`mjbommar/mimelens-001-medium-bpe-16k-s1` (ONNX bundled)
`medium/bpe-64k/s1`	binary-BPE 65,536	≈ 2.1 KB (2,134 B)	`mjbommar/mimelens-001-medium-bpe-64k-s1`
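
As a quick check on the "Bytes covered" column, the sketch below computes how many more bytes each BPE cell sees per 1024-token window than the byte cell. The byte counts are copied from the table; `coverage_ratio` is a hypothetical helper. The bpe-64k ratio comes out close to the ~2.09× throughput figure quoted in the decision tree, but whether that figure was derived this way is an assumption.

```python
# Bytes of a 4 KB head covered by one 1024-token window (from the table above).
BYTES_COVERED = {"byte": 1022, "bpe-16k": 1765, "bpe-64k": 2134}


def coverage_ratio(vocab: str, baseline: str = "byte") -> float:
    """Bytes covered by `vocab` relative to `baseline` for the same token budget."""
    return BYTES_COVERED[vocab] / BYTES_COVERED[baseline]


for vocab in BYTES_COVERED:
    print(f"{vocab}: {coverage_ratio(vocab):.2f}x the byte cell's coverage")
```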

All three load via AutoModel.from_pretrained(..., trust_remote_code=True). See the per-cell READMEs for the copy-pasteable inference snippet.
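
A minimal loading sketch, assuming the `transformers` package is installed. `repo_id` and `load_encoder` are hypothetical helpers; the naming pattern is inferred from the three repos listed above, and its validity for other cells is an assumption. Tokenization and the classification head are not shown because this page does not specify them; the per-cell READMEs remain the authoritative snippet.

```python
def repo_id(cell: str) -> str:
    """Turn a cell name such as 'medium/bpe-16k/s1' into a Hub repo id."""
    return "mjbommar/mimelens-001-" + cell.replace("/", "-")


def load_encoder(cell: str):
    """Load the pretrained encoder for `cell` from the Hugging Face Hub."""
    # Imported lazily so repo_id() works without transformers installed.
    from transformers import AutoModel

    # trust_remote_code=True executes model code shipped in the repo,
    # so review that code before loading it in a sensitive environment.
    return AutoModel.from_pretrained(repo_id(cell), trust_remote_code=True)
```

Calling `load_encoder("medium/bpe-16k/s1")` downloads the weights on first use, so it needs network access and local disk space.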

All released cells

Every cell of the 3 × 4 × {2,3} factorial parent cube (3 sizes × 4 vocabularies × 2 or 3 seeds) is published, plus an 8-cell short-sequence extension at the medium tier (seq_len=256) and one matched-tokens ablation. Numbers in the parent-cube tables are each cell's magic-frags 4 KB-head top-1, macro-F1, and kNN R@1. This within-cube benchmark is applied identically to all 28 parent-cube cells; short-sequence numbers use magic-files probe-fit instead (see that subsection). The medium/bpe-16k/s1 headline calibration numbers against Magika (0.828 strict / 0.829 aligned / 0.927 top-level on the magic-files n=1,024 held-out split) are shown in Headline findings below.

medium (37.76 M backbone params; the recommended size for deployment)

Cell	top-1	macro-F1	kNN R@1	Repo
`medium/byte/s1`	0.813	0.686	0.760	link
`medium/byte/s2`	0.787	0.654	0.723	link
`medium/byte/s3`	0.798	0.640	0.733	link
`medium/bpe-4k/s1`	0.793	0.678	0.698	link
`medium/bpe-4k/s2`	0.781	0.655	0.691	link
`medium/bpe-4k/s3`	0.775	0.639	0.683	link
`medium/bpe-16k/s1`	0.799	0.637	0.699	link (ONNX)
`medium/bpe-16k/s2`	0.809	0.683	0.717	link
`medium/bpe-16k/s3`	0.817	0.681	0.707	link
`medium/bpe-64k/s1`	0.727	0.624	0.662	link
`medium/bpe-64k/s2`	0.748	0.627	0.660	link
`medium/bpe-64k/s3`	0.740	0.598	0.675	link
`medium/bpe-64k/matched-tokens`	(ablation)	—	—	link (47,808-step matched-tokens-seen control; see paper Appendix A)

small (14.16 M backbone params)

Cell	top-1	macro-F1	kNN R@1	Repo
`small/byte/s1`	0.776	0.623	0.719	link
`small/byte/s2`	0.766	0.617	0.719	link
`small/bpe-4k/s1`	0.759	0.646	0.706	link
`small/bpe-4k/s2`	0.756	0.625	0.685	link
`small/bpe-16k/s1`	0.747	0.630	0.692	link
`small/bpe-16k/s2`	0.756	0.639	0.688	link
`small/bpe-64k/s1`	0.797	0.698	0.719	link
`small/bpe-64k/s2`	0.787	0.619	0.710	link

tiny (3.15 M backbone params; scaling-study completeness, not recommended for deployment)

Cell	top-1	macro-F1	kNN R@1	Repo
`tiny/byte/s1`	0.740	0.602	0.685	link
`tiny/byte/s2`	0.739	0.593	0.679	link
`tiny/bpe-4k/s1`	0.726	0.608	0.674	link
`tiny/bpe-4k/s2`	0.739	0.620	0.687	link
`tiny/bpe-16k/s1`	0.757	0.637	0.697	link
`tiny/bpe-16k/s2`	0.737	0.646	0.679	link
`tiny/bpe-64k/s1`	0.715	0.620	0.671	link
`tiny/bpe-64k/s2`	0.732	0.609	0.675	link

medium short-sequence (seq_len=256, for sub-MTU packets and small forensic fragments)

These cells are step-matched to the parent cube (22,888 gradient updates; same architecture, optimizer, and schedule). Numbers are magic-files 4 KB-head probe-fit top-1 (left) and 256 B-head probe-fit top-1 (right; the design regime). BPE cells match parent accuracy to within 0.2 pp or exceed it at a 4× lower per-step token budget; the byte cell pays ~1 pp.

Cell	4 KB head	256 B head	Repo
`medium/byte/s1-seq256`	0.947	0.947	link (ONNX bundled)
`medium/byte/s2-seq256`	0.943	0.943	link
`medium/bpe-4k/s1-seq256`	0.971	0.967	link
`medium/bpe-4k/s2-seq256`	0.972	0.967	link
`medium/bpe-16k/s1-seq256`	0.980	0.974	link (ONNX bundled)
`medium/bpe-16k/s2-seq256`	0.981	0.975	link
`medium/bpe-64k/s1-seq256`	0.987	0.983	link (ONNX bundled)
`medium/bpe-64k/s2-seq256`	0.986	0.979	link

Per-vocab seed means at 4 KB head: byte 0.945, bpe-4k 0.971, bpe-16k 0.980, bpe-64k 0.987 (vs parent cube 0.955 / 0.973 / 0.977 / 0.975 at the same probe-fit metric, matched-steps).
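
The per-vocab seed means can be recomputed from the short-sequence table. The sketch below copies the 4 KB-head column and averages the two seeds per vocabulary; the results agree with the reported means to within rounding.

```python
from statistics import mean

# 4 KB-head probe-fit top-1 for seeds s1 and s2, from the seq_len=256 table.
SEQ256_4KB_HEAD = {
    "byte": [0.947, 0.943],
    "bpe-4k": [0.971, 0.972],
    "bpe-16k": [0.980, 0.981],
    "bpe-64k": [0.987, 0.986],
}

seed_means = {vocab: mean(scores) for vocab, scores in SEQ256_4KB_HEAD.items()}

for vocab, value in seed_means.items():
    print(f"{vocab}: {value:.4f}")
```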

Headline findings (from the paper)

Calibrated against Magika v1.1 on the same n=1,024 held-out 4,096-file split, libmagic-pinned ground truth: medium/bpe-16k/s1 exceeds Magika at every level of stringency. Strict top-1: 0.828 vs 0.653 (+17.5 pp). Aligned under a curated 21-class equivalence map applied symmetrically to both systems: 0.829 vs 0.722 (+10.7 pp). Top-level (text vs image vs application vs …): 0.927 vs 0.840 (+8.7 pp). The aligned gap is the residual under this map on this corpus; what would persist under a hypothetically retrained Magika is open.
Adversarial header perturbation. Under directed perturbations of the 4 KB head (zero-first-{4,16,64} bytes, random-first-4), bpe-64k drops the least (2–7 pp). byte and bpe-16k drop 2–16 pp. The worst clean cell is the most robust under attack.
Real captured UDP traffic (Section 5 of the paper): 500 files transmitted as UDP datagrams (1448 B payload) over loopback, captured with tcpdump, classified cumulatively from the raw pcap (n=498 streams after capture-layer drops). medium/byte/s1 reaches 0.855 top-1 from a single 1.4 KB packet and is flat through K=all (a 1024-token byte cell already consumes the full first packet). For comparison on the same captured bytes: Magika 0.618 on the entire stream, libmagic 0.791 at K=all, TrID 0.729 self-consistency at K=all (TrID's label space is not directly comparable to libmagic-125; see paper).
Truly random-offset disk-block classification (Section 6): a 1 GB unmounted ext4 image populated with 3,066 MIME-balanced files; 1,000 random 4 KB block reads. On 980 mid-file blocks, all three medium cells exceed both libmagic and Magika with non-overlapping file-level cluster-bootstrap CIs: medium/bpe-64k/s1 0.266, medium/bpe-16k/s1 0.220, medium/byte/s1 0.219, vs libmagic 0.093 / Magika 0.112. Replicates across 9 matrix cells (ext4 × 4 init-strategies + 2 size-stratified sub-cells; NTFS × 3 init-strategies).
CPU latency. Idle Intel i9-12900K, single sample, p50: parent-cube medium/bpe-16k/s1 202 ms (PyTorch fp32), 382 ms (ONNX dynamic int8); Magika v1.1 ~1.3 ms (the fp32 cell is ~155× slower). Dynamic int8 is slower than fp32 at seq_len=1024 on two AVX-VNNI CPUs (an i9-12900K and a Ryzen 7 7840HS); this is intrinsic to dynamic quantization at this context length, not a missing-VNNI effect, and at seq_len=256 int8 is faster. The short-sequence medium/*-seq256 int8 cells drop to ~26 ms (~20× Magika per sample). Latency is hardware-dependent. MimeLens occupies a different point on the deployment surface than Magika; it is not a drop-in replacement.
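
The header perturbations named in the adversarial finding above (zero-first-{4,16,64} and random-first-4) can be sketched as simple byte-level transforms. The function names are hypothetical, and the paper's exact procedure (for example, how the random bytes are drawn) may differ from this illustration.

```python
import os


def zero_first(window: bytes, n: int) -> bytes:
    """Overwrite the first `n` bytes of the window with zero bytes."""
    n = min(n, len(window))
    return bytes(n) + window[n:]


def random_first(window: bytes, n: int = 4) -> bytes:
    """Overwrite the first `n` bytes of the window with random bytes."""
    n = min(n, len(window))
    return os.urandom(n) + window[n:]


# The four directed perturbations listed in the finding.
PERTURBATIONS = {
    "zero-first-4": lambda w: zero_first(w, 4),
    "zero-first-16": lambda w: zero_first(w, 16),
    "zero-first-64": lambda w: zero_first(w, 64),
    "random-first-4": lambda w: random_first(w, 4),
}
```

Each transform preserves the window length and leaves everything after the overwritten prefix untouched, so any accuracy drop is attributable to the destroyed magic bytes alone.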

Full evaluation (within-cube bootstrap CIs at n=3 medium seeds, calibration, per-class breakdown, network curves, baseline comparisons against libmagic 5.46 and TrID 2.24, byte-coverage matched ablation) is in the paper.
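
For readers unfamiliar with the file-level cluster bootstrap mentioned in the disk-block finding, the sketch below shows one common form: a percentile bootstrap that resamples whole files (clusters) with replacement, so blocks from the same file stay together. This is a generic illustration under that assumption, not the paper's implementation; `cluster_bootstrap_ci` and its defaults are hypothetical.

```python
import random
from collections import defaultdict


def cluster_bootstrap_ci(correct, cluster_ids, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for accuracy, resampling clusters rather than samples.

    correct:     iterable of 0/1 flags, one per classified block
    cluster_ids: iterable of file identifiers, aligned with `correct`
    """
    by_cluster = defaultdict(list)
    for ok, cid in zip(correct, cluster_ids):
        by_cluster[cid].append(ok)
    clusters = list(by_cluster.values())

    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Draw as many clusters as the original data has, with replacement.
        sample = rng.choices(clusters, k=len(clusters))
        hits = sum(sum(c) for c in sample)
        total = sum(len(c) for c in sample)
        stats.append(hits / total)

    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

Resampling at the file level keeps correlated blocks from the same file from artificially narrowing the interval, which is the usual reason to prefer it over a per-block bootstrap.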

Released artifacts beyond model weights

Training code: https://github.com/mjbommar/mimelens-training — Rust + Python data layer, model, and MLM training stack (a reference implementation; the 33 GB corpus is not redistributable).
Tokenizers (canonical source): mjbommar/binary-tokenizer-001-{4k,8k,16k,32k,64k} HuggingFace repos.
Paper: MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments (Bommarito 2026); see the per-cell model-card citation.

Citation

@misc{bommarito2026mimelens,
  title  = {MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments},
  author = {Bommarito II, Michael J.},
  year   = {2026},
  note   = {https://github.com/mjbommar/mimelens-training},
}

License

MIT.
