MimeLens-001
Pretrained encoders for position-agnostic file-content-type detection from a byte window taken at any offset.
A family of 36 small (3.15 to 37.8 M backbone parameter) BERT-style encoders pretrained MLM-only on 33 GB of heterogeneous binary content (binary-30k, magic-corpus extracts, packed binaries, a glaurung-sourced binary corpus, and Windows drivers) for classification under libmagic's 125-class MIME taxonomy. 28 parent-cube cells at seq_len=1024 (4 KB byte windows) plus an 8-cell short-sequence extension at seq_len=256 (1 KB byte windows) sized for sub-MTU packets, DNS payloads, and small forensic fragments.
Training samples 1024-token windows uniformly at random across files and 64 KB fragments, with no privileged "head-of-file" position. A single checkpoint classifies a byte window taken at any offset: a streaming HTTP body before upload completes, a forensic-carved fragment with no recoverable header, a random seek into a multi-gigabyte container, or a packet payload inspected mid-stream.
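The sampling idea above can be illustrated with a minimal sketch. It draws a fixed-size window at a uniformly random offset inside one buffer and, for the byte vocabulary, uses each byte's value as its token id. The function names, the 1,022-byte default (the byte cell's reported coverage), and the omission of special tokens are assumptions for illustration; the actual data layer, which also samples across files and 64 KB fragments, is in the training repository.

```python
import random
from typing import List, Optional


def sample_window(data: bytes, window: int = 1022,
                  rng: Optional[random.Random] = None) -> bytes:
    """Return `window` bytes starting at a uniformly random offset.

    Inputs shorter than `window` are returned whole (hypothetical policy).
    """
    rng = rng or random.Random()
    if len(data) <= window:
        return data
    # Every valid start offset is equally likely; offset 0 is not special.
    start = rng.randrange(len(data) - window + 1)
    return data[start:start + window]


def byte_token_ids(chunk: bytes) -> List[int]:
    """Byte-level ids: each byte maps to its own value (0..255)."""
    return list(chunk)


blob = bytes(range(256)) * 64            # 16 KB of stand-in content
win = sample_window(blob, rng=random.Random(0))
ids = byte_token_ids(win)
print(len(ids), min(ids), max(ids))
```

Because the start offset is uniform, a model trained on such windows never sees a guaranteed header, which is the property the paragraph above relies on.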
Existing learned classifiers (Magika) and signature-based tools (libmagic, TrID, Siegfried/DROID) are designed, and in the learned case trained, for whole-file access at a known offset. MimeLens fills a different point on the deployment surface: libmagic's 125-class taxonomy plus position-arbitrary inputs, at the cost of CPU latency (roughly two orders of magnitude slower than Magika at the medium size; hardware-dependent).

Paper: MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments (Bommarito 2026).

Training code: model, MLM pretraining loop, and data layer (reference implementation).
Which cell should you use?
What's your input?
├── A complete file and sub-millisecond CPU latency matters
│   └── Magika is purpose-built for this. Reach for MimeLens only if
│       libmagic's 125-class taxonomy is required.
│
├── A partial / streaming / packet-payload / random-offset chunk (≥ 1 KB)
│   └── mimelens-001-medium-byte-s1 (saturates from a single 1.4 KB packet)
│
├── A sub-KB chunk (sub-MTU packet, DNS payload, small fragment, ≤ 1 KB total)
│   ├── mimelens-001-medium-bpe-64k-s1-seq256 (best short-seq accuracy: 0.985 / 0.981 at 4 KB / 256 B head; ONNX fp32 + int8 bundled)
│   ├── mimelens-001-medium-bpe-16k-s1-seq256 (smaller weights; ONNX fp32 + int8 bundled; ~10× faster end-to-end than parent cube)
│   └── mimelens-001-medium-byte-s1-seq256 (ONNX fp32 + int8 bundled; natural pick for cleartext UDP / streaming where a single packet ~= 256 B of body)
│
├── A clean 4 KB head and you want libmagic-style 125-class MIME labels
│   ├── mimelens-001-medium-bpe-16k-s1 (recommended default; balances accuracy + adversarial robustness)
│   └── mimelens-001-medium-byte-s1 (essentially tied under clean conditions)
│
├── Header-corruption-prone inputs (packed binaries, truncated streams)
│   └── mimelens-001-medium-bpe-64k-s1 (most robust under header-perturbation attack)
│
└── Latency-bounded large-scale indexing (offline, batched)
    └── mimelens-001-medium-bpe-64k-s1 (~2.09× throughput over the byte cell at same hardware)
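The selection guide above can be turned into a small helper. This is a sketch, not part of the released code: the function name, its flags, the priority order between branches, and the choice to treat exactly 1 KB as "sub-KB" are all assumptions. The Magika branch is left out because it points to a different tool, and each branch returns only the first cell listed in the guide.

```python
PREFIX = "mimelens-001-medium-"


def pick_cell(chunk_bytes: int, *, clean_head: bool = False,
              corrupt_header_risk: bool = False,
              batch_indexing: bool = False) -> str:
    """Map a rough input description to one cell name from the guide.

    The branch order is a hypothetical tie-break; the guide itself
    does not rank overlapping cases.
    """
    if chunk_bytes <= 1024:
        # Sub-KB inputs: short-sequence extension, best reported accuracy.
        return PREFIX + "bpe-64k-s1-seq256"
    if corrupt_header_risk or batch_indexing:
        # Most robust under header perturbation, highest throughput.
        return PREFIX + "bpe-64k-s1"
    if clean_head:
        # Recommended default for a clean 4 KB head.
        return PREFIX + "bpe-16k-s1"
    # Partial, streaming, or random-offset chunk of at least 1 KB.
    return PREFIX + "byte-s1"


print(pick_cell(1448))                      # one UDP payload
print(pick_cell(4096, clean_head=True))     # clean 4 KB head
print(pick_cell(200))                       # small fragment
```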
Three deployed cells (the headline)
These are the three cells the paper presents as the deployable system. Each is one configuration of the medium tier (37.76 M backbone parameters, 12 layers, hidden 512, 8 heads). They differ only in input pipeline.
| Cell | Vocab | Bytes covered (of 4 KB) | Repo |
|---|---|---|---|
| medium/byte/s1 | 256-byte (id == byte value) | ≈ 1.0 KB (1,022 B) | mjbommar/mimelens-001-medium-byte-s1 |
| medium/bpe-16k/s1 | binary-BPE 16,384 | ≈ 1.8 KB (1,765 B) | mjbommar/mimelens-001-medium-bpe-16k-s1 (ONNX bundled) |
| medium/bpe-64k/s1 | binary-BPE 65,536 | ≈ 2.1 KB (2,134 B) | mjbommar/mimelens-001-medium-bpe-64k-s1 |
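The coverage column can be read as a compression ratio relative to the byte cell. The short calculation below uses only the byte counts from the table; treating "4 KB" as 4,096 bytes is an assumption.

```python
# Bytes of a 4 KB head that fit in one 1024-token window, per the table.
covered = {"byte": 1022, "bpe-16k": 1765, "bpe-64k": 2134}

HEAD_BYTES = 4096  # assumption: "4 KB" means 4,096 bytes

# Coverage relative to the byte cell, and as a fraction of the head.
ratio_vs_byte = {k: round(v / covered["byte"], 2) for k, v in covered.items()}
fraction_of_head = {k: round(v / HEAD_BYTES, 3) for k, v in covered.items()}

for name in covered:
    print(f"{name:8s} x{ratio_vs_byte[name]:.2f} vs byte, "
          f"{fraction_of_head[name]:.1%} of the head")
```

The bpe-64k ratio comes out at about 2.09, in line with the throughput figure quoted for that cell in the selection guide.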
All three load via AutoModel.from_pretrained(..., trust_remote_code=True). See the per-cell READMEs for the copy-pasteable inference snippet.
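A minimal loading sketch follows. The `repo_id` helper and the assumption that every cell name maps to a repository by replacing `/` with `-` are illustrative; the mapping matches the three repositories listed above but is not confirmed for all cells. `transformers` is imported lazily so the helper can be used without it installed.

```python
def repo_id(cell: str) -> str:
    """Turn a cell name such as 'medium/byte/s1' into a repo id.

    Assumed naming pattern, inferred from the table above.
    """
    return "mjbommar/mimelens-001-" + cell.replace("/", "-")


def load_cell(cell: str):
    """Load one cell's backbone; downloads weights on first use."""
    from transformers import AutoModel  # lazy import, needs `transformers`

    # trust_remote_code=True runs model code shipped in the repo,
    # so review that code before enabling it in production.
    return AutoModel.from_pretrained(repo_id(cell), trust_remote_code=True)


print(repo_id("medium/bpe-16k/s1"))
```

Tokenization and the classification head are not shown here; the per-cell READMEs carry the full inference snippet.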
All released cells
Every cell of the 3 × 4 × {2,3} factorial parent cube is published, plus an 8-cell short-sequence extension at the medium tier (seq_len=256) and one matched-tokens ablation. Numbers in the parent-cube tables are each cell's magic-frags 4 KB-head top-1 / macro-F1 / kNN R@1, the within-cube benchmark applied identically to all 28 parent-cube cells (short-sequence numbers use magic-files probe-fit; see that subsection). The medium/bpe-16k/s1 headline calibration numbers against Magika (0.828 strict / 0.829 aligned / 0.927 top-level on the magic-files n=1,024 held-out split) are shown in Headline findings below.
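The cube layout can be enumerated to check the cell counts. The sketch below assumes the seed counts shown in the tables that follow (three seeds at medium, two at small and tiny); the variable names are illustrative.

```python
from itertools import product

VOCABS = ["byte", "bpe-4k", "bpe-16k", "bpe-64k"]
SEEDS_PER_SIZE = {"tiny": 2, "small": 2, "medium": 3}

# Parent cube: every size x vocab x seed combination at seq_len=1024.
parent = [
    f"{size}/{vocab}/s{seed}"
    for size, n_seeds in SEEDS_PER_SIZE.items()
    for vocab in VOCABS
    for seed in range(1, n_seeds + 1)
]

# Short-sequence extension: medium tier only, two seeds, seq_len=256.
short = [f"medium/{vocab}/s{seed}-seq256"
         for vocab, seed in product(VOCABS, (1, 2))]

print(len(parent), len(short), len(parent) + len(short))  # 28 8 36
```

The matched-tokens ablation is listed separately and is not counted in the 36.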
medium (37.76 M backbone params; the recommended size for deployment)
| Cell | top-1 | macro-F1 | kNN R@1 | Repo |
|---|---|---|---|---|
| medium/byte/s1 | 0.813 | 0.686 | 0.760 | link |
| medium/byte/s2 | 0.787 | 0.654 | 0.723 | link |
| medium/byte/s3 | 0.798 | 0.640 | 0.733 | link |
| medium/bpe-4k/s1 | 0.793 | 0.678 | 0.698 | link |
| medium/bpe-4k/s2 | 0.781 | 0.655 | 0.691 | link |
| medium/bpe-4k/s3 | 0.775 | 0.639 | 0.683 | link |
| medium/bpe-16k/s1 | 0.799 | 0.637 | 0.699 | link (ONNX) |
| medium/bpe-16k/s2 | 0.809 | 0.683 | 0.717 | link |
| medium/bpe-16k/s3 | 0.817 | 0.681 | 0.707 | link |
| medium/bpe-64k/s1 | 0.727 | 0.624 | 0.662 | link |
| medium/bpe-64k/s2 | 0.748 | 0.627 | 0.660 | link |
| medium/bpe-64k/s3 | 0.740 | 0.598 | 0.675 | link |
| medium/bpe-64k/matched-tokens | (ablation) | n/a | n/a | link (47,808-step matched-tokens-seen control; see paper Appendix A) |
small (14.16 M backbone params)
| Cell | top-1 | macro-F1 | kNN R@1 | Repo |
|---|---|---|---|---|
| small/byte/s1 | 0.776 | 0.623 | 0.719 | link |
| small/byte/s2 | 0.766 | 0.617 | 0.719 | link |
| small/bpe-4k/s1 | 0.759 | 0.646 | 0.706 | link |
| small/bpe-4k/s2 | 0.756 | 0.625 | 0.685 | link |
| small/bpe-16k/s1 | 0.747 | 0.630 | 0.692 | link |
| small/bpe-16k/s2 | 0.756 | 0.639 | 0.688 | link |
| small/bpe-64k/s1 | 0.797 | 0.698 | 0.719 | link |
| small/bpe-64k/s2 | 0.787 | 0.619 | 0.710 | link |
tiny (3.15 M backbone params; scaling-study completeness, not recommended for deployment)
| Cell | top-1 | macro-F1 | kNN R@1 | Repo |
|---|---|---|---|---|
| tiny/byte/s1 | 0.740 | 0.602 | 0.685 | link |
| tiny/byte/s2 | 0.739 | 0.593 | 0.679 | link |
| tiny/bpe-4k/s1 | 0.726 | 0.608 | 0.674 | link |
| tiny/bpe-4k/s2 | 0.739 | 0.620 | 0.687 | link |
| tiny/bpe-16k/s1 | 0.757 | 0.637 | 0.697 | link |
| tiny/bpe-16k/s2 | 0.737 | 0.646 | 0.679 | link |
| tiny/bpe-64k/s1 | 0.715 | 0.620 | 0.671 | link |
| tiny/bpe-64k/s2 | 0.732 | 0.609 | 0.675 | link |
medium short-sequence (seq_len=256, for sub-MTU packets and small forensic fragments)
Matched-steps to the parent cube (22,888 gradient updates, same architecture, optimizer, schedule). Numbers are magic-files 4 KB-head probe-fit top-1 (left) and 256 B-head probe-fit top-1 (right; the design regime). BPE cells preserve or exceed parent accuracy at 4× lower per-step token budget; the byte cell pays ~1 pp.

| Cell | 4 KB head | 256 B head | Repo |
|---|---|---|---|
| medium/byte/s1-seq256 | 0.947 | 0.947 | link (ONNX bundled) |
| medium/byte/s2-seq256 | 0.943 | 0.943 | link |
| medium/bpe-4k/s1-seq256 | 0.971 | 0.967 | link |
| medium/bpe-4k/s2-seq256 | 0.972 | 0.967 | link |
| medium/bpe-16k/s1-seq256 | 0.980 | 0.974 | link (ONNX bundled) |
| medium/bpe-16k/s2-seq256 | 0.981 | 0.975 | link |
| medium/bpe-64k/s1-seq256 | 0.987 | 0.983 | link (ONNX bundled) |
| medium/bpe-64k/s2-seq256 | 0.986 | 0.979 | link |
Per-vocab seed means at 4 KB head: byte 0.945, bpe-4k 0.971, bpe-16k 0.980, bpe-64k 0.987 (vs parent cube 0.955 / 0.973 / 0.977 / 0.975 at the same probe-fit metric, matched-steps).
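The seed means and the comparison against the parent cube can be recomputed from the figures quoted here. This is an arithmetic sketch only: table entries are already rounded to three decimals, so the recomputed means are expected to agree with the reported ones to about 0.001, not exactly.

```python
from statistics import mean

# 4 KB-head probe-fit top-1 per seed, from the short-sequence table.
seq256_4kb = {
    "byte": [0.947, 0.943],
    "bpe-4k": [0.971, 0.972],
    "bpe-16k": [0.980, 0.981],
    "bpe-64k": [0.987, 0.986],
}
reported_mean = {"byte": 0.945, "bpe-4k": 0.971,
                 "bpe-16k": 0.980, "bpe-64k": 0.987}
parent_mean = {"byte": 0.955, "bpe-4k": 0.973,
               "bpe-16k": 0.977, "bpe-64k": 0.975}

# Difference to the parent cube in percentage points.
delta_pp = {v: round((reported_mean[v] - parent_mean[v]) * 100, 1)
            for v in reported_mean}

for vocab, scores in seq256_4kb.items():
    print(f"{vocab:8s} table mean {mean(scores):.4f}  "
          f"reported {reported_mean[vocab]:.3f}  "
          f"vs parent {delta_pp[vocab]:+.1f} pp")
```

On these figures the byte cell sits 1.0 pp below its parent, bpe-4k 0.2 pp below, and bpe-16k and bpe-64k 0.3 pp and 1.2 pp above.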
Headline findings (from the paper)
Calibrated against Magika v1.1 on the same n=1,024 held-out 4,096-file split, libmagic-pinned ground truth:
- medium/bpe-16k/s1 exceeds Magika at every level of stringency. Strict top-1: 0.828 vs 0.653 (+17.5 pp). Aligned under a curated 21-class equivalence map applied symmetrically to both systems: 0.829 vs 0.722 (+10.7 pp). Top-level (text vs image vs application vs …): 0.927 vs 0.840 (+8.7 pp). The aligned gap is the residual under this map on this corpus; what would persist under a hypothetically retrained Magika is open.
- Adversarial header perturbation. Under directed perturbations of the 4 KB head (zero-first-{4,16,64} bytes, random-first-4), bpe-64k drops the least (2 to 7 pp). byte and bpe-16k drop 2 to 16 pp. The worst clean cell is the most robust under attack.
- Real captured UDP traffic (Section 5 of the paper): 500 files transmitted as UDP datagrams (1448 B payload) over loopback, captured with tcpdump, classified cumulatively from the raw pcap (n=498 streams after capture-layer drops). medium/byte/s1 reaches 0.855 top-1 from a single 1.4 KB packet and is flat through K=all (a 1024-token byte cell already consumes the full first packet). For comparison on the same captured bytes: Magika 0.618 entire stream, libmagic 0.791 at K=all, TrID 0.729 self-consistency at K=all (TrID's label space is not directly comparable to libmagic-125; see paper).
- Truly random-offset disk-block classification (Section 6): a 1 GB unmounted ext4 image populated with 3,066 MIME-balanced files; 1,000 random 4 KB block reads. On 980 mid-file blocks, all three medium cells exceed both libmagic and Magika with non-overlapping file-level cluster-bootstrap CIs: medium/bpe-64k/s1 0.266, medium/bpe-16k/s1 0.220, medium/byte/s1 0.219, vs libmagic 0.093 / Magika 0.112. Replicates across 9 matrix cells (ext4 × 4 init-strategies + 2 size-stratified sub-cells; NTFS × 3 init-strategies).
- CPU latency. Idle Intel i9-12900K, single sample, p50: parent-cube medium/bpe-16k/s1 202 ms (PyTorch fp32), 382 ms (ONNX dynamic int8); Magika v1.1 1.3 ms (155×). Dynamic int8 is slower than fp32 at seq_len=1024 on two AVX-VNNI CPUs (an i9-12900K and a Ryzen 7 7840HS); this is intrinsic to dynamic quantization at this context length, not a missing-VNNI effect. At seq_len=256 int8 is faster. The short-sequence medium/*-seq256 int8 cells drop to 26 ms (20× Magika per sample). Latency is hardware-dependent. MimeLens occupies a different point on the deployment surface than Magika, not a drop-in replacement.
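The percentage-point gaps and latency multiples quoted above follow directly from the reported figures. The sketch below only redoes that arithmetic; the dictionary keys are illustrative labels.

```python
# (MimeLens medium/bpe-16k/s1, Magika) accuracy at each stringency level.
accuracy = {
    "strict": (0.828, 0.653),
    "aligned": (0.829, 0.722),
    "top-level": (0.927, 0.840),
}
gap_pp = {level: round((ours - magika) * 100, 1)
          for level, (ours, magika) in accuracy.items()}

# Single-sample p50 CPU latency in milliseconds, as reported.
latency_ms = {
    "mimelens fp32 seq1024": 202,
    "mimelens int8 seq1024": 382,
    "mimelens int8 seq256": 26,
}
MAGIKA_MS = 1.3
slowdown = {name: round(ms / MAGIKA_MS) for name, ms in latency_ms.items()}

print(gap_pp)
print(slowdown)
```

The fp32 and seq256 int8 entries reproduce the 155× and 20× multiples; the seq_len=1024 int8 figure works out to roughly 294× on the same basis.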
Full evaluation (within-cube bootstrap CIs at n=3 medium seeds, calibration, per-class breakdown, network curves, baseline comparisons against libmagic 5.46 and TrID 2.24, byte-coverage matched ablation) is in the paper.
Released artifacts beyond model weights
- Training code: https://github.com/mjbommar/mimelens-training (Rust + Python data layer, model, and MLM training stack; a reference implementation, since the 33 GB corpus is not redistributable).
- Tokenizers (canonical source): mjbommar/binary-tokenizer-001-{4k,8k,16k,32k,64k} HuggingFace repos.
- Paper: MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments (Bommarito 2026); see the per-cell model-card citation.
Citation
@misc{bommarito2026mimelens,
title = {MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments},
author = {Bommarito II, Michael J.},
year = {2026},
note = {https://github.com/mjbommar/mimelens-training},
}
License
MIT.