mimelens-001-medium-bpe-64k-s1

A 37.76M-backbone-parameter BERT-style encoder for position-agnostic file-content-type detection from binary data. It reads a byte window taken from any offset in a file (the first ~1,022 tokens of whatever you pass) and produces a 512-dimensional embedding that classifiers map to one of libmagic's 125 MIME labels. Designed for inputs where you only have a chunk: a forensic-carved fragment, a random disk-block read, a streaming HTTP upload, a single network packet payload.

🔗 Model: mjbommar/mimelens-001-medium-bpe-64k-s1
👥 Family: mjbommar/mimelens-001 (36 released cells: 28 parent + 8 short-sequence)
🔤 Tokenizer: mjbommar/binary-tokenizer-001-64k
📄 Paper: MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments (Bommarito 2026)
💻 Training code: mjbommar/mimelens-training
📊 Pretraining corpus: mjbommar/binary-30k-tokenized plus magic-corpus extracts, packed binaries, a glaurung-sourced binary corpus, and Windows drivers (33 GB stratified; the full corpus is not redistributable)

What MimeLens does

MimeLens classifies file content type from a byte window taken at any offset, not just the header of a complete file.

Existing tools assume whole-file access at a known offset:

libmagic and Apache Tika match handcrafted magic-byte signatures, almost always anchored at the file head.
Magika (Google) is a small (~1 M-parameter) feedforward network over three 512-byte windows (head, middle, tail) of a known-bounded file.
TrID, PRONOM/Siegfried/DROID similarly require a complete file.

These approaches break down on fragments. MimeLens is pretrained MLM-only on 1024-token windows sampled uniformly at random across files and 64 KB fragments, with no privileged head-of-file position. One checkpoint handles streaming, partial-arrival, mid-file, packet-payload, and forensic-carved inputs uniformly. The trade-off is CPU latency (roughly two orders of magnitude slower than Magika at the medium size; hardware-dependent) in exchange for libmagic's 125-class taxonomy and support for inputs at arbitrary positions.
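To make "any offset" concrete, the sketch below draws a fixed-size byte window at a uniformly random offset from a buffer, which is one way to simulate fragment inputs when testing a classifier. It is an illustration only: `random_window` is a hypothetical helper, and it works on bytes, whereas pretraining samples 1024-token windows.

```python
import random


def random_window(data: bytes, size: int, rng: random.Random) -> tuple[int, bytes]:
    """Return (offset, window) with the offset drawn uniformly at random.

    Hypothetical helper: a byte-level stand-in for position-agnostic sampling,
    not the training code's sampler.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if len(data) <= size:
        # Buffer is no longer than the window: return all of it from offset 0.
        return 0, data
    offset = rng.randint(0, len(data) - size)  # randint is inclusive on both ends
    return offset, data[offset:offset + size]


rng = random.Random(0)                 # fixed seed so runs are repeatable
blob = bytes(range(256)) * 64          # 16 KiB of synthetic data
offset, window = random_window(blob, 4096, rng)
print(offset, len(window))
```

Every valid start position is equally likely, so the head of the file gets no special treatment.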

The family ships 28 parent cells (3 sizes × 4 vocabs × 2-3 seeds at seq_len=1024) plus an 8-cell short-sequence extension (medium tier × 4 vocabs × 2 seeds at seq_len=256). This README documents one of them.

Short-sequence sibling available. If your inputs are sub-KB (DNS payloads, sub-MTU packets, small forensic fragments), use mjbommar/mimelens-001-medium-bpe-64k-s1-seq256 instead. Same architecture, 4× shorter context, ~5× lower CPU latency, BPE-cell accuracy ties or beats this cell on the magic-files probe-fit. See paper Appendix B.5.

Overview

This cell: medium tier, bpe-64k input pipeline, seed 1
Backbone: 37.76M parameters (12 layers, hidden 512, 8 attention heads, head dim 64, RoPE, RMSNorm, no biases, no dropout)
Input vocabulary: bpe-64k. 65,536-entry binary BPE tokenizer (binary-tokenizer-001-64k), ~2.09 bytes/token. Reads ~2,134 bytes of the 4 KB buffer.
Output: 512-dim mean-pooled body-token embedding
Label space: libmagic 125-class MIME taxonomy (full list in paper Appendix)
Pretraining: MLM-only, 30% mask ratio, 33 GB stratified multi-source binary corpus, 22,888 gradient updates, single RTX 4060 Ti, ~18.0 h wall-clock
License: MIT
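The "mean-pooled body-token embedding" in the list above can be sketched in plain Python. This is a minimal illustration, assuming "body tokens" means positions that are neither special tokens nor padding; `mean_pool_body` and the 0/1 `body_mask` are hypothetical names, and the released model performs this step internally.

```python
def mean_pool_body(hidden: list[list[float]], body_mask: list[int]) -> list[float]:
    """Average the per-token vectors whose mask entry is 1.

    hidden:    one vector per token position (all the same length)
    body_mask: 1 for body tokens, 0 for special tokens and padding (assumed)
    """
    if len(hidden) != len(body_mask):
        raise ValueError("hidden and body_mask must have the same length")
    kept = [vec for vec, keep in zip(hidden, body_mask) if keep]
    if not kept:
        raise ValueError("no body tokens to pool")
    dim = len(kept[0])
    # Sum each dimension over the kept positions, then divide by their count.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy example: 2-dim vectors stand in for the model's 512-dim hidden states.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]                       # last position is padding
print(mean_pool_body(hidden, mask))
```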

Headline benchmarks (this cell)

Benchmark	Value
MIME-125 top-1 (magic-frags, 4 KB head, n=4,096)	0.727
MIME-125 macro-F1 (magic-frags, 4 KB head)	0.624
kNN R@1 (magic-frags, 3,147-file gallery / 949 queries)	0.662
Δ top-1 under zero-first-16-byte header perturbation	−0.034
Δ top-1 under zero-first-64-byte header perturbation	−0.069
Disk-block top-1: 980 mid-file blocks, ext4 / seq-bulk	0.266 (vs libmagic 0.093, Magika 0.112; point-estimate leader across all 9 disk-matrix cells)

Full evaluation (within-cube bootstrap CIs, adversarial sweep, calibration, real-network curves, disk-block matrix, baselines against libmagic 5.46 and TrID 2.24) is in the paper.
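The header-perturbation rows can be read as "accuracy on perturbed windows minus accuracy on clean windows". The sketch below shows one way to build such a perturbation and compute the delta. It assumes the perturbation overwrites the leading bytes with 0x00 in place; the paper's exact procedure may differ, and `zero_header`, `top1_accuracy`, and the sample labels are hypothetical.

```python
def zero_header(window: bytes, n: int) -> bytes:
    """Overwrite the first n bytes with 0x00, keeping the length unchanged."""
    if n < 0:
        raise ValueError("n must be non-negative")
    n = min(n, len(window))
    return bytes(n) + window[n:]       # bytes(n) is n zero bytes


def top1_accuracy(predicted: list[str], expected: list[str]) -> float:
    """Fraction of positions where the top prediction equals the label."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


# Perturb a PNG-like window: the 8-byte signature is wiped, the body is kept.
png_like = b"\x89PNG\r\n\x1a\n" + b"IHDR-and-body"
perturbed = zero_header(png_like, 8)

# Made-up predictions, only to show how the delta is formed.
labels = ["image/png", "application/zip", "image/jpeg", "text/plain"]
clean_preds = ["image/png", "application/zip", "image/jpeg", "text/plain"]
perturbed_preds = ["application/octet-stream", "application/zip", "image/jpeg", "text/plain"]
delta = top1_accuracy(perturbed_preds, labels) - top1_accuracy(clean_preds, labels)
print(perturbed[:8], delta)
```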

Quick start

This cell ships a 125-class libmagic-MIME classifier head (the paper's LR probe, re-fit on the full magic-files corpus), so pipeline("text-classification", ...) works out of the box:

from transformers import pipeline

clf = pipeline("text-classification",
               model="mjbommar/mimelens-001-medium-bpe-64k-s1",
               trust_remote_code=True,
               top_k=5)

# The model reads the first ~1,022 tokens of whatever you pass (a prefix of the
# buffer, not the whole window). For whole-file triage, a short head window
# classifies magic-byte / compressed types better than a long one -- see
# "Choosing a window" below.
window = open("path/to/file", "rb").read(4096)
preds  = clf(window.decode("latin-1"))                 # latin-1 is a bijection over bytes
# preds[0] is the list of {label, score} sorted by score:
# [{"label": "image/png", "score": 0.97}, {"label": "image/jpeg", "score": 0.01}, ...]

To work with embeddings directly (fit a probe, kNN over a gallery, fine-tune a head):

import torch
from transformers import AutoModel, AutoTokenizer

repo  = "mjbommar/mimelens-001-medium-bpe-64k-s1"
model = AutoModel.from_pretrained(repo, trust_remote_code=True).eval()
tok   = AutoTokenizer.from_pretrained(repo)

window = open("path/to/file", "rb").read(4096)
inputs = tok(window.decode("latin-1"), max_length=1024, truncation=True,
             padding="max_length", return_tensors="pt")
with torch.no_grad():
    embedding = model(**inputs).pooler_output         # (1, 512)
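Once embeddings are extracted, kNN over a gallery reduces to a nearest-neighbour lookup. The sketch below uses plain Python lists and cosine similarity; the similarity measure is an assumption (this README does not state the one used for the kNN R@1 benchmark), and `cosine`, `nearest_label`, and `recall_at_1` are hypothetical helpers.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_label(query: list[float], gallery: list[tuple[list[float], str]]) -> str:
    """Label of the gallery entry most similar to the query."""
    _, label = max(gallery, key=lambda item: cosine(query, item[0]))
    return label


def recall_at_1(queries: list[tuple[list[float], str]],
                gallery: list[tuple[list[float], str]]) -> float:
    """Fraction of queries whose nearest gallery neighbour has the right label."""
    hits = sum(nearest_label(vec, gallery) == label for vec, label in queries)
    return hits / len(queries)


# Toy 3-dim vectors stand in for the model's 512-dim embeddings.
gallery = [([1.0, 0.0, 0.0], "image/png"), ([0.0, 1.0, 0.0], "application/zip")]
queries = [([0.9, 0.1, 0.0], "image/png"),
           ([0.2, 0.8, 0.0], "application/zip"),
           ([0.7, 0.3, 0.0], "application/zip")]   # lands nearer the PNG entry
print(recall_at_1(queries, gallery))
```

For a gallery of thousands of 512-dim vectors, a vectorised implementation (for example a single matrix product over normalised embeddings) would be the practical choice; the loop form here is only meant to show the logic.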

Choosing a window

The model reads the first ~1,022 tokens of whatever you pass: a prefix of the buffer (for this BPE cell, whatever tokenizes to ~1,022 tokens, typically the first ~1.5 to 2.5 KB), not the whole window.

Magic-byte / compressed types (PNG, ZIP, GZIP, JPEG): a short head window (256 B to 1 KB) classifies better than 4 KB. A long high-entropy body dilutes the header signal within the fixed token budget, and the model returns application/octet-stream on a mostly-opaque window. That is correct behaviour for genuinely high-entropy input, not a bug.
Fragments / packets: you cannot choose the offset, so pass what you have. This is the regime MimeLens is built for.
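The two cases above can be folded into a small input-preparation helper. This is a sketch under stated assumptions: `prepare_window` and the 1,024-byte default are hypothetical (the default is simply the upper end of the 256 B to 1 KB range suggested above), and the latin-1 decoding mirrors the Quick start.

```python
HEAD_BYTES = 1024  # assumed default, taken from the 256 B to 1 KB guidance


def prepare_window(buf: bytes, whole_file: bool, head_bytes: int = HEAD_BYTES) -> str:
    """Turn raw bytes into the string form passed to the tokenizer.

    whole_file=True : you control the offset, so keep only a short head window.
    whole_file=False: a fragment or packet, so pass what you have unchanged.
    """
    if head_bytes <= 0:
        raise ValueError("head_bytes must be positive")
    if whole_file:
        buf = buf[:head_bytes]
    # latin-1 maps each byte 0..255 to exactly one code point, so no byte is lost.
    return buf.decode("latin-1")


file_bytes = b"\x89PNG\r\n\x1a\n" + bytes(4088)    # synthetic 4 KB "file"
fragment = bytes(range(256)) * 2                    # synthetic 512-byte fragment
print(len(prepare_window(file_bytes, whole_file=True)),
      len(prepare_window(fragment, whole_file=False)))
```

The returned string can then be passed to the `clf` pipeline or the tokenizer from the Quick start in place of `window.decode("latin-1")`.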

Recommended deployment regimes

Header-corruption-prone inputs: drops only 2-7 pp under directed header perturbations, vs 8-16 pp for byte/bpe-16k.
Latency-bounded large-scale indexing: 2.08× CPU throughput over byte at the same backbone compute.

Training

This cell is one point of the 3 × 4 × {2,3} factorial cube described in the paper.

Corpus (33 GB, stratified multi-source): binary-30k (assorted ELF/PE/Mach-O), magic-frags (random 64 KB chunks across libmagic's full corpus), assorted packed/raw binaries, a glaurung-sourced binary corpus, Windows drivers.
Position-arbitrary windowing: 1024-token windows sampled uniformly at random across files and 64 KB fragments. No privileged "head of file" position. This is the design choice that makes MimeLens work on streaming / partial / random-offset inputs.
Objective: MLM with 30% mask ratio (BERT replacement schedule: 80% [MASK], 10% random, 10% original); tied input/output embeddings.
Pooling: mean-pool over body tokens for downstream tasks. The BERT-style cls_pool linear projection is not used: under MLM-only training it receives no gradient and remains byte-identical to its random initialisation across all 28 cube cells (paper §3.4 verifies this; left in the saved weights for architectural completeness only).
Optimisation: AdamW + cosine LR (peak 5e-4, 2,000-step warmup, 10% floor), bf16 mixed precision, gradient clipping to a maximum L2 norm of 1, effective batch 128 at sequence length 1024, 22,888 gradient updates.
Hardware: single RTX 4060 Ti (16 GB), ~18.0 h wall-clock for this cell.
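The objective and optimisation settings above can be sketched in a few lines. This is an illustration, not the training code: the linear warmup shape, the per-token independent selection, and the -100 "ignore" label (the usual PyTorch cross-entropy convention) are assumptions, and `lr_at` and `mask_tokens` are hypothetical names. Only the numeric settings come from the list above.

```python
import math
import random

PEAK_LR = 5e-4
WARMUP_STEPS = 2_000
TOTAL_STEPS = 22_888
FLOOR_FRACTION = 0.10      # schedule decays to 10% of the peak
IGNORE = -100              # assumed label for positions that are not scored


def lr_at(step: int) -> float:
    """Learning rate at a 0-based step: warmup (assumed linear), then cosine decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    floor = PEAK_LR * FLOOR_FRACTION
    return floor + (PEAK_LR - floor) * 0.5 * (1.0 + math.cos(math.pi * progress))


def mask_tokens(ids: list[int], mask_id: int, vocab_size: int,
                rng: random.Random, mask_ratio: float = 0.30):
    """Apply the 80/10/10 replacement schedule; return (inputs, labels)."""
    inputs, labels = list(ids), [IGNORE] * len(ids)
    for i, token in enumerate(ids):
        if rng.random() >= mask_ratio:
            continue                                # not selected for prediction
        labels[i] = token                           # the model must recover this
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                     # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)   # 10%: random token
        # remaining 10%: leave the original token in place
    return inputs, labels


print([lr_at(s) for s in (0, 1_999, 12_000, 22_888)])
inputs, labels = mask_tokens(list(range(10, 30)), mask_id=4,
                             vocab_size=65_536, rng=random.Random(0))
print(sum(label != IGNORE for label in labels), "of", len(labels), "positions selected")
```

`mask_id=4` is a placeholder; the real [MASK] id should be read from the tokenizer.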

Caveats

This is one cell of a 28-cell parent cube (36 released cells including the 8-cell short-sequence extension). Within-cube comparisons in the paper carry bootstrap CIs at n=3 seeds; some marginal orderings (byte vs bpe-16k at the largest size) are within seed noise and should be read as ties.
The training corpus is one 33 GB stratified multi-source binary sample. Results may not transfer to substantially different corpora.
All numbers are computed on data labelled by a single pipeline (libmagic-pinned). The absence of cross-validation against PRONOM, Siegfried, DROID, or IANA reference files is a documented limitation.
CPU latency at the medium size is ~155× slower than Magika v1.1 on a desktop CPU (hardware-dependent). For sub-millisecond whole-file triage on broad categories, Magika is purpose-built and is the right tool. MimeLens occupies a different point on the deployment surface (position-arbitrary inputs + libmagic's 125-class taxonomy) and is not a drop-in replacement.
End-to-end fine-tuning on the production label distribution may shift these numbers and should be evaluated before deployment. The frozen-probe numbers above are not claimed as a lower bound on fine-tuned performance.

Citation

@misc{bommarito2026mimelens,
  title  = {MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments},
  author = {Bommarito II, Michael J.},
  year   = {2026},
  note   = {https://github.com/mjbommar/mimelens-training},
}


Evaluation results

All three metrics are on magic-frags (4 KB head of 64 KB random chunks, n=4,096), as reported in the MimeLens paper (Bommarito 2026), Appendix A.

Metric	Value
top-1 accuracy	0.727
macro-F1	0.624
kNN R@1	0.662