---
license: mit
library_name: transformers
language:
  - en
tags:
  - file-type-detection
  - mime-classification
  - binary-content
  - binary-analysis
  - position-agnostic
  - libmagic
  - forensics
  - packet-inspection
  - byte-level
  - mimelens
pipeline_tag: text-classification
model-index:
  - name: mimelens-001-tiny-byte-s2
    results:
      - task:
          type: feature-extraction
          name: MIME-125 classification (libmagic 125-class taxonomy)
        dataset:
          name: magic-frags (4 KB head of 64 KB random chunks, n=4,096)
          type: custom
        metrics:
          - name: top-1 accuracy
            type: accuracy
            value: 0.7393
          - name: macro-F1
            type: f1
            value: 0.5929
          - name: kNN R@1
            type: recall@1
            value: 0.6786
        source:
          name: "MimeLens paper (Bommarito 2026), Appendix A"
          url: https://github.com/mjbommar/mimelens-training
---

# mimelens-001-tiny-byte-s2

A 3.15M-backbone-parameter BERT-style encoder for position-agnostic file-content-type detection from binary data. It reads a byte window taken from *any* offset in a file (the first ~1,022 tokens of whatever you pass) and produces a 256-dimensional embedding that classifiers map to one of [libmagic](https://github.com/file/file)'s 125 MIME labels. Designed for inputs where you only have a chunk: a forensic-carved fragment, a random disk-block read, a streaming HTTP upload, a single network packet payload.

- **🔗 Model**: [`mjbommar/mimelens-001-tiny-byte-s2`](https://huggingface.co/mjbommar/mimelens-001-tiny-byte-s2)
- **👥 Family**: [`mjbommar/mimelens-001`](https://huggingface.co/mjbommar/mimelens-001) (36 released cells: 28 parent + 8 short-sequence)
- **📄 Paper**: *MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments* (Bommarito 2026)
- **💻 Training code**: [`mjbommar/mimelens-training`](https://github.com/mjbommar/mimelens-training)
- **📊 Pretraining corpus**: [`mjbommar/binary-30k-tokenized`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized) plus magic-corpus extracts, packed binaries, a [`glaurung`](https://github.com/mjbommar/glaurung)-sourced binary corpus, and Windows drivers (33 GB stratified; the full corpus is not redistributable)

---

## What MimeLens does

MimeLens classifies file content type from a byte window taken at any offset, not just the header of a complete file.

Existing tools assume whole-file access at a known offset:

- [`libmagic`](https://github.com/file/file) and [Apache Tika](https://tika.apache.org/) match handcrafted magic-byte signatures, almost always anchored at the file head.
- [Magika](https://github.com/google/magika) (Google) is a small (~1 M-parameter) feedforward network over three 512-byte windows (head, middle, tail) of a known-bounded file.
- TrID, PRONOM/Siegfried/DROID similarly require a complete file.

These break down on a fragment. MimeLens is pretrained MLM-only on 1024-token windows sampled *uniformly at random* across files and 64 KB fragments, with no privileged head-of-file position. One checkpoint handles streaming, partial-arrival, mid-file, packet-payload, and forensic-carved inputs uniformly. The trade-off is CPU latency (roughly two orders of magnitude slower than Magika at the medium size; hardware-dependent) in exchange for libmagic's 125-class taxonomy plus position arbitrariness.

The family ships 28 parent cells (3 sizes × 4 vocabs × 2-3 seeds at seq\_len=1024) plus an 8-cell short-sequence extension (medium tier × 4 vocabs × 2 seeds at seq\_len=256). This README documents one of them.

> **Short-sequence sibling available.** If your inputs are sub-KB (DNS payloads, sub-MTU packets, small forensic fragments), use `mjbommar/mimelens-001-tiny-byte-s2-seq256` instead. Same architecture, 4× shorter context, ~5× lower CPU latency, BPE-cell accuracy ties or beats this cell on the magic-files probe-fit. See paper Appendix B.5.



---

## Overview

- **This cell**: `tiny` tier, `byte` input pipeline, seed `2`
- **Backbone**: 3.15M parameters (4 layers, hidden 256, 4 attention heads, head dim 64, RoPE, RMSNorm, no biases, no dropout)
- **Input vocabulary**: `byte`. The 256 raw byte values plus 5 special tokens (CLS, SEP, PAD, UNK, MASK); id = byte_value + 5. The model reads exactly the first 1,022 bytes that arrive.
- **Output**: 256-dim mean-pooled body-token embedding
- **Label space**: [libmagic](https://github.com/file/file) 125-class MIME taxonomy (full list in paper Appendix)
- **Pretraining**: MLM-only, 30% mask ratio, 33 GB stratified multi-source binary corpus, 22,888 gradient updates, single RTX 4060 Ti, ~2.7 h wall-clock
- **License**: MIT
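
The byte pipeline described above can be sketched without the tokenizer. This is a minimal illustration, not the shipped tokenizer: the card fixes only `id = byte_value + 5`, so the specific special-token ids (`CLS_ID` and so on), the single CLS/SEP framing that leaves 1,022 body slots in a 1,024-token sequence, and the helper name `encode_window` are all assumptions made for the example.

```python
# Assumed special-token ids. The card says the five specials occupy the
# ids below 5 (id = byte_value + 5), but their order here is a guess:
# check the tokenizer's vocabulary before relying on it.
CLS_ID, SEP_ID, PAD_ID, UNK_ID, MASK_ID = 0, 1, 2, 3, 4
BYTE_OFFSET = 5          # from the card: id = byte_value + 5
MAX_LEN = 1024           # sequence length for this cell
BODY_LEN = MAX_LEN - 2   # 1,022 byte slots, assuming one CLS and one SEP


def encode_window(window: bytes) -> tuple[list[int], list[int]]:
    """Return (input_ids, attention_mask) for a raw byte window."""
    # Only the first BODY_LEN bytes are kept; the rest is never seen.
    body = [b + BYTE_OFFSET for b in window[:BODY_LEN]]
    ids = [CLS_ID, *body, SEP_ID]
    mask = [1] * len(ids)
    pad = MAX_LEN - len(ids)
    return ids + [PAD_ID] * pad, mask + [0] * pad
```

For real use, prefer the repository's tokenizer; the sketch is only meant to make the offset-by-5 mapping and the 1,022-byte budget concrete.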

## Headline benchmarks (this cell)

| Benchmark | Value |
|---|---|
| MIME-125 top-1 (magic-frags, 4 KB head, n=4,096)            | **0.739** |
| MIME-125 macro-F1 (magic-frags, 4 KB head)                  | 0.593 |
| kNN R@1 (magic-frags, 3,147-file gallery / 949 queries)     | 0.679 |

Full evaluation (within-cube bootstrap CIs, adversarial sweep, calibration, real-network curves, disk-block matrix, baselines against libmagic 5.46 and TrID 2.24) is in the [paper](https://github.com/mjbommar/mimelens-training).

---

## Quick start

This cell publishes the encoder only (no classifier head baked in). Use it to extract embeddings, then fit a probe, run kNN over a labelled gallery, or fine-tune a head:

```python
import torch
from transformers import AutoModel, AutoTokenizer

repo  = "mjbommar/mimelens-001-tiny-byte-s2"
# trust_remote_code=True executes model code shipped in the repo; review it first.
model = AutoModel.from_pretrained(repo, trust_remote_code=True).eval()
tok   = AutoTokenizer.from_pretrained(repo)

# Read a byte window; latin-1 maps each byte 0-255 to exactly one character.
with open("path/to/file", "rb") as f:
    window = f.read(4096)
inputs = tok(window.decode("latin-1"), max_length=1024, truncation=True,
             padding="max_length", return_tensors="pt")
with torch.no_grad():
    embedding = model(**inputs).pooler_output         # (1, 256)
```

The pre-fit LR probe weights for this cell are not bundled here. The deployed cells and per-size winners (e.g. `mimelens-001-medium-bpe-16k-s1`) ship a baked classifier head for a one-line `pipeline()` path.
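
Because no head is bundled, one lightweight option is nearest-neighbour lookup over a labelled gallery of embeddings. The sketch below is a pure-Python illustration under stated assumptions: cosine similarity is chosen for the example (the card does not say which metric the paper's kNN evaluation uses), and the helper names `cosine`, `knn_label`, and `recall_at_1` are hypothetical.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def knn_label(query, gallery, labels):
    """Return (label, similarity) of the single nearest gallery vector."""
    best = max(range(len(gallery)), key=lambda i: cosine(query, gallery[i]))
    return labels[best], cosine(query, gallery[best])


def recall_at_1(queries, query_labels, gallery, labels):
    """Fraction of queries whose nearest gallery item has the same label."""
    hits = sum(knn_label(q, gallery, labels)[0] == y
               for q, y in zip(queries, query_labels))
    return hits / len(queries)
```

With the quick-start output, a query vector would be `embedding[0].tolist()`. For galleries of more than a few thousand items, a vectorised or indexed search is the more practical choice.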


---

## Choosing a window

The model reads the first ~1,022 tokens of whatever you pass: a prefix of the buffer (the first 1,022 bytes for this byte cell), not the whole window.

- **Magic-byte / compressed types** (PNG, ZIP, GZIP, JPEG): a **short head window (256 B to 1 KB) classifies better than 4 KB**. A long high-entropy body dilutes the header signal within the fixed token budget, and the model returns `application/octet-stream` on a mostly-opaque window. That is correct behaviour for genuinely high-entropy input, not a bug.
- **Fragments / packets**: you cannot choose the offset, so pass what you have. This is the regime MimeLens is built for.
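
The guidance above can be made concrete with two small helpers. This is a hypothetical sketch, not part of MimeLens: `effective_window` returns the prefix this byte cell would actually read, and `byte_entropy` is an illustrative diagnostic (Shannon entropy in bits per byte) for judging whether a window is mostly opaque before interpreting an `application/octet-stream` result. No particular entropy threshold is implied by the card.

```python
import math
from collections import Counter

BODY_BYTES = 1022  # bytes this byte cell reads, per the card


def effective_window(buf: bytes, head_bytes: int = BODY_BYTES) -> bytes:
    """Return the prefix the model will see, optionally shortened further."""
    return buf[:min(head_bytes, BODY_BYTES)]


def byte_entropy(window: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (0.0 to 8.0)."""
    if not window:
        return 0.0
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())
```

For a header-identified type, `effective_window(data, 256)` gives the short head window; for a fragment, pass the buffer unchanged and let the default 1,022-byte limit apply.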

---

## Recommended deployment regimes

See the family hub README ([`mjbommar/mimelens-001`](https://huggingface.co/mjbommar/mimelens-001)) for the regime decision tree.

---

## Training

This cell is one point of the 3 × 4 × 2-3 (size × vocabulary × seed) factorial cube described in the paper.

- **Corpus** (33 GB, stratified multi-source): [`binary-30k`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized) (assorted ELF/PE/Mach-O), magic-frags (random 64 KB chunks across libmagic's full corpus), assorted packed/raw binaries, a [`glaurung`](https://github.com/mjbommar/glaurung)-sourced binary corpus, Windows drivers.
- **Position-arbitrary windowing**: 1024-token windows sampled uniformly at random across files and 64 KB fragments. **No privileged "head of file" position.** This is the design choice that makes MimeLens work on streaming / partial / random-offset inputs.
- **Objective**: MLM with 30% mask ratio (BERT replacement schedule: 80% `[MASK]`, 10% random, 10% original); tied input/output embeddings.
- **Pooling**: mean-pool over body tokens for downstream tasks. The BERT-style `cls_pool` linear projection is *not* used: under MLM-only training it receives no gradient and remains byte-identical to its random initialisation across all 28 cube cells (paper §3.4 verifies this; left in the saved weights for architectural completeness only).
- **Optimisation**: AdamW + cosine LR (peak 5e-4, 2,000-step warmup, 10% floor), bf16 mixed precision, gradient clipping at $\|g\|_2 \leq 1$, effective batch 128 at sequence length 1024, 22,888 gradient updates.
- **Hardware**: single RTX 4060 Ti (16 GB), ~2.7 h wall-clock for this cell.
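
Three of the training choices listed above (random-offset windowing, the 80/10/10 masking schedule, and the warmup-plus-cosine learning rate) can be sketched in plain Python. This is an illustration under assumptions, not the training code: the `MASK_ID` value, linear warmup, reading "10% floor" as a minimum of 10% of the peak rate, and the function names are all hypothetical. The label value `-100` follows the default `ignore_index` of PyTorch's cross-entropy loss.

```python
import math
import random

# Assumed ids: 5 specials below 5, then 256 byte values (vocab of 261).
MASK_ID, NUM_SPECIAL, VOCAB = 4, 5, 261


def sample_window(data: bytes, length: int, rng: random.Random) -> bytes:
    """Pick a window at a uniformly random offset (no head-of-file bias)."""
    if len(data) <= length:
        return data
    start = rng.randrange(len(data) - length + 1)
    return data[start:start + length]


def mask_tokens(ids, rng, ratio=0.30):
    """BERT-style corruption of body-token ids.

    Returns (corrupted_ids, labels); labels are -100 where not scored.
    """
    out, labels = list(ids), [-100] * len(ids)
    for i, tok in enumerate(ids):
        if rng.random() >= ratio:
            continue                      # position not selected
        labels[i] = tok                   # predict the original token here
        r = rng.random()
        if r < 0.8:
            out[i] = MASK_ID              # 80%: replace with [MASK]
        elif r < 0.9:
            out[i] = rng.randrange(NUM_SPECIAL, VOCAB)  # 10%: random byte id
        # remaining 10%: leave the original token in place
    return out, labels


def lr_at(step, total=22_888, warmup=2_000, peak=5e-4, floor=0.10):
    """Linear warmup to peak, then cosine decay to floor * peak."""
    if step < warmup:
        return peak * step / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    cos = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak * (floor + (1.0 - floor) * cos)
```

Under these assumptions the rate peaks at 5e-4 at step 2,000 and ends at 5e-5 at step 22,888.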

---

## Caveats

- This is one cell of a 28-cell parent cube (36 released cells including the 8-cell short-sequence extension). Within-cube comparisons in the paper carry bootstrap CIs at n=2 seeds; some marginal orderings (byte vs bpe-16k at the largest size) are within seed noise and should be read as ties.
- The training corpus is one 33 GB stratified multi-source binary sample. Results may not transfer to substantially different corpora.
- All numbers are computed on data labelled by a single pipeline (libmagic-pinned). Labels have not been cross-validated against PRONOM, Siegfried, DROID, or IANA reference files; this is a documented limitation.
- CPU latency at the `medium` size is ~155× slower than Magika v1.1 on a desktop CPU (hardware-dependent). For sub-millisecond whole-file triage on broad categories, Magika is purpose-built and is the right tool. MimeLens occupies a different point on the deployment surface (position-arbitrary inputs + libmagic's 125-class taxonomy), not a drop-in replacement.
- End-to-end fine-tuning on the production label distribution may shift these numbers and should be evaluated before deployment. The frozen-probe numbers above are not claimed as a lower bound on fine-tuned performance.

---

## Citation

```bibtex
@misc{bommarito2026mimelens,
  title  = {MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments},
  author = {Bommarito II, Michael J.},
  year   = {2026},
  note   = {https://github.com/mjbommar/mimelens-training},
}
```