---
license:
- mit
- other
license_link: https://github.com/Biohub/esm/blob/main/THIRD_PARTY_NOTICE.md
library_name: transformers
language: en
tags:
- biology
- esm
- protein
- protein-language-model
- protein-embeddings
- masked-language-modeling
- transfer-learning
- variant-effect-prediction
- protein-engineering
- transformers
---

# Model Card for ESMC

## Model Details

ESMC is a state-of-the-art protein language model that has learned the rules of protein biology from training on billions of protein sequences. ESMC provides representations of proteins enabling novel AI applications from therapeutic protein engineering to unlocking basic insights into protein biology across life.

The ESMC 6B model has 6 billion parameters and 80 layers, and was trained with 2e23 FLOPs. We additionally release overtrained 300M and 600M parameter variants of ESMC for local inference and fine-tuning.

The [ESMFold2](https://huggingface.co/biohub/ESMFold2) structure prediction models are trained on top of a frozen ESMC 6B language model. ESMFold2 is a state-of-the-art model for protein structure prediction and design that defines a new frontier for speed and accuracy.

The [ESMC sparse autoencoder](https://huggingface.co/Biohub/esmc-6b-2024-12-sae-sweep-layer60-k64-codebook16384), `ESMC-6B-sae-layer60-k64-codebook16384`, is built on the ESMC 6B model and provides human-interpretable, agent-generated feature descriptions. See the [model card](https://huggingface.co/Biohub/ESMC-6B-sae-sweep-layer60-k64-codebook16384) for details and learn more about the ESMC SAEs [here](https://huggingface.co/biohub/ESMC-SAE-Overview).

To run this model with the Biohub Platform API, visit the [Biohub Platform](https://biohub.ai/).

Read more about ESMC in our paper [here](https://biohub.ai/papers/esm_protein.pdf).

### Example Usage

```py
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

GFP = "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"

model = AutoModelForMaskedLM.from_pretrained("biohub/ESMC-6B", device_map="auto").eval()
tokenizer = AutoTokenizer.from_pretrained("biohub/ESMC-6B")

inputs = tokenizer(GFP, return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode():
    output = model(**inputs)

print(f"logits shape: {tuple(output.logits.shape)}")
```

By default, the model returns only the final-layer representations. To also return the hidden states from **all transformer layers**, pass `output_hidden_states=True`:

```py
output = model(**inputs, output_hidden_states=True)
```

For detailed usage, refer to the [Usage section below](#usage).

### Citation

ESM Team. "ESM Cambrian: Revealing the mysteries of proteins with unsupervised learning." EvolutionaryScale Website, December 4, 2024. [Paper.](https://biohub.ai/papers/esmc.pdf)

### Model Architecture

ESMC is based on the transformer architecture. It features Pre-LN, rotary embeddings, and SwiGLU activations. No biases are used in linear layers or layer norms.

### Parameters

ESMC was trained at multiple scales:

| Model | Parameters | Layers | Training FLOPs |
| :---- | ----: | ----: | ----: |
| **ESMC-300M** | 300M | 30 | 1×10²² |
| **ESMC-600M** | 600M | 36 | 2×10²² |
| **ESMC-6B** | 6B | 80 | 2×10²³ |

![][pal]
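
As a rough sanity check on the table above, the common rule of thumb that training compute is about 6 × parameters × training tokens can be inverted to estimate how many tokens each model saw. This card does not state that approximation; it is an assumption, and the printed numbers are estimates rather than reported values. The parameter counts are the nominal ones from the table.

```python
# Assumption: training FLOPs ~= 6 * parameters * tokens (a widely used rule of
# thumb, not something this model card states).
MODELS = {
    "ESMC-300M": (300e6, 1e22),
    "ESMC-600M": (600e6, 2e22),
    "ESMC-6B": (6e9, 2e23),
}


def implied_tokens(params: float, flops: float) -> float:
    """Estimate training tokens from nominal parameter count and FLOPs."""
    return flops / (6.0 * params)


for name, (params, flops) in MODELS.items():
    tokens = implied_tokens(params, flops)
    print(
        f"{name}: ~{tokens / 1e12:.1f}T tokens, "
        f"~{tokens / params:,.0f} tokens per parameter"
    )
```

Under this assumption the three compute budgets imply a similar token count, so the smaller models see far more tokens per parameter, which is consistent with the card describing them as overtrained.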

### Model Variants

| Model Variant | Description | URL |
| :---- | :---- | :---- |
| ESMC 300M | Smallest variant, publicly released. | [https://huggingface.co/Biohub/ESMC-300M](https://huggingface.co/Biohub/ESMC-300M) |
| ESMC 600M | Medium variant, publicly released. | [https://huggingface.co/Biohub/ESMC-600M](https://huggingface.co/Biohub/ESMC-600M) |
| ESMC 6B | Large variant, available via API. | [https://huggingface.co/Biohub/ESMC-6B](https://huggingface.co/Biohub/ESMC-6B) |

### System Requirements

- Compute Requirements: GPU
- PyTorch environment with GPU support recommended.
- Recommended optional libraries: `transformer_engine`, `xformers`
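
To gauge how much GPU memory the weights alone need, multiply the parameter count by the bytes per parameter. The sketch below uses the nominal parameter counts from the Parameters table. It ignores activations, attention buffers, and framework overhead, so the results should be read as lower bounds rather than measured requirements.

```python
# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

# Nominal parameter counts from the Parameters table (not exact counts).
NOMINAL_PARAMS = {"ESMC-300M": 300e6, "ESMC-600M": 600e6, "ESMC-6B": 6e9}


def weight_memory_gib(n_params: float, dtype: str) -> float:
    """Memory needed for the weights only, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


for name, n_params in NOMINAL_PARAMS.items():
    fp32 = weight_memory_gib(n_params, "float32")
    bf16 = weight_memory_gib(n_params, "bfloat16")
    print(f"{name}: ~{fp32:.1f} GiB in float32, ~{bf16:.1f} GiB in bfloat16")
```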

## Training Data

ESMC was trained on protein sequences from UniRef, MGnify, and the Joint Genome Institute (JGI). Sequence data was clustered at 70% sequence identity, resulting in 83M, 372M, and 2B clusters for UniRef, MGnify, and JGI, respectively.

### Training Procedure

Training was conducted in two stages:

- Stage 1: For the first 1 million steps, the model used a context length of 512, with metagenomic data constituting 64% of the training dataset.
- Stage 2: In the final 500,000 steps, the context length was increased to 2048, and the proportion of metagenomic data was reduced to 37.5%.
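
The two stages above can be written down as a small schedule, which makes it easy to derive totals. This is only an illustrative sketch: the `Stage` class and its field names are hypothetical, and the average is weighted by steps, not tokens, because the card does not give per-stage batch sizes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    steps: int
    context_length: int
    metagenomic_fraction: float


# Values taken from the two stages described above.
SCHEDULE = [
    Stage(steps=1_000_000, context_length=512, metagenomic_fraction=0.64),
    Stage(steps=500_000, context_length=2048, metagenomic_fraction=0.375),
]


def total_steps(schedule) -> int:
    return sum(stage.steps for stage in schedule)


def step_weighted_metagenomic_fraction(schedule) -> float:
    """Average metagenomic share, weighting each stage by its step count."""
    weighted = sum(s.steps * s.metagenomic_fraction for s in schedule)
    return weighted / total_steps(schedule)


print(f"total steps: {total_steps(SCHEDULE):,}")
print(f"step-weighted metagenomic share: {step_weighted_metagenomic_fraction(SCHEDULE):.1%}")
```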

## Performance Metrics

Performance metrics are detailed on our [blog announcing ESMC](http://biohub.ai/esmc).

## Usage

### Flash Attention

Instead of scaled dot-product attention (SDPA), you can use a FlashAttention backend. This requires running the model in bfloat16.

```py
import torch
from transformers import AutoModelForMaskedLM

model = (
    AutoModelForMaskedLM.from_pretrained(
        "biohub/ESMC-6B",
        dtype=torch.bfloat16,
        device_map="auto",
        attn_implementation="flash_attention_2",
    )
    .to(torch.bfloat16)
    .eval()
)
```

### Sparse Autoencoder (SAE)

To get interpretable features from ESMC 6B hidden states and per-layer residual updates, you can choose from our pretrained SAEs. We provide the following three collections:

* [ESMC SAEs for hidden states (all layers)](https://huggingface.co/collections/biohub/esmc-saes-for-hidden-states-all-layers)
* [ESMC SAEs for one layer (different sparsity/codebook size)](https://huggingface.co/collections/biohub/esmc-saes-for-one-layer-different-sparsity-codebook-size)
* [ESMC SAEs for MLP outputs (all layers)](https://huggingface.co/collections/biohub/esmc-saes-for-mlp-outputs-all-layers)

```py
import torch
from transformers import AutoModel, AutoModelForMaskedLM, AutoTokenizer

GFP = "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"

model = AutoModelForMaskedLM.from_pretrained("biohub/ESMC-6B", device_map="auto").eval()
tokenizer = AutoTokenizer.from_pretrained("biohub/ESMC-6B")

sae_models = []
sae = AutoModel.from_pretrained(
    "biohub/ESMC-6B-sae-sweep-layer60-k64-codebook16384", device_map="auto"
)
sae_models.append(sae)

model.add_sae_models(sae_models)

inputs = tokenizer(GFP, return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode():
    output = model(**inputs)

print(f"logits shape: {tuple(output.logits.shape)}")
print(f"num SAE outputs: {len(output.sae_outputs)}")
for i, sae_out in enumerate(output.sae_outputs):
    print(f"  SAE[{i}]: {type(sae_out).__name__}")
```

### Masked Language Modeling

ESMC can predict masked amino acids and compute the corresponding loss:

```py
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

GFP = "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"
masked_GFP = "<mask>SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"

model = AutoModelForMaskedLM.from_pretrained("biohub/ESMC-6B", device_map="auto").eval()
tokenizer = AutoTokenizer.from_pretrained("biohub/ESMC-6B")

inputs = tokenizer(masked_GFP, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

labels = tokenizer(GFP, return_tensors="pt")["input_ids"].to(model.device)
# Only the masked positions contribute to the loss; everything else gets the
# ``-100`` ignore-index that ``CrossEntropyLoss`` skips.
labels = torch.where(inputs["input_ids"] == tokenizer.mask_token_id, labels, -100)

with torch.inference_mode():
    output = model(**inputs, labels=labels)

print(f"Loss: {output.loss.item():.6f}")
```
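
The example above masks the first residue by writing `<mask>` into the sequence string by hand. A small helper can build such inputs for arbitrary positions. This is a sketch, not part of the ESMC API: `mask_residues` is a hypothetical name, and it assumes the tokenizer treats the literal `<mask>` inside a sequence string as its mask token, as the example above does.

```python
MASK_TOKEN = "<mask>"  # literal used in the example above


def mask_residues(sequence: str, positions) -> str:
    """Return `sequence` with the residues at 0-based `positions` masked."""
    positions = set(positions)
    out_of_range = sorted(p for p in positions if not 0 <= p < len(sequence))
    if out_of_range:
        raise IndexError(f"positions out of range: {out_of_range}")
    return "".join(
        MASK_TOKEN if i in positions else residue
        for i, residue in enumerate(sequence)
    )


print(mask_residues("MSKGEELFTG", [0]))  # <mask>SKGEELFTG
```

The returned string can be passed to the tokenizer in place of `masked_GFP`, with the unmasked sequence used to build the labels.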

### Fine-tuning with peft

```py
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("biohub/ESMC-6B", device_map="auto")

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.01,
    target_modules=["layernorm_qkv.1", "out_proj", "ffn.1", "ffn.3"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```
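
To see why LoRA trains so few parameters, note that for each targeted linear layer it adds two low-rank matrices, `A` (r × in_features) and `B` (out_features × r), and scales their product by `lora_alpha / r` (the default scaling in peft). The sketch below works through that arithmetic for the `r=8`, `lora_alpha=16` configuration above. The 4096 × 4096 layer shape is a hypothetical placeholder, not the actual ESMC dimension.

```python
def lora_added_params(in_features: int, out_features: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer (A and B matrices)."""
    return r * (in_features + out_features)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update B @ A."""
    return lora_alpha / r


# Hypothetical layer shape, for illustration only.
in_features, out_features = 4096, 4096
added = lora_added_params(in_features, out_features, r=8)
full = in_features * out_features

print(f"added parameters: {added:,} ({added / full:.2%} of the frozen weight)")
print(f"scaling factor: {lora_scaling(lora_alpha=16, r=8)}")
```

For the real model, `model.print_trainable_parameters()` in the block above reports the actual totals.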

### Attention Maps

To extract attention maps, pass `output_attentions=True`. Note: this is incompatible with `attn_implementation="flash_attention_2"`.

```py
output = model(**inputs, output_attentions=True)
# output.attentions: tuple of (batch, n_heads, seq_len, seq_len) tensors, one per layer
```

`output_attentions=True` triggers a manual, unoptimized attention path to extract the attention maps, which will reduce inference speed.
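
Attention maps also cost memory, because every layer returns a `(batch, n_heads, seq_len, seq_len)` tensor. The sketch below estimates that footprint. The 80 layers come from this card, while the head count of 64 and the 2 bytes per value (bfloat16) are assumptions made only for illustration.

```python
def attention_map_bytes(
    n_layers: int,
    n_heads: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,
) -> int:
    """Bytes needed to hold the attention maps of every layer at once."""
    return n_layers * batch_size * n_heads * seq_len * seq_len * bytes_per_value


N_LAYERS = 80  # from this model card
N_HEADS = 64   # hypothetical; check model.config for the real value

for seq_len in (256, 512, 2048):
    gib = attention_map_bytes(N_LAYERS, N_HEADS, seq_len) / 2**30
    print(f"seq_len={seq_len}: ~{gib:.2f} GiB of attention maps")
```

Because the cost grows with the square of the sequence length, it may be worth requesting attention maps only for the sequences that need them.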

### Other Usage

You can access the base model without the pretrained LM head:

```py
import torch
from transformers import AutoModel, AutoTokenizer

GFP = "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"

model = AutoModel.from_pretrained("biohub/ESMC-6B", device_map="auto").eval()
tokenizer = AutoTokenizer.from_pretrained("biohub/ESMC-6B")

inputs = tokenizer(GFP, return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode():
    output = model(**inputs)

print(f"last_hidden_state shape: {tuple(output.last_hidden_state.shape)}")
```

Or use ESMC for Token Classification:

```py
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

GFP = "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"

model = AutoModelForTokenClassification.from_pretrained(
    "biohub/ESMC-6B", device_map="auto"
).eval()
tokenizer = AutoTokenizer.from_pretrained("biohub/ESMC-6B")

inputs = tokenizer(GFP, return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode():
    output = model(**inputs)

predicted_token_class_ids = output.logits.argmax(-1)
predicted_tokens_classes = [
    model.config.id2label[t.item()] for t in predicted_token_class_ids[0]
]
print(f"logits shape: {tuple(output.logits.shape)}")
print(f"first 8 predicted classes: {predicted_tokens_classes[:8]}")
```

Or for Sequence Classification:

```py
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

GFP = "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"

model = AutoModelForSequenceClassification.from_pretrained(
    "biohub/ESMC-6B", device_map="auto", num_labels=2
).eval()
tokenizer = AutoTokenizer.from_pretrained("biohub/ESMC-6B")

inputs = tokenizer(GFP, return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode():
    output = model(**inputs)

print(f"logits shape: {tuple(output.logits.shape)}")
```

For Token or Sequence Classification, the classifier head is not pretrained; it is meant to be fine-tuned on your downstream task.

## Frontier Safety

Biohub has established a safety team to assess the benefits and potential risks of our models and tools prior to release, and develop mitigations where necessary. Informed by our risk assessments, we are releasing the source code and model weights for ESMC 6B, ESMFold2, and ESMC SAEs. We are also releasing our ESM Atlas dataset and binder design system openly.

Prior to release, we conducted evaluations to inform our understanding of capability uplift for specific misuse-relevant functional tasks. The full details of these evaluations are available in our corresponding paper appendix.

[Biohub.ai](http://Biohub.ai) Platform: We implement guardrails that detect and restrict the use of keywords and sequences corresponding to controlled pathogens and toxins on our freely accessible platform. For further details regarding these guardrails, please refer to our Biohub platform Resources page.

## Biases and Limitations

### Potential Biases

- **Dataset bias:** Over- or under-representation of taxa, protein families, or ecological niches in public sequence and structure databases influences generalization and can bias outputs. This is partially mitigated by clustering-based, nonredundant sampling.

### Limitations

- **Context window:** ESMC has a context window limit of 2048 tokens.
- **Reliance on in-silico metrics:** Computational metrics do not replace wet-lab validation.
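
One way to work within the 2048-token context window is to split longer sequences into overlapping windows and run the model on each window separately. The sketch below is one possible approach, not a method prescribed by this card. `split_into_windows` is a hypothetical helper, the 256-residue overlap is arbitrary, and it assumes the tokenizer adds two special tokens per sequence (one at the start and one at the end) and uses one token per residue.

```python
def split_into_windows(sequence, max_tokens=2048, n_special=2, overlap=256):
    """Split `sequence` into (start_index, subsequence) windows that fit the context."""
    max_residues = max_tokens - n_special
    if not 0 <= overlap < max_residues:
        raise ValueError("overlap must be in [0, max_tokens - n_special)")
    if len(sequence) <= max_residues:
        return [(0, sequence)]

    stride = max_residues - overlap
    windows = []
    start = 0
    while True:
        end = min(start + max_residues, len(sequence))
        windows.append((start, sequence[start:end]))
        if end == len(sequence):
            break
        start += stride
    return windows


long_sequence = "A" * 5000  # placeholder sequence for illustration
for start, window in split_into_windows(long_sequence):
    print(f"start={start}, residues={len(window)}")
```

Residues in the overlapping regions receive more than one prediction, so the caller has to decide how to merge them, for example by averaging. Windowed inference also cannot capture interactions between residues that never share a window.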

### Out-of-Scope or Unauthorized Use Cases

Do not use the model for the following purposes:

- Any use that is prohibited by the [Acceptable Use Policy](https://biohub.org/acceptable-use-policy/).

### Caveats and Recommendations

- Review and validate outputs generated by the model.
- We are committed to advancing the responsible development and use of artificial intelligence.
- Should you have any security or privacy issues or questions related to the services, please reach out to our team at [support@biohub.org](mailto:support@biohub.org).

[pal]: images/contact_pal.png