---
license:
- mit
- other
license_link: https://github.com/Biohub/esm/blob/main/THIRD_PARTY_NOTICE.md
library_name: transformers
language: en
tags:
- biology
- esm
- protein
- protein-language-model
- protein-embeddings
- masked-language-modeling
- transfer-learning
- variant-effect-prediction
- protein-engineering
- transformers
---
# Model Card for ESMC
## Model Details
ESMC is a state-of-the-art protein language model that has learned the rules of protein biology from training on billions of protein sequences. ESMC provides representations of proteins enabling novel AI applications from therapeutic protein engineering to unlocking basic insights into protein biology across life.
The ESMC 6B model has 6 billion parameters and 80 layers, and was trained with 2×10²³ FLOPs. We additionally release overtrained 300M and 600M parameter variants of ESMC for local inference and fine-tuning.
The [ESMFold2](https://huggingface.co/biohub/ESMFold2) structure prediction models are trained on top of a frozen ESMC 6B language model. ESMFold2 is a state-of-the-art model for protein structure prediction and design that defines a new frontier for speed and accuracy.
The [ESMC sparse autoencoder](https://huggingface.co/Biohub/esmc-6b-2024-12-sae-sweep-layer60-k64-codebook16384), `ESMC-6B-sae-layer60-k64-codebook16384`, is built on the ESMC 6B model and provides human-interpretable, agent-generated feature descriptions. See the [model card](https://huggingface.co/Biohub/ESMC-6B-sae-sweep-layer60-k64-codebook16384) for details and learn more about the ESMC SAEs [here](https://huggingface.co/biohub/ESMC-SAE-Overview).
To run this model with the Biohub Platform API, visit the [Biohub Platform](https://biohub.ai/).
Read more about ESMC in our paper [here](https://biohub.ai/papers/esm_protein.pdf).
### Example Usage
```py
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
GFP = "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"
model = AutoModelForMaskedLM.from_pretrained("biohub/ESMC-6B", device_map="auto").eval()
tokenizer = AutoTokenizer.from_pretrained("biohub/ESMC-6B")
inputs = tokenizer(GFP, return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
# Run a single forward pass without tracking gradients.
with torch.inference_mode():
    output = model(**inputs)
print(f"logits shape: {tuple(output.logits.shape)}")
```
By default, the model returns only the final layer representations. To return hidden states from **all transformer layers**, set:
```py
output = model(**inputs, output_hidden_states=True)
```
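A common way to turn per-residue hidden states into one embedding per protein is masked mean pooling. The sketch below uses random tensors in place of real model output so that it runs without downloading weights. The helper name `mean_pool` and the toy shapes are illustrative assumptions, not part of the ESMC API, and it assumes the hidden states follow the usual Transformers layout of `(batch, seq_len, dim)`.

```python
import torch


def mean_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average hidden states over non-padding positions.

    hidden: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    """
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)                    # (batch, dim)
    counts = mask.sum(dim=1).clamp(min=1.0)                # avoid division by zero
    return summed / counts


# Random stand-ins for one layer of hidden states and the tokenizer's attention mask.
hidden = torch.randn(2, 5, 8)
attention_mask = torch.tensor([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]])
embeddings = mean_pool(hidden, attention_mask)
print(embeddings.shape)
```

Note that this average includes any special tokens the tokenizer adds; whether to exclude them is a choice left to the downstream task.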
For detailed usage, refer to the [Usage section below](#usage).
### Citation
ESM Team. "ESM Cambrian: Revealing the mysteries of proteins with unsupervised learning." EvolutionaryScale Website, December 4, 2024. [Paper.](https://biohub.ai/papers/esmc.pdf)
### Model Architecture
ESMC is based on the transformer architecture. It features Pre-LN, rotary embeddings, and SwiGLU activations. No biases are used in linear layers or layer norms.
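To make these terms concrete, here is a minimal sketch of a bias-free, Pre-LN feed-forward sub-block with a SwiGLU activation. It is not the ESMC implementation: the class names, parameter names, and dimensions are hypothetical, and rotary embeddings (which apply to attention) are left out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    """Bias-free SwiGLU feed-forward: W_down(SiLU(W_gate x) * W_up x)."""

    def __init__(self, dim: int, hidden_dim: int) -> None:
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class PreLNFeedForward(nn.Module):
    """Pre-LN residual wrapper: x + FFN(LayerNorm(x)), with a scale-only norm."""

    def __init__(self, dim: int, hidden_dim: int) -> None:
        super().__init__()
        self.norm_weight = nn.Parameter(torch.ones(dim))  # scale only, no bias
        self.ffn = SwiGLU(dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize *before* the sub-layer, then add the residual.
        normed = F.layer_norm(x, (x.shape[-1],), weight=self.norm_weight)
        return x + self.ffn(normed)


block = PreLNFeedForward(dim=16, hidden_dim=32)  # hypothetical sizes
x = torch.randn(2, 5, 16)
y = block(x)
print(y.shape)
```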
### Parameters
ESMC was trained at multiple scales:
| Model | Parameters | Layers | Training FLOPs |
| :---- | ----: | ----: | ----: |
| **ESMC-300M** | 300M | 30 | 1×10²² |
| **ESMC-600M** | 600M | 36 | 2×10²² |
| **ESMC-6B** | 6B | 80 | 2×10²³ |
![][pal]
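One rough way to read the training budgets, taking the FLOPs figures as powers of ten (1×10²², 2×10²², and 2×10²³, matching the 2e23 quoted for the 6B model): the sketch below assumes the widely used rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per token. That rule is an assumption of this example, not a figure from the ESMC authors, and the table values are rounded, so treat the output as an order-of-magnitude estimate.

```python
# Rule-of-thumb assumption: training FLOPs ≈ 6 * parameters * tokens.
FLOPS_PER_PARAM_PER_TOKEN = 6

# Nominal parameter counts and training FLOPs from the table above.
models = {
    "ESMC-300M": {"params": 300e6, "flops": 1e22},
    "ESMC-600M": {"params": 600e6, "flops": 2e22},
    "ESMC-6B": {"params": 6e9, "flops": 2e23},
}


def implied_tokens(params: float, flops: float) -> float:
    """Tokens implied by a FLOPs budget under the 6 * N * D approximation."""
    return flops / (FLOPS_PER_PARAM_PER_TOKEN * params)


estimates = {name: implied_tokens(**spec) for name, spec in models.items()}
for name, tokens in estimates.items():
    tokens_per_param = tokens / models[name]["params"]
    print(f"{name}: ~{tokens:.2e} tokens, ~{tokens_per_param:,.0f} tokens per parameter")
```

Under this approximation the three budgets imply roughly the same number of training tokens, so the smaller models see many more tokens per parameter, which is consistent with describing them as overtrained.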
### Model Variants
| Model Variant | Description | URL |
| :---- | :---- | :---- |
| ESMC 300M | Smallest variant, publicly released. | [https://huggingface.co/Biohub/ESMC-300M](https://huggingface.co/Biohub/ESMC-300M) |
| ESMC 600M | Medium variant, publicly released. | [https://huggingface.co/Biohub/ESMC-600M](https://huggingface.co/Biohub/ESMC-600M) |
| ESMC 6B | Large variant, available via API. | [https://huggingface.co/Biohub/ESMC-6B](https://huggingface.co/Biohub/ESMC-6B) |
### System Requirements
- Compute Requirements: GPU
- PyTorch environment with GPU support recommended.
- Recommended optional libraries: `transformer_engine`, `xformers`
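As a rough guide to how much GPU memory the weights alone need, the following back-of-the-envelope sketch multiplies the nominal parameter counts by the bytes per parameter for each dtype. The parameter counts are the rounded figures from this card, and the estimate ignores activations, attention buffers, and framework overhead, so real usage will be higher.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

# Nominal (rounded) parameter counts from this model card.
PARAM_COUNTS = {"ESMC-300M": 300e6, "ESMC-600M": 600e6, "ESMC-6B": 6e9}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, num_params in PARAM_COUNTS.items():
    fp32 = weight_memory_gb(num_params, "float32")
    bf16 = weight_memory_gb(num_params, "bfloat16")
    print(f"{name}: ~{fp32:.1f} GB in float32, ~{bf16:.1f} GB in bfloat16")
```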
## Training Data
ESMC was trained on protein sequences from UniRef, MGnify, and the Joint Genome Institute (JGI). Sequence data was clustered at 70% sequence identity, resulting in 83M, 372M, and 2B clusters for UniRef, MGnify, and JGI, respectively.
### Training Procedure
Training was conducted in two stages:
- Stage 1: For the first 1 million steps, the model used a context length of 512, with metagenomic data constituting 64% of the training dataset.
- Stage 2: In the final 500,000 steps, the context length was increased to 2048, and the proportion of metagenomic data was reduced to 37.5%.
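The two stages above can be written down as a small schedule table. This is only a sketch for reasoning about the schedule: the `Stage` dataclass and `stage_for_step` helper are hypothetical names, not part of the ESMC training code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    num_steps: int
    context_length: int
    metagenomic_fraction: float


# Values taken from the two-stage description above.
SCHEDULE = (
    Stage("stage 1", 1_000_000, 512, 0.64),
    Stage("stage 2", 500_000, 2048, 0.375),
)


def stage_for_step(step: int) -> Stage:
    """Return the stage that a 0-indexed training step falls into."""
    if step < 0:
        raise ValueError("step must be non-negative")
    boundary = 0
    for stage in SCHEDULE:
        boundary += stage.num_steps
        if step < boundary:
            return stage
    raise ValueError(f"step {step} is past the end of training")


total_steps = sum(stage.num_steps for stage in SCHEDULE)
print(total_steps, stage_for_step(1_200_000).context_length)
```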
## Performance Metrics
Performance metrics are detailed on our [blog announcing ESMC](http://biohub.ai/esmc).
## Usage
### Flash Attention
Instead of scaled dot product attention (sdpa) you can use a flash attention backend. This requires running the model in bfloat16.
```py
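import torch
from transformers import AutoModelForMaskedLM

# Flash attention requires bfloat16 weights and activations.
model = (
    AutoModelForMaskedLM.from_pretrained(
        "biohub/ESMC-6B",
        dtype=torch.bfloat16,
        device_map="auto",
        attn_implementation="flash_attention_2",
    )
    .to(torch.bfloat16)
    .eval()
)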
```
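If the same script has to run on machines with and without flash attention installed, one option is to pick the backend at runtime. The sketch below only checks whether the `flash_attn` package is importable; the helper name is hypothetical, and an importable package does not guarantee that the local GPU supports it.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if the flash_attn package is importable, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The returned string can then be passed as `attn_implementation=` to `from_pretrained`, together with `dtype=torch.bfloat16` when the flash attention backend is selected.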
### Sparse Autoencoder (SAE)
To get interpretable features from ESMC 6B hidden states and per-layer residual updates, you can choose from our pretrained SAEs. We provide the following three collections:
* [ESMC SAEs for hidden states (all layers)](https://huggingface.co/collections/biohub/esmc-saes-for-hidden-states-all-layers)
* [ESMC SAEs for one layer (different sparsity/codebook size)](https://huggingface.co/collections/biohub/esmc-saes-for-one-layer-different-sparsity-codebook-size)
* [ESMC SAEs for MLP outputs (all layers)](https://huggingface.co/collections/biohub/esmc-saes-for-mlp-outputs-all-layers)
```py
import torch
from transformers import AutoModel, AutoModelForMaskedLM, AutoTokenizer
GFP = "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"
model = AutoModelForMaskedLM.from_pretrained("biohub/ESMC-6B", device_map="auto").eval()
tokenizer = AutoTokenizer.from_pretrained("biohub/ESMC-6B")
# Load one pretrained SAE and attach it to the language model.
sae_models = []
sae = AutoModel.from_pretrained(
    "biohub/ESMC-6B-sae-sweep-layer60-k64-codebook16384", device_map="auto"
)
sae_models.append(sae)
model.add_sae_models(sae_models)
inputs = tokenizer(GFP, return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.inference_mode():
    output = model(**inputs)
print(f"logits shape: {tuple(output.logits.shape)}")
print(f"num SAE outputs: {len(output.sae_outputs)}")
for i, sae_out in enumerate(output.sae_outputs):
    print(f" SAE[{i}]: {type(sae_out).__name__}")
```
### Masked Language Modeling
ESMC can predict masked amino acids and compute the corresponding loss:
```py
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
GFP = "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"
masked_GFP = "<mask>SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"
model = AutoModelForMaskedLM.from_pretrained("biohub/ESMC-6B", device_map="auto").eval()
tokenizer = AutoTokenizer.from_pretrained("biohub/ESMC-6B")
inputs = tokenizer(masked_GFP, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
labels = tokenizer(GFP, return_tensors="pt")["input_ids"].to(model.device)
# Only the masked positions contribute to the loss; everything else gets the
# ``-100`` ignore-index that ``CrossEntropyLoss`` skips.
labels = torch.where(inputs["input_ids"] == tokenizer.mask_token_id, labels, -100)
with torch.inference_mode():
    output = model(**inputs, labels=labels)
print(f"Loss: {output.loss.item():.6f}")
```
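To interpret the printed loss, it can help to convert it to a perplexity and compare it with a uniform guess. The sketch below assumes the loss is a mean cross-entropy in nats (the PyTorch `CrossEntropyLoss` default). The baseline of 20 canonical amino acids is only a reference point, since the tokenizer vocabulary also contains other tokens, and `example_loss` is a made-up number, not a measured ESMC result.

```python
import math

NUM_CANONICAL_AMINO_ACIDS = 20


def perplexity(mean_ce_loss: float) -> float:
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(mean_ce_loss)


# Loss of a model that guesses uniformly among the 20 canonical amino acids.
uniform_loss = math.log(NUM_CANONICAL_AMINO_ACIDS)

example_loss = 1.5  # hypothetical value for illustration only
print(f"uniform baseline loss: {uniform_loss:.3f}")
print(f"example perplexity: {perplexity(example_loss):.2f}")
```

A loss below the uniform baseline means the model is, on average, narrowing the masked position down to fewer than 20 equally likely residues.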
### Fine-tuning with peft
```py
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("biohub/ESMC-6B", device_map="auto")
# Low-rank adapters on the attention and feed-forward projections.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.01,
    target_modules=["layernorm_qkv.1", "out_proj", "ffn.1", "ffn.3"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```
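To get a feel for why LoRA trains so few parameters: for each adapted linear layer with a frozen weight of shape `(d_out, d_in)`, LoRA adds two low-rank matrices of shapes `(r, d_in)` and `(d_out, r)`. The sketch below counts those added parameters for one layer. The layer size of 1024 is a hypothetical value chosen for illustration, not an ESMC dimension.

```python
def lora_added_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters LoRA adds next to one frozen (d_out x d_in) weight: A (r x d_in) + B (d_out x r)."""
    return r * (d_in + d_out)


# Hypothetical layer shape for illustration; r matches the config above.
d_in, d_out, r = 1024, 1024, 8
added = lora_added_params(d_in, d_out, r)
frozen = d_in * d_out
print(f"added: {added}, frozen: {frozen}, ratio: {added / frozen:.4f}")
```

`model.print_trainable_parameters()` reports the actual totals for the loaded model.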
### Attention Maps
To extract attention maps, pass `output_attentions=True`. Note: this is incompatible with `attn_implementation="flash_attention_2"`.
```py
output = model(**inputs, output_attentions=True)
# output.attentions: tuple of (batch, n_heads, seq_len, seq_len) tensors, one per layer
```
`output_attentions=True` triggers a manual, unoptimized attention path to extract the attention maps, which will reduce inference speed.
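Once the attention maps are available, a simple post-processing step is to average one layer over heads and symmetrize the result. The sketch below uses random tensors with the shape described above in place of real model output; the helper name and the toy sizes are assumptions for illustration.

```python
import torch


def head_averaged_map(attentions: tuple, layer: int = -1) -> torch.Tensor:
    """Average one layer's attention over heads, then symmetrize it.

    attentions: tuple of (batch, n_heads, seq_len, seq_len) tensors, one per layer.
    Returns a (batch, seq_len, seq_len) tensor.
    """
    attn = attentions[layer].mean(dim=1)           # average over heads
    return 0.5 * (attn + attn.transpose(-1, -2))   # make the map symmetric


# Random stand-in for `output.attentions`: 3 layers, batch 1, 4 heads, 6 tokens.
fake_attentions = tuple(
    torch.softmax(torch.randn(1, 4, 6, 6), dim=-1) for _ in range(3)
)
attn_map = head_averaged_map(fake_attentions)
print(attn_map.shape)
```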
### Other Usage
You can access the base model without the pretrained LM head:
```py
import torch
from transformers import AutoModel, AutoTokenizer
GFP = "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"
model = AutoModel.from_pretrained("biohub/ESMC-6B", device_map="auto").eval()
tokenizer = AutoTokenizer.from_pretrained("biohub/ESMC-6B")
inputs = tokenizer(GFP, return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.inference_mode():
    output = model(**inputs)
print(f"last_hidden_state shape: {tuple(output.last_hidden_state.shape)}")
```
Or use ESMC for Token Classification:
```py
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer
GFP = "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"
model = AutoModelForTokenClassification.from_pretrained(
    "biohub/ESMC-6B", device_map="auto"
).eval()
tokenizer = AutoTokenizer.from_pretrained("biohub/ESMC-6B")
inputs = tokenizer(GFP, return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.inference_mode():
    output = model(**inputs)
# Map the highest-scoring class id at each position to its label name.
predicted_token_class_ids = output.logits.argmax(-1)
predicted_tokens_classes = [
    model.config.id2label[t.item()] for t in predicted_token_class_ids[0]
]
print(f"logits shape: {tuple(output.logits.shape)}")
print(f"first 8 predicted classes: {predicted_tokens_classes[:8]}")
```
Or for Sequence Classification:
```py
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
GFP = "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"
model = AutoModelForSequenceClassification.from_pretrained(
    "biohub/ESMC-6B", device_map="auto", num_labels=2
).eval()
tokenizer = AutoTokenizer.from_pretrained("biohub/ESMC-6B")
inputs = tokenizer(GFP, return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.inference_mode():
    output = model(**inputs)
print(f"logits shape: {tuple(output.logits.shape)}")
```
For Token or Sequence Classification, the classifier head is not pretrained but instead meant to be fine-tuned for your downstream task.
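One simple starting point for such fine-tuning is to freeze the pretrained backbone and train only the new head. The sketch below shows the pattern on a toy module so that it runs without downloading weights. The `backbone` and `classifier` attribute names are assumptions for illustration; inspect `model.named_parameters()` on the real model to find the actual head prefix.

```python
import torch.nn as nn


class ToyClassifier(nn.Module):
    """Stand-in for a pretrained encoder plus a randomly initialized head."""

    def __init__(self) -> None:
        super().__init__()
        self.backbone = nn.Linear(16, 16, bias=False)  # plays the role of the encoder
        self.classifier = nn.Linear(16, 2)             # plays the role of the new head


def freeze_all_but(model: nn.Module, trainable_prefix: str) -> list:
    """Disable gradients for every parameter whose name lacks the given prefix."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable_prefix)
        if param.requires_grad:
            trainable.append(name)
    return trainable


toy = ToyClassifier()
trainable_names = freeze_all_but(toy, "classifier")
print(trainable_names)
```

Training only the head is cheap but may underperform LoRA or full fine-tuning on some tasks; it is shown here as one option, not a recommendation from the authors.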
## Frontier Safety
Biohub has established a safety team to assess the benefits and potential risks of our models and tools prior to release, and develop mitigations where necessary. Informed by our risk assessments, we are releasing the source code and model weights for ESMC 6B, ESMFold2, and ESMC SAEs. We are also releasing our ESM Atlas dataset and binder design system openly.
Prior to release, we conducted evaluations to inform our understanding of capability uplift for specific misuse-relevant functional tasks. The full details of these evaluations are available in our corresponding paper appendix.
[Biohub.ai](http://Biohub.ai) Platform: We implement guardrails that detect and restrict the use of keywords and sequences corresponding to controlled pathogens and toxins on our freely accessible platform. For further details regarding these guardrails, please refer to our Biohub platform Resources page.
## Biases and Limitations
### Potential Biases
- **Dataset bias:** Over- or under-representation of taxa, protein families, or ecological niches in public sequence and structure databases influences generalization and can bias outputs. This is partially mitigated by clustering-based, nonredundant sampling.
### Limitations
- **Context window:** ESMC has a context window limit of 2048 tokens.
- **Reliance on in-silico metrics:** Computational metrics do not replace wet-lab validation.
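For proteins longer than the context window, one workaround is to split the sequence into overlapping windows and run each window separately. The sketch below assumes one token per residue and two special tokens added by the tokenizer; both are assumptions to verify against the actual tokenizer. The helper name and the overlap of 256 residues are hypothetical choices.

```python
def window_sequence(
    sequence: str,
    max_tokens: int = 2048,
    num_special_tokens: int = 2,  # assumption: one start and one end token
    overlap: int = 256,           # hypothetical overlap between windows
) -> list:
    """Split a sequence into overlapping windows that fit the context limit.

    Returns a list of (start_index, subsequence) pairs.
    """
    window = max_tokens - num_special_tokens
    if not 0 <= overlap < window:
        raise ValueError("overlap must be in [0, window)")
    if len(sequence) <= window:
        return [(0, sequence)]
    stride = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, sequence[start:start + window]))
        if start + window >= len(sequence):
            break  # the last window reached the end of the sequence
        start += stride
    return chunks


long_protein = "M" + "ACDEFGHIKL" * 500  # synthetic 5001-residue sequence
chunks = window_sequence(long_protein)
print([(start, len(sub)) for start, sub in chunks])
```

How to merge per-window outputs in the overlapping regions (for example by averaging) depends on the downstream task.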
### Out-of-Scope or Unauthorized Use Cases
Do not use the model for the following purposes:
- Any use that is prohibited by the [Acceptable Use Policy](https://biohub.org/acceptable-use-policy/).
### Caveats and Recommendations
- Review and validate outputs generated by the model.
- We are committed to advancing the responsible development and use of artificial intelligence.
- Should you have any security or privacy issues or questions related to the services, please reach out to our team at [support@biohub.org](mailto:support@biohub.org).
[pal]: images/contact_pal.png