tags:
- chemistry
- molecule
- drug
# Model Card for Roberta Zinc 480m

### Model Description

`roberta_zinc_480m` is a ~102m parameter RoBERTa-style masked language model trained on ~480m SMILES
strings from the [ZINC database](https://zinc.docking.org/). This model is useful for
generating embeddings from SMILES strings.

- **Developed by:** Karl Heyer
- **License:** MIT

### Direct Use

The examples below show how to generate embeddings. Note that input SMILES strings should be canonicalized.

With the Transformers library:
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("entropy/roberta_zinc_480m")
roberta_zinc = AutoModel.from_pretrained("entropy/roberta_zinc_480m",
                                         add_pooling_layer=False)  # model was not trained with a pooler

# smiles should be canonicalized
smiles = [
    "Brc1cc2c(NCc3ccccc3)ncnc2s1",
    "Brc1cc2c(NCc3ccccn3)ncnc2s1",
    "Brc1cc2c(NCc3cccs3)ncnc2s1",
    "Brc1cc2c(NCc3ccncc3)ncnc2s1",
    "Brc1cc2c(Nc3ccccc3)ncnc2s1"
]

batch = tokenizer(smiles, return_tensors='pt', padding=True, pad_to_multiple_of=8)

# mean pooling over non-padding tokens
outputs = roberta_zinc(**batch, output_hidden_states=True)
full_embeddings = outputs.hidden_states[-1]  # token embeddings from the final layer
mask = batch['attention_mask']
embeddings = ((full_embeddings * mask.unsqueeze(-1)).sum(1) / mask.sum(-1).unsqueeze(-1))
```
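The pooling line above averages token embeddings over real tokens only, so the padding added by `pad_to_multiple_of=8` does not dilute the result. As a dependency-free illustration of that arithmetic, here is a toy sketch with made-up 2-dimensional vectors (not actual model outputs); `masked_mean_pool` is a hypothetical helper name, not part of any library:

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the vectors at positions where attention_mask is 1."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:  # skip padding positions
            kept += 1
            for i, value in enumerate(vector):
                totals[i] += value
    return [total / kept for total in totals]


# Toy sequence: two real tokens followed by two padding positions.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]
attention_mask = [1, 1, 0, 0]

pooled = masked_mean_pool(token_vectors, attention_mask)
print(pooled)  # [2.0, 3.0]
```

The padding vectors (`[9.0, 9.0]`) have no effect on the pooled value, which is the same behavior the tensor expression achieves by multiplying by the mask and dividing by the mask sum.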
With Sentence Transformers:

```python
from sentence_transformers import models, SentenceTransformer

transformer = models.Transformer("entropy/roberta_zinc_480m",
                                 max_seq_length=256,
                                 model_args={"add_pooling_layer": False})

pooling = models.Pooling(transformer.get_word_embedding_dimension(),
                         pooling_mode="mean")

model = SentenceTransformer(modules=[transformer, pooling])

# smiles should be canonicalized
smiles = [
    "Brc1cc2c(NCc3ccccc3)ncnc2s1",
    "Brc1cc2c(NCc3ccccn3)ncnc2s1",
    "Brc1cc2c(NCc3cccs3)ncnc2s1",
    "Brc1cc2c(NCc3ccncc3)ncnc2s1",
    "Brc1cc2c(Nc3ccccc3)ncnc2s1"
]

embeddings = model.encode(smiles, convert_to_tensor=True)
```
### Training Procedure

#### Preprocessing

~480m SMILES strings were randomly sampled from the [ZINC database](https://zinc.docking.org/),
weighted by tranche size (i.e., more SMILES were sampled from larger tranches). SMILES were
canonicalized, then used to train the tokenizer.
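Sampling weighted by tranche size can be sketched with the standard library. The tranche names and sizes below are invented for illustration, the weights are assumed to be exactly proportional to tranche size, and `sample_tranches` is a hypothetical helper; the sampling code actually used for this model is not described in this card:

```python
import random
from collections import Counter

# Hypothetical tranche sizes (number of SMILES per tranche); not real ZINC figures.
tranche_sizes = {"tranche_a": 1_000, "tranche_b": 10_000, "tranche_c": 100_000}


def sample_tranches(sizes, n_samples, seed=0):
    """Pick a tranche for each sample with probability proportional to its size."""
    rng = random.Random(seed)
    names = list(sizes)
    weights = [sizes[name] for name in names]
    return rng.choices(names, weights=weights, k=n_samples)


counts = Counter(sample_tranches(tranche_sizes, n_samples=10_000))
# Larger tranches contribute more samples.
print(counts.most_common())
```

Under these assumptions, drawing one molecule uniformly from each chosen tranche is equivalent to sampling uniformly from the pooled set of all molecules.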
#### Training Hyperparameters

The model was trained with cross-entropy loss for 150,000 iterations with a batch size of 4096.
The model achieved a validation loss of ~0.122.
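As a rough sanity check on these numbers, the iteration count and batch size imply how many SMILES the model saw during training. This sketch assumes every iteration consumed one full batch of 4096 SMILES and treats the dataset size as exactly 480m, both of which are approximations:

```python
iterations = 150_000
batch_size = 4096
dataset_size = 480_000_000  # "~480m" in the card, treated as exact here (assumption)

samples_seen = iterations * batch_size
approx_epochs = samples_seen / dataset_size

print(f"{samples_seen:,} SMILES seen")              # 614,400,000 SMILES seen
print(f"{approx_epochs:.2f} passes over the data")  # 1.28 passes over the data
```

Under those assumptions, training covered a little more than one pass over the sampled data.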
### Downstream Models

#### Decoder

There is a [decoder model](https://huggingface.co/entropy/roberta_zinc_decoder) trained to reconstruct
inputs from embeddings generated with this model.

#### Compression Encoder

There is a [compression encoder model](https://huggingface.co/entropy/roberta_zinc_compression_encoder)
trained to compress embeddings generated by this model from the native size of 768 to
smaller sizes (512, 256, 128, 64, 32) while preserving cosine similarity between embeddings.
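"Preserving cosine similarity" means that two molecules whose 768-dimensional embeddings point in similar directions should still do so after compression. The sketch below only illustrates the metric on toy vectors. It does not call the compression encoder, whose interface is not described here, and `cosine_similarity` is a hypothetical helper written for this example:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for two embeddings before and after compression (made-up values).
full_a, full_b = [1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]
small_a, small_b = [1.0, 0.0], [0.5, math.sqrt(0.75)]

before = cosine_similarity(full_a, full_b)
after = cosine_similarity(small_a, small_b)
print(round(before, 3), round(after, 3))  # 0.5 0.5
```

For real embeddings produced as tensors, `sentence_transformers.util.cos_sim(embeddings, embeddings)` returns the full pairwise similarity matrix.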
#### Decomposer

There is an [embedding decomposer model](https://huggingface.co/entropy/roberta_zinc_enamine_decomposer)
trained to "decompose" a roberta-zinc embedding into two building block embeddings from the Enamine
library.
**BibTeX:**

@misc{heyer2023roberta,
  title={Roberta-zinc-480m},
  author={Heyer, Karl},
  year={2023}
}

**APA:**

Heyer, K. (2023). Roberta-zinc-480m.
## Model Card Authors

Karl Heyer

## Model Card Contact

karl@darmatterai.xyz

---
license: mit
---