Instructions to use entropy/roberta_zinc_480m with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use entropy/roberta_zinc_480m with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("fill-mask", model="entropy/roberta_zinc_480m")# Load model directly from transformers import AutoTokenizer, AutoModelForMaskedLM tokenizer = AutoTokenizer.from_pretrained("entropy/roberta_zinc_480m") model = AutoModelForMaskedLM.from_pretrained("entropy/roberta_zinc_480m") - Notebooks
- Google Colab
- Kaggle
metadata
tags:
- chemistry
- molecule
- drug
Roberta Zinc 480m
This is a Roberta style masked language model trained on ~480m SMILES strings from the ZINC database. The model has ~102m parameters and was trained for 150000 iterations with a batch size of 4096 to a validation loss of ~0.122. This model is useful for generating embeddings from SMILES strings.
from transformers import RobertaTokenizerFast, RobertaForMaskedLM, DataCollatorWithPadding
tokenizer = RobertaTokenizerFast.from_pretrained("entropy/roberta_zinc_480m", max_len=128)
model = RobertaForMaskedLM.from_pretrained('entropy/roberta_zinc_480m')
collator = DataCollatorWithPadding(tokenizer, padding=True, return_tensors='pt')
smiles = ['Brc1cc2c(NCc3ccccc3)ncnc2s1',
'Brc1cc2c(NCc3ccccn3)ncnc2s1',
'Brc1cc2c(NCc3cccs3)ncnc2s1',
'Brc1cc2c(NCc3ccncc3)ncnc2s1',
'Brc1cc2c(Nc3ccccc3)ncnc2s1']
inputs = collator(tokenizer(smiles))
outputs = model(**inputs, output_hidden_states=True)
full_embeddings = outputs[1][-1]
mask = inputs['attention_mask']
embeddings = ((full_embeddings * mask.unsqueeze(-1)).sum(1) / mask.sum(-1).unsqueeze(-1))
Decoder
There is also a decoder model trained to reconstruct inputs from embeddings