BacHyenaDNA

BacHyenaDNA is a collection of foundation models pretrained on bacterial DNA at single-nucleotide resolution. The models are based on the HyenaDNA [1] model, which was originally trained on the hg38 human genome. See our GitHub repository for more information.

The currently available species are:

BacHyenaDNA-Paeruginosa-32k-d768 overview

BacHyenaDNA-Paeruginosa-32k-d768 uses 10 HyenaDNA block layers, with inner_dim=3072, model_dim=768, and max_seq_len=32768. It was pretrained on complete P. aeruginosa genomes from the RefSeq database (802), all assembled P. aeruginosa genomes from GenBank (18466), and all P. aeruginosa plasmid sequences found in PLSDB (373). Pretraining used next-token prediction with a vocabulary of 4 nucleotides plus special tokens.
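
To make single-nucleotide tokenization and the next-token objective concrete, here is a minimal sketch in plain Python. The vocabulary, the token ids, and the helper names `encode` and `next_token_pairs` are hypothetical and chosen only for illustration; the real mapping is defined by the CharacterTokenizer class in the repository and may differ.

```python
# Hypothetical vocabulary: ids are illustrative, not the model's actual mapping.
SPECIAL_TOKENS = ["[PAD]", "[UNK]"]
NUCLEOTIDES = ["A", "C", "G", "T"]
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + NUCLEOTIDES)}


def encode(sequence: str) -> list:
    """Map each base to one id; characters outside the vocabulary map to [UNK]."""
    unk = VOCAB["[UNK]"]
    return [VOCAB.get(base, unk) for base in sequence.upper()]


def next_token_pairs(ids: list) -> tuple:
    """Return (inputs, targets): targets are the inputs shifted left by one position."""
    return ids[:-1], ids[1:]


ids = encode("ACGTN")                    # one token per nucleotide
inputs, targets = next_token_pairs(ids)  # the model sees inputs[:i+1] and predicts targets[i]
```

Each position in `inputs` is paired with the base that follows it, which is the training signal used in next-token prediction.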

Get model embeddings

This code shows an example of how to get embeddings from the selected model. The required classes can be found in our GitHub repository, BacHyenaDNA.


from huggingface import HyenaDNAPreTrainedModel
from standalone_hyenadna import CharacterTokenizer
import torch

# instantiate pretrained model
pretrained_model_name = 'BacHyenaDNA-Paeruginosa-32k-d768'
max_length = 32768
device = "cuda:0" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is available

model = HyenaDNAPreTrainedModel.from_pretrained(
    'downloaded_models/',
    pretrained_model_name,
)

# create tokenizer
tokenizer = CharacterTokenizer(
    characters=['A', 'C', 'G', 'T', 'N'],
    model_max_length=max_length,
)

# create a sample
sequence = 'ACTG'
tok_seq = tokenizer(sequence)["input_ids"]

# place on device, convert to tensor
tok_seq = torch.LongTensor(tok_seq).unsqueeze(0).to(device)  # unsqueeze for batch dim

# model
model.to(device)
model.eval()

# forward
with torch.inference_mode():
    embeddings = model(tok_seq)

print(embeddings)
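
Bacterial genomes are far longer than the 32768-token context, so a full genome has to be split before it can be embedded. The sketch below shows one simple way to do that with fixed-size windows. The helper `iter_windows` is hypothetical and not part of the BacHyenaDNA repository, and if the tokenizer adds special tokens, a window slightly smaller than max_seq_len may be needed.

```python
from typing import Iterator, Optional, Tuple


def iter_windows(
    sequence: str, window: int = 32768, stride: Optional[int] = None
) -> Iterator[Tuple[int, str]]:
    """Yield (start, chunk) pairs covering `sequence` with chunks of at most `window` bases.

    Hypothetical helper for illustration. With stride == window the chunks
    do not overlap; a smaller stride produces overlapping chunks.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    stride = window if stride is None else stride
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window]")
    start = 0
    while start < len(sequence):
        yield start, sequence[start:start + window]
        if start + window >= len(sequence):
            break  # the current chunk already reaches the end of the sequence
        start += stride


# Toy example: a 100-base sequence split into windows of 32 bases.
genome = "ACGT" * 25
chunks = list(iter_windows(genome, window=32))
```

Each chunk can then be tokenized and passed through the model as in the example above, keeping the start offsets to map embeddings back to genome coordinates.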


GPU requirements (suggested)

Suggested GPUs for pretraining, fine-tuning, and inference: coming soon.

Reference

[1] Nguyen, E., Poli, M., Faizi, M., et al. HyenaDNA: Long-range genomic sequence modeling at single nucleotide resolution. Advances in Neural Information Processing Systems, 2023, vol. 36, pp. 43177-43201.