Commit 3e19aea
1 Parent(s): 9083e74
Update README.md
README.md CHANGED
@@ -43,8 +43,8 @@ tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-v2
 model = AutoModelForMaskedLM.from_pretrained("InstaDeepAI/nucleotide-transformer-v2-100m-multi-species", trust_remote_code=True)
 
 # Create a dummy dna sequence and tokenize it
-sequences = [
-tokens_ids = tokenizer.batch_encode_plus(sequences, return_tensors="pt")["input_ids"]
+sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
+tokens_ids = tokenizer.batch_encode_plus(sequences, return_tensors="pt", padding="max_length", max_length = max_length)["input_ids"]
 
 # Compute the embeddings
 attention_mask = tokens_ids != tokenizer.pad_token_id
@@ -60,8 +60,11 @@ embeddings = torch_outs['hidden_states'][-1].detach().numpy()
 print(f"Embeddings shape: {embeddings.shape}")
 print(f"Embeddings per token: {embeddings}")
 
+# Add embed dimension axis
+attention_mask = torch.unsqueeze(attention_mask, dim=-1)
+
 # Compute mean embeddings per sequence
-mean_sequence_embeddings = torch.sum(attention_mask
+mean_sequence_embeddings = torch.sum(attention_mask*embeddings, axis=-2)/torch.sum(attention_mask, axis=1)
 print(f"Mean sequence embeddings: {mean_sequence_embeddings}")
 ```
 
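The second hunk changes the pooling step so that padded positions no longer contribute to the per-sequence mean. The shape logic can be sketched in isolation as below. This is a minimal illustration in NumPy, not the README's code: the `embeddings` and `attention_mask` arrays are made-up stand-ins for the model's last hidden state and the mask derived from `tokenizer.pad_token_id`, and the sizes (2 sequences, 4 token positions, 3 embedding dimensions) are chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical stand-in for the last hidden state:
# 2 sequences, 4 token positions, 3 embedding dimensions.
embeddings = np.arange(24, dtype=np.float32).reshape(2, 4, 3)

# Hypothetical mask: True marks real tokens, False marks padding.
# The second sequence has two padded positions.
attention_mask = np.array([[True, True, True, True],
                           [True, True, False, False]])

# Add a trailing axis so the mask broadcasts over the embedding
# dimension: (2, 4) -> (2, 4, 1).
mask = attention_mask[..., np.newaxis]

# Sum embeddings over non-padding tokens only, then divide by the
# number of real tokens in each sequence.
summed = (mask * embeddings).sum(axis=-2)  # shape (2, 3)
counts = mask.sum(axis=1)                  # shape (2, 1)
mean_sequence_embeddings = summed / counts

print(mean_sequence_embeddings.shape)  # (2, 3)
```

Without the extra axis, a `(2, 4)` mask cannot be multiplied elementwise with a `(2, 4, 3)` array, which is presumably why the commit adds the `unsqueeze` call before the sum. Dividing by the per-sequence token count rather than the padded length keeps short sequences from being pulled toward zero.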