Commit b3abd7b · Parent(s): 3e19aea
Update README.md
README.md CHANGED
@@ -42,6 +42,11 @@ import torch
 tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-v2-100m-multi-species", trust_remote_code=True)
 model = AutoModelForMaskedLM.from_pretrained("InstaDeepAI/nucleotide-transformer-v2-100m-multi-species", trust_remote_code=True)
 
+# Choose the length to which the input sequences are padded. By default, the
+# model max length is chosen, but feel free to decrease it as the time taken to
+# obtain the embeddings increases significantly with it.
+max_length = tokenizer.model_max_length
+
 # Create a dummy dna sequence and tokenize it
 sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
 tokens_ids = tokenizer.batch_encode_plus(sequences, return_tensors="pt", padding="max_length", max_length = max_length)["input_ids"]
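The added comment says that embedding time grows significantly with the padded length. As a rough illustration of why, here is a minimal plain-Python sketch of right-padding a batch either to its longest sequence or to a fixed length. It does not call the tokenizer: `pad_batch`, `PAD_ID`, the token ids, and the 1000-token limit are all hypothetical stand-ins, not values taken from the model.

```python
# Illustrative sketch only: a plain-Python stand-in for tokenizer padding.
# PAD_ID and the token ids below are made-up values, not the model's vocabulary.
PAD_ID = 0


def pad_batch(batch, max_length=None, pad_id=PAD_ID):
    """Right-pad each list of token ids.

    max_length=None pads to the longest sequence in the batch;
    an integer pads every sequence to exactly that length.
    """
    target = max(len(ids) for ids in batch) if max_length is None else max_length
    if any(len(ids) > target for ids in batch):
        raise ValueError("a sequence is longer than the requested max_length")
    return [ids + [pad_id] * (target - len(ids)) for ids in batch]


# Two short, made-up tokenized sequences of different lengths.
batch = [[3, 10, 11, 12], [3, 20, 21, 22, 23, 24, 25]]

hypothetical_model_max = 1000  # assumed limit, chosen only for illustration

short = pad_batch(batch)                          # padded to the longest (7)
full = pad_batch(batch, hypothetical_model_max)   # padded to the assumed limit

# Self-attention compares every position with every other one, so the
# work per sequence grows roughly with the square of the padded length.
ratio = (len(full[0]) ** 2) / (len(short[0]) ** 2)
print(len(short[0]), len(full[0]), round(ratio))
```

Under these assumptions, padding short inputs all the way to the limit makes the attention step do far more work on pad tokens than on real ones, which is the motivation for lowering `max_length` when the sequences are known to be short.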