Nucleotide Transformer v2 50M, fine-tuned on GenomicBenchmarks

This project fully fine-tunes InstaDeepAI/nucleotide-transformer-v2-50m-multi-species for binary DNA sequence classification on two GenomicBenchmarks tasks. All parameters are updated; neither LoRA nor a frozen backbone is used. Evaluation follows a leakage-free train/validation/test protocol, with three seeds on the enhancer task and a single seed on the promoter task.

Weights are not redistributed: this is a card and code release

The base model is licensed CC-BY-NC-SA-4.0 (non-commercial, share-alike), and any fine-tuned derivative inherits those terms. To respect that license, the trained checkpoints are not hosted here. The full training and evaluation code is open source and the run is reproducible, with every seed, split, and hyperparameter fixed.

Code, reproduction steps, figures, and the full model card: github.com/ankurgenomics/genome-ft

Running the documented commands regenerates the checkpoints described below.

Results

All numbers are on the held-out test set, which was evaluated exactly once, using the checkpoint with the highest validation MCC. The "Base" rows report the pretrained backbone with an untrained classification head, measured under the identical pipeline; they serve as a near-chance floor (MCC close to 0) for the fine-tuned rows.
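As a minimal sketch of the metric and the selection rule described above, the snippet below computes MCC from binary confusion counts and picks the epoch with the highest validation MCC. The function names and the validation curve are hypothetical illustrations, not taken from the repository, and returning 0.0 for an undefined MCC is an assumed convention.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when the denominator is zero
    # (e.g. the model predicts a single class for every example).
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def select_best_epoch(val_mcc_by_epoch: dict) -> int:
    """Return the epoch with the highest validation MCC (earliest epoch wins ties)."""
    return min(val_mcc_by_epoch, key=lambda e: (-val_mcc_by_epoch[e], e))


# Hypothetical validation curve for illustration only; these are not measured values.
hypothetical_val_mcc = {1: 0.46, 2: 0.47, 3: 0.45, 4: 0.44}
best_epoch = select_best_epoch(hypothetical_val_mcc)
print(f"best epoch by validation MCC: {best_epoch}")
```

Because selection uses only validation scores, the test set can be left untouched until the single final evaluation.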

Enhancers: `human_enhancers_cohn` (6,948 test examples, 3 seeds: 42, 0, 123)

Model	Accuracy	F1	MCC
Base (pretrained backbone, untrained head)	0.499	0.021	−0.009
Fine-tuned (this work)	0.735 ± 0.004	0.745 ± 0.016	0.478 ± 0.003

Per-seed test MCC: 0.481, 0.478, 0.474 (σ = 0.003), indicating a stable result across initialisations.
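The aggregate can be re-derived from the three per-seed values quoted above. The card does not say whether σ is the population or the sample standard deviation, so this sketch computes both; that choice is an open assumption. The population form rounds to the reported 0.003, while the sample form is slightly larger.

```python
import statistics

# Per-seed test MCC values as reported in the text above.
per_seed_mcc = [0.481, 0.478, 0.474]

mean_mcc = statistics.fmean(per_seed_mcc)
population_std = statistics.pstdev(per_seed_mcc)  # divides by n
sample_std = statistics.stdev(per_seed_mcc)       # divides by n - 1

print(f"mean={mean_mcc:.3f}  population_std={population_std:.4f}  sample_std={sample_std:.4f}")
```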

Promoters: `human_nontata_promoters` (9,034 test examples, seed 42)

Model	Accuracy	F1	MCC
Base (pretrained backbone, untrained head)	0.451	0.064	−0.044
Fine-tuned (this work)	0.872	0.878	0.747

Where it lands (published reference numbers, accuracy on enhancers)

Model	Accuracy	Note
GB-CNN (Grešová et al. 2023)	0.69	published reference
DNABERT (Ji et al. 2021)	0.706	published reference
This work (NT-v2 50M full fine-tune)	0.735	measured here, 3 seeds
HyenaDNA tiny-1k (Nguyen et al. 2023)	0.74	published reference
NT-v2 500M (Dalla-Torre et al. 2023)	0.776	published reference
DNABERT-2 (Zhou et al. 2023)	0.785	published reference

On enhancers, the 50M model scores above the published CNN and DNABERT baselines, just below HyenaDNA tiny-1k, and clearly below NT-v2 500M and DNABERT-2. It does not outperform its 500M-parameter sibling, and is not intended to.

How it was trained

Method: full fine-tuning of all 53.8M parameters with a 2-layer classification head
Optimizer: AdamW, learning rate 1e-5, weight decay 0.01
Schedule: linear warmup (300 steps) with cosine decay, gradient clipping (max-norm 1.0)
Augmentation: reverse-complement
Protocol: 15% of training data held out for validation; checkpoint selected by validation MCC; test set evaluated once; 3 seeds for the primary task
Best epoch: 1–2, after which validation MCC declines; the low learning rate and validation-based selection limit overfitting beyond the early epochs
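The recipe above can be sketched with three small standard-library helpers: the warmup-plus-cosine learning-rate curve, reverse-complement augmentation, and a seeded 15% validation split. This is an illustration under stated assumptions, not the repository's code. The card does not state whether the cosine decays to zero, how many total steps were run, how often augmentation is applied, or whether the split is stratified, so `total_steps`, the zero floor, and the plain random split are all hypothetical choices.

```python
import math
import random


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 1e-5, warmup_steps: int = 300) -> float:
    """Linear warmup to peak_lr, then cosine decay (assumed to reach zero at total_steps)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Complement table for DNA; N (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def train_val_split(n_examples: int, val_fraction: float = 0.15, seed: int = 42):
    """Seeded random split of training indices into (train, validation).

    Only training indices are split, so the test set is never involved.
    A plain (non-stratified) shuffle is an assumption of this sketch.
    """
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_examples * val_fraction))
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = train_val_split(100)
print(len(train_idx), len(val_idx), reverse_complement("AAGC"))
```

Fixing the seed in `train_val_split` is what makes the split reproducible from run to run.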

Intended use & limitations

Intended use: research and educational demonstration of adapting a genomic foundation model at the weight level on standard DNA classification benchmarks.
Not for: clinical, diagnostic, or any commercial use (the CC-BY-NC-SA-4.0 license forbids commercial use).
Scope: two GenomicBenchmarks tasks only; results reflect those datasets and may not transfer to other genomic tasks, species, or sequence lengths.

License & attribution

Fine-tuned weights (when regenerated from the code; none are hosted here): CC-BY-NC-SA-4.0 (inherited from the base model; non-commercial, share-alike, attribution).
Training/evaluation code: MIT (see the GitHub repository).
Attribution: base model by InstaDeep; benchmark datasets by the GenomicBenchmarks authors (Grešová et al. 2023).

Citation / related work

For the broader research context, see Fesser, Zhang, Li, Zitnik et al., How Post-Training Shapes Biological Reasoning Models (arXiv:2606.16517, 2026). Its Finding 1, that supervised fine-tuning improves in-domain accuracy while out-of-domain performance peaks early and declines, has an in-domain counterpart in this work: validation MCC peaked at epoch 1–2 and then declined, which motivates the use of a low learning rate and validation-based checkpoint selection rather than longer training. This is a single controlled fine-tune, not a comparable research program; the connection is one of shared principle, not scope.

Developed by Ankur Sharma.

Code and full results: github.com/ankurgenomics/genome-ft

This is a personal open-source project, developed independently in a personal capacity. It is not affiliated with, endorsed by, or representative of any current or former employer, and uses only public models and public benchmark datasets. All views and results are the author's own.


