mhcSFM — Peptide ↔ MHC Class I Binding Specificity Foundation Model

Paper: Vibe Coding Specificity Foundation Models · doi: 10.64898/2026.06.04.730134
All VC-SFM models: huggingface.co/SFM-BIIE-ETHZ
Code: github.com/SFM-BIIE-ETHZ/Vibe-Coding-SFMs


What it does

This SFM learns a joint embedding space for peptides and MHC class I alleles via contrastive learning on binding-affinity data. Given a peptide, the model retrieves the MHC alleles most likely to present it, which gives pan-allele prediction without per-allele fine-tuning. The same embeddings also support the reverse query: given an HLA allele, retrieve candidate peptides.

| Component | Model |
| --- | --- |
| Agent encoder | ESM-2 |
| Target encoder | ESM-2 (34-aa HLA pseudo-sequence) |
| Training data | NetMHCpan 4.1 binding-affinity corpus (38,837 peptides / 105 alleles; 168,710 pairs) |
| Split | Sequence-identity 100% holdout · fold 0 |
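
This card says only that the encoders are trained with contrastive learning; it does not state the exact objective. As a rough illustration of how a joint embedding space is commonly trained, the sketch below computes a symmetric InfoNCE (CLIP-style) loss in NumPy. The choice of loss, the temperature of 0.07, and the names `symmetric_info_nce` and `log_softmax` are assumptions made for illustration, not the paper's configuration.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_info_nce(agent_emb, target_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch where row i of each matrix is a true pair.

    Hypothetical helper: the loss form and temperature are illustrative
    assumptions, not values taken from the paper.
    """
    a = agent_emb / np.linalg.norm(agent_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    logits = (a @ t.T) / temperature          # [batch, batch] scaled cosine similarities
    idx = np.arange(len(a))                   # true pairs sit on the diagonal
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # agent -> target
    loss_t2a = -log_softmax(logits, axis=0)[idx, idx].mean()  # target -> agent
    return 0.5 * (loss_a2t + loss_t2a)

# Placeholder embeddings stand in for encoder outputs.
rng = np.random.default_rng(0)
agents = rng.normal(size=(8, 32))
matched_loss = symmetric_info_nce(agents, agents + 0.05 * rng.normal(size=(8, 32)))
shuffled_loss = symmetric_info_nce(agents, rng.normal(size=(8, 32)))
print("matched pairs :", matched_loss)
print("unrelated pairs:", shuffled_loss)
```

Matched pairs should give a much lower loss than unrelated ones, which is the signal that pulls true peptide/allele pairs together in the shared space.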

Performance — pool-512 retrieval (from the paper)

Evaluation uses pool-512 retrieval: each test pair's true target is placed in a pool of 512 candidates (1 positive + 511 random negatives), the candidates are ranked by cosine similarity, and recall is computed over 100 random trials at the best-validation checkpoint. The random baseline for R@1 is 1/512 ≈ 0.2%. Values are the mean ± SD over folds 0–3 of the 5-fold cross-validation reported in the paper for this SFM; fold 4 is excluded for split degeneracy. The released checkpoint is the fold-0, identity-100 model trained with the identical configuration, data, and split.

| Direction | R@1 (%) | R@5 (%) | R@10 (%) |
| --- | --- | --- | --- |
| peptide → HLA | 65.1 ± 1.9 | 78.4 ± 1.6 | 81.8 ± 2.2 |
| HLA → peptide | 93.3 ± 2.6 | 99.1 ± 1.1 | 99.5 ± 0.6 |

*This released checkpoint is the original fold-0 model; its own pool-512 eval gives 66.1% (peptide→HLA) / 92.1% (HLA→peptide) R@1, matching the paper fold-0 (64.6 / 95.0) within eval noise. On a held-out rare-allele (zero-shot) set, HLA→peptide R@1 = 95.4%.*
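
The pool-512 protocol described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's evaluation script: the function name `pool_recall_at_k`, drawing negatives from the targets of other test pairs, and counting ties in favor of the positive are all choices made here for the example. Random vectors stand in for real embeddings.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def pool_recall_at_k(query_emb, target_emb, pool_size=512, ks=(1, 5, 10),
                     n_trials=100, seed=0):
    """Estimate recall@k (%) when each true target competes in a random pool.

    Row i of query_emb and row i of target_emb form a true pair. Negatives
    are sampled from the other rows of target_emb (an assumption).
    """
    rng = np.random.default_rng(seed)
    q = l2_normalize(np.asarray(query_emb, dtype=np.float64))
    t = l2_normalize(np.asarray(target_emb, dtype=np.float64))
    n = q.shape[0]
    if n < pool_size:
        raise ValueError("need at least pool_size pairs to draw negatives")
    hits = {k: 0 for k in ks}
    for _ in range(n_trials):
        for i in range(n):
            # Draw pool_size - 1 distinct negatives, skipping index i.
            neg = rng.choice(n - 1, size=pool_size - 1, replace=False)
            neg[neg >= i] += 1
            pos_score = q[i] @ t[i]
            neg_scores = t[neg] @ q[i]
            # Rank of the positive; ties count in its favor (an assumption).
            rank = 1 + int(np.sum(neg_scores > pos_score))
            for k in ks:
                hits[k] += int(rank <= k)
    total = n_trials * n
    return {k: 100.0 * hits[k] / total for k in ks}

# Placeholder data: identical embeddings versus unrelated embeddings.
rng = np.random.default_rng(1)
n_pairs, dim = 600, 32
queries = rng.normal(size=(n_pairs, dim))
aligned = pool_recall_at_k(queries, queries.copy(), n_trials=2)
unrelated = pool_recall_at_k(queries, rng.normal(size=(n_pairs, dim)), n_trials=2)
print("aligned  :", aligned)
print("unrelated:", unrelated)
print("chance R@1 (%):", 100.0 / 512)
```

With unrelated embeddings the recall should hover near the chance level of k/512, which is where the 0.2% R@1 baseline comes from.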


Quick start

from huggingface_hub import hf_hub_download
import torch, torch.nn.functional as F

ckpt_path = hf_hub_download("SFM-BIIE-ETHZ/mhcSFM_VC-SFM", "model.pth")

# Load with the Vibe-Coding-SFMs codebase
# (https://github.com/SFM-BIIE-ETHZ/Vibe-Coding-SFMs)
from calm.encoder.model import CALMEncoder
model = CALMEncoder.from_pretrained(ckpt_path)
model.eval()

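# Embed one peptide (agent) and one HLA pseudo-sequence (target).
# Gradients are not needed for inference.
with torch.no_grad():
    agent_emb  = model.encode_query("GILGFVFTL")                       # peptide
    target_emb = model.encode_target("YFAMYQENVAQTDVDTLYIIYRDAQTFQV")  # HLA pseudo-seq

# Cosine similarity in the joint embedding space; higher means a more likely match.
score = F.cosine_similarity(agent_emb, target_emb, dim=-1)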
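
To go from a single score to retrieval, the same cosine similarity can be computed against a panel of candidate alleles and sorted. The sketch below shows that ranking step in NumPy with placeholder vectors; `rank_targets` is a hypothetical helper, and the allele names label random embeddings purely for illustration. With the real model, the embeddings would presumably come from `encode_query` and `encode_target`, converted with `.detach().cpu().numpy()` if they are returned as torch tensors (an assumption about the codebase).

```python
import numpy as np

def rank_targets(query_emb, target_embs, target_names, top_k=3):
    """Return the top_k (name, cosine similarity) pairs for one query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    t = target_embs / np.linalg.norm(target_embs, axis=1, keepdims=True)
    scores = t @ q                        # cosine similarity to every candidate
    order = np.argsort(-scores)[:top_k]   # indices of the highest scores first
    return [(target_names[i], float(scores[i])) for i in order]

# Placeholder embeddings; the names only label rows and carry no biology here.
rng = np.random.default_rng(0)
allele_names = ["HLA-A*02:01", "HLA-B*07:02", "HLA-C*07:01", "HLA-A*24:02"]
allele_embs = rng.normal(size=(4, 16))
# Build a query that lies close to the first allele embedding.
peptide_emb = allele_embs[0] + 0.1 * rng.normal(size=16)

ranking = rank_targets(peptide_emb, allele_embs, allele_names, top_k=3)
for name, score in ranking:
    print(f"{name}\t{score:.3f}")
```

Precomputing and caching the allele embeddings once keeps each peptide query to a single matrix-vector product.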

Files in this repo

| File | Description |
| --- | --- |
| model.pth | Released checkpoint · fold 0 · identity-100 split |
| results_train_val_test.csv | Per-epoch training/validation/test logs (training-time batch metrics, not the pool-512 numbers above) |

Citation

@article{reddy2026vcsfm,
  title   = {Vibe Coding Specificity Foundation Models},
  author  = {Reddy, Sai T.},
  journal = {bioRxiv},
  year    = {2026},
  doi     = {10.64898/2026.06.04.730134}
}

License

Released under the SFM Research Preview License v1.0-preview (see LICENSE.md). Free for research use — academic, non-profit, government, and industry research. The specific molecules disclosed in the accompanying preprints are dedicated to the public. Commercial-use and patent-licensing terms are deferred and being arranged with ETH Zürich / BIIE; the SFM architectures and training methods are the subject of pending patent applications. For commercial enquiries: sai.reddy@ethz.ch
