---
license: other
license_name: sfm-research-preview-license-v1.0
license_link: LICENSE.md
tags:
- immunology
- mhc
- peptide
- antigen-presentation
- vc-sfm
- biology
- bioinformatics
pipeline_tag: feature-extraction
---

# mhcSFM: Peptide ↔ MHC Class I Binding Specificity Foundation Model

**Paper:** Vibe Coding Specificity Foundation Models · doi: [10.64898/2026.06.04.730134](https://doi.org/10.64898/2026.06.04.730134)

**All VC-SFM models:** [huggingface.co/SFM-BIIE-ETHZ](https://huggingface.co/SFM-BIIE-ETHZ)

**Code:** [github.com/SFM-BIIE-ETHZ/Vibe-Coding-SFMs](https://github.com/SFM-BIIE-ETHZ/Vibe-Coding-SFMs)

---

## What it does

This SFM learns a joint embedding space for **peptides** and **MHC class I alleles** via contrastive learning on binding-affinity data. Given a peptide, it retrieves the MHC alleles most likely to present it, which gives pan-allele prediction without per-allele fine-tuning.

| Component | Model |
|-----------|-------|
| Agent encoder | ESM-2 |
| Target encoder | ESM-2 (34-aa HLA pseudo-sequence) |
| Training data | NetMHCpan 4.1 binding-affinity corpus (38,837 peptides / 105 alleles; 168,710 pairs) |
| Split | Sequence-identity 100% holdout · fold 0 |

---

## Performance: pool-512 retrieval (from the paper)

Evaluated by **pool-512 retrieval**: each test pair's true target is placed in a pool of 512 candidates (1 positive + 511 random negatives), the candidates are scored by cosine similarity, and the result is averaged over 100 random trials at the best-validation checkpoint. The random baseline for R@1 is 1/512 ≈ 0.2%.

Values are the **mean ± SD over folds 0–3** of the 5-fold cross-validation (fold 4 excluded for split degeneracy), as reported in the paper for this SFM. The released checkpoint is the **fold-0, identity-100** model trained with the identical configuration, data, and split.

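The pool-512 protocol described above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: `cosine` and `pool_recall_at_k` are hypothetical helper names, negatives are assumed to be drawn uniformly without replacement from the targets of the other test pairs, and ties are assumed to be resolved in favor of the positive. Plain Python lists are used for readability; a real evaluation would likely batch this with tensors.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def pool_recall_at_k(query_embs, target_embs, ks=(1, 5, 10),
                     pool_size=512, trials=100, seed=0):
    """Estimate Recall@k (in %) where target_embs[i] is the true match of query_embs[i].

    In every trial, each query's true target is ranked against up to
    pool_size - 1 negatives sampled from the other pairs' targets
    (an assumed sampling scheme, see the lead-in).
    """
    rng = random.Random(seed)
    n = len(query_embs)
    n_neg = min(pool_size - 1, n - 1)  # cannot draw more negatives than exist
    hits = {k: 0 for k in ks}
    for _ in range(trials):
        for i, q in enumerate(query_embs):
            others = [j for j in range(n) if j != i]
            negatives = rng.sample(others, n_neg)
            pos_score = cosine(q, target_embs[i])
            # Rank = 1 + number of negatives scoring strictly above the positive.
            rank = 1 + sum(cosine(q, target_embs[j]) > pos_score for j in negatives)
            for k in ks:
                hits[k] += rank <= k
    total = trials * n
    return {k: 100.0 * hits[k] / total for k in ks}


# Toy check: 8 one-hot "embeddings" where each query equals its own target,
# so the positive always ranks first.
eye = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
print(pool_recall_at_k(eye, eye, pool_size=4, trials=3))
# → {1: 100.0, 5: 100.0, 10: 100.0}
```

Under this scheme a scorer that ranks the pool at random reaches R@k ≈ k/512, which is where the ≈ 0.2% R@1 baseline comes from.
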
| Direction | R@1 (%) | R@5 (%) | R@10 (%) |
|-----------|---------|---------|----------|
| peptide → HLA | 65.1 ± 1.9 | 78.4 ± 1.6 | 81.8 ± 2.2 |
| HLA → peptide | 93.3 ± 2.6 | 99.1 ± 1.1 | 99.5 ± 0.6 |

*This released checkpoint is the original fold-0 model; its own pool-512 eval gives **66.1% (peptide→HLA) / 92.1% (HLA→peptide)** R@1, matching the paper's fold-0 values (64.6 / 95.0) within eval noise. On a held-out rare-allele (zero-shot) set, HLA→peptide R@1 = **95.4%**.*

---

## Quick start

```python
from huggingface_hub import hf_hub_download
import torch
import torch.nn.functional as F

# Download the released checkpoint from the Hugging Face Hub
ckpt_path = hf_hub_download("SFM-BIIE-ETHZ/mhcSFM_VC-SFM", "model.pth")

# Load with the Vibe-Coding-SFMs codebase
# (https://github.com/SFM-BIIE-ETHZ/Vibe-Coding-SFMs)
from calm.encoder.model import CALMEncoder

model = CALMEncoder.from_pretrained(ckpt_path)
model.eval()

# Inference only, so skip gradient tracking
with torch.no_grad():
    agent_emb = model.encode_query("GILGFVFTL")  # peptide
    target_emb = model.encode_target("YFAMYQENVAQTDVDTLYIIYRDAQTFQV")  # HLA pseudo-seq

# Cosine similarity in the joint embedding space
score = F.cosine_similarity(agent_emb, target_emb, dim=-1)
```

---

## Files in this repo

| File | Description |
|------|-------------|
| `model.pth` | Released checkpoint · fold 0 · identity-100 split |
| `results_train_val_test.csv` | Per-epoch training/validation/test logs (training-time batch metrics, **not** the pool-512 numbers above) |

---

## Citation

```bibtex
@article{reddy2026vcsfm,
  title   = {Vibe Coding Specificity Foundation Models},
  author  = {Reddy, Sai T.},
  journal = {bioRxiv},
  year    = {2026},
  doi     = {10.64898/2026.06.04.730134}
}
```

## License

Released under the **SFM Research Preview License v1.0-preview** (see `LICENSE.md`). Free for research use: academic, non-profit, government, and industry research. The specific molecules disclosed in the accompanying preprints are dedicated to the public. Commercial-use and patent-licensing terms are deferred and being arranged with ETH Zürich / BIIE; the SFM architectures and training methods are the subject of pending patent applications.

For commercial enquiries: sai.reddy@ethz.ch