GenoLeWM checkpoint and evidence bundle

GenoLeWM is an alpha research system for action-conditioned latent world models over genomic edits. This model repository publishes the stable v0.1 checkpoint package: trainable GenoLeWM predictor and action-encoder weights, calibration data, manifest-bound hashes, training evidence, evaluation evidence, and release checksums.

This is not a standard transformers.AutoModel.from_pretrained() checkpoint. Load it through the geno-lewm runtime. Carbon-500M is a frozen state encoder dependency and is resolved separately from this repository.

Claim Boundary

Use this checkpoint for reproducible research, local artifact inspection, method development, and demonstration scoring. Do not use it as a diagnostic model or clinical decision-support system, and do not cite it as evidence of deployment readiness, privacy protection, or broad superiority over Carbon. The measured results below are narrow artifact-level evaluations, and the broader v0.2.1 run-tree results are mixed or negative against the measured baselines.

No clinical utility claim is made.

At A Glance

| Item | Value |
| --- | --- |
| Stable package in this repo | geno-lewm-v0.1.0-r1 |
| Newer run-tree checkpoint used by the Space | geno-lewm-v0.2.1-r1 under abdelstark/geno-lewm-runs |
| Runtime | geno-lewm package, manifest verification, local Carbon-500M encoder |
| Main task | Single-SNV surprise scoring over reference windows |
| Strongest honest conclusion | Reproducible systems artifact with measured negative and mixed model-quality findings |
| Not supported | Clinical utility, deployment readiness, privacy assurance, runtime assurance, broad model superiority |

Which Checkpoint Should I Use?

| Target | Location | Use when |
| --- | --- | --- |
| Stable package | abdelstark/geno-lewm | You need the checksum-bound v0.1 release package with a stable manifest and package-level evidence. |
| Space/default demo checkpoint | geno-lewm-v021-strong-4f36eef-10k-r1/suite/model | You want the newer v0.2.1 checkpoint used by the Hugging Face Space and the broader benchmark/planning evidence. |
| Interactive demo | abdelstark/geno-lewm Space | You want to inspect artifacts or score a compatible single variant through the hosted UI. |

Published Artifacts

| Artifact | Location | Notes |
| --- | --- | --- |
| Stable model package | this repository | Public package geno-lewm-v0.1.0-r1. |
| Generated package card | model_card.md | Checksum-bound output from tools.release.model_package; kept terse so it can be re-rendered exactly. |
| Manifest | manifest.json | Canonical identity for weights, calibration, config, encoder, and eval report. |
| Package checksums | SHA256SUMS | Hashes for the v0.1 release package files. |
| Training evidence | training_run_manifest.json, training_run_card.md, training_run_SHA256SUMS | Carbon-backed training-run metadata and hashes. |
| Evaluation evidence | eval_metrics.json, eval_report.md, eval_config.effective.yaml | Held-out chr21 ClinVar evaluation for v0.1. |
| Efficiency evidence | efficiency_report.json | Release efficiency measurement for v0.1. |
| Dataset package | abdelstark/geno-lewm-data | Public data snapshot and data card. |
| v0.2.1 run tree | abdelstark/geno-lewm-runs | Newer checkpoint, broader benchmark suite, planning demo, and generated paper. |
| Generated paper | paper.serious-completion.md | Manuscript-style synthesis of experiments, learnings, negative findings, and benchmark limits. |

Model Identity

| Field | Value |
| --- | --- |
| Release id | geno-lewm-v0.1.0-r1 |
| Model version | 0.1.0 |
| Manifest id | sha256:861ec142cc87f3fac01751ef538553356dfba439e6da99064b4adb121e75c215 |
| Predictor artifact | predictor.safetensors |
| Predictor hash | sha256:6642c604a1352727969c86664f291fd6d2193c1c65bc6f9baf9b716469c52731 |
| Action encoder hash | sha256:8b2311d768855ab440b26dbbef5ddbda252cc8bb2c69509d28fa4bcf8eff025a |
| Calibration hash | sha256:d4cf4778ac8e5557d363aca43cd13723b0ed9983b83215ab164d2b642b886201 |
| Frozen encoder | Carbon-500M, mounted as /carbon in release jobs |
| Encoder revision | 5d31d59b3c845b288a13aedb1358934196852eec |
| Dataset snapshot | geno-lewm-data-v0.1.0-r1 |

The newer Space default checkpoint has model id sha256:cddb8f3b9671090201370b9824b9da741b933ff296b651238f022df5f3ed6af4 and lives in the v0.2.1 run tree. It is run evidence, not a replacement for the stable v0.1 package in this repository.

Method

GenoLeWM treats an edit as an action over a frozen genomic state embedding. For a reference window, Carbon-500M encodes the pre-edit state. GenoLeWM's trainable action encoder represents the edit, and the predictor estimates the post-edit latent state. Surprise scores are derived from the relationship between predicted and observed post-edit states, then calibrated by the packaged calibration artifact.
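The flow described above can be sketched in a few lines of plain Python. Everything named here is hypothetical: `toy_predictor` (a stand-in that simply adds the action embedding to the state), `surprise`, and the choice of cosine distance as the "relationship" between predicted and observed states are illustrative assumptions, not the packaged implementation. The calibration step is omitted.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: near 0 for aligned vectors, 2 for opposite ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def toy_predictor(pre_state, action_emb):
    # Hypothetical stand-in for the trained predictor: shift the frozen
    # pre-edit embedding by the action embedding.
    return [s + a for s, a in zip(pre_state, action_emb)]

def surprise(pre_state, action_emb, observed_post, predictor=toy_predictor):
    # Assumed scoring rule: distance between the predicted post-edit latent
    # and the latent the frozen encoder actually produces for the edited window.
    predicted_post = predictor(pre_state, action_emb)
    return cosine_distance(predicted_post, observed_post)

pre_state = [1.0, 0.0, 2.0]      # frozen-encoder embedding of the reference window
action_emb = [0.0, 1.0, -1.0]    # action-encoder embedding of the edit

expected_post = [1.0, 1.0, 1.0]      # matches the toy prediction
opposite_post = [-1.0, -1.0, -1.0]   # points the other way

low = surprise(pre_state, action_emb, expected_post)
high = surprise(pre_state, action_emb, opposite_post)
print(f"low surprise: {low:.6f}, high surprise: {high:.6f}")
```

In this toy setup an edit whose observed post-edit state agrees with the prediction scores near zero, and one that contradicts it scores near the maximum of 2.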

The package intentionally separates:

  • frozen state encoder: Carbon-500M, loaded separately;
  • trainable GenoLeWM weights: predictor.safetensors and action_encoder.safetensors;
  • release identity: manifest.json and SHA256SUMS;
  • evidence: training, evaluation, efficiency, and run-tree reports.
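Because release identity is carried by SHA256SUMS, a downloaded package can be checked before it is loaded. The sketch below assumes the file follows the common `sha256sum` layout (hex digest, whitespace, relative path); that layout and the helper names `sha256_of` and `verify_sha256sums` are assumptions, and the geno-lewm runtime may perform its own manifest verification differently.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_sha256sums(package_dir, sums_name="SHA256SUMS"):
    """Return the listed paths that are missing or whose hash does not match."""
    package_dir = Path(package_dir)
    failures = []
    for line in (package_dir / sums_name).read_text().splitlines():
        if not line.strip():
            continue
        # Assumed layout: "<hex digest>  <relative path>" (sha256sum style).
        expected, _, name = line.partition(" ")
        name = name.strip().lstrip("*")  # "*" marks binary mode in sha256sum output
        target = package_dir / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(name)
    return failures
```

Calling `verify_sha256sums(model_dir)` on a snapshot directory returns an empty list when every listed file matches its recorded hash.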

Install And Load

Install the package:

python -m pip install "geno-lewm[train,eval]==0.2.1"

Download the stable v0.1 model package:

from huggingface_hub import snapshot_download

model_dir = snapshot_download("abdelstark/geno-lewm")

Download the newer v0.2.1 run-tree model used by the Space:

from pathlib import Path
from huggingface_hub import snapshot_download

prefix = "geno-lewm-v021-strong-4f36eef-10k-r1/suite/model"
run_root = snapshot_download(
    "abdelstark/geno-lewm-runs",
    allow_patterns=f"{prefix}/*",
)
model_dir = Path(run_root) / prefix

Carbon-500M must also be available. The release manifests record the encoder as /carbon because training, evaluation, and demo jobs mounted HuggingFaceBio/Carbon-500M there at revision 5d31d59b3c845b288a13aedb1358934196852eec. The Space resolves and remaps that encoder from the Hub before scoring.
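For local runs, one way to obtain the same encoder revision is to pin it when downloading. This is a minimal sketch using the `revision` argument of `huggingface_hub.snapshot_download`; the helper name `fetch_carbon` is hypothetical, and how the resulting path is handed to the geno-lewm runtime in place of /carbon is not covered here.

```python
CARBON_REPO = "HuggingFaceBio/Carbon-500M"
CARBON_REVISION = "5d31d59b3c845b288a13aedb1358934196852eec"

def fetch_carbon(cache_dir=None):
    """Download the pinned Carbon-500M snapshot and return its local path."""
    # Imported lazily so the constants above can be inspected without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        CARBON_REPO,
        revision=CARBON_REVISION,  # pin to the revision recorded in the manifests
        cache_dir=cache_dir,
    )
```

Pinning the revision keeps the frozen state encoder identical to the one used for training and evaluation, which matters because the trainable weights were fitted against its latent space.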

Scoring Contract

The scorer is strict about reference consistency. The REF allele in chrom:pos:ref:alt must match the supplied reference window at the one-based locus implied by pos and --window-start-bp. If it does not match, scoring raises WindowMismatchError before model inference.
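The consistency rule can be illustrated with a small pre-flight check. This sketch assumes `pos` is one-based and `--window-start-bp` is the zero-based genomic offset of the first window base, which is the reading consistent with the synthetic example in this section. The function name is hypothetical, and it raises a plain `ValueError` rather than the runtime's `WindowMismatchError`.

```python
def check_ref_matches_window(variant, window, window_start_bp):
    """Return the window index of the variant if REF matches, else raise ValueError."""
    chrom, pos, ref, alt = variant.split(":")
    # Assumed coordinate convention: one-based pos, zero-based window start.
    offset = int(pos) - 1 - window_start_bp
    if offset < 0 or offset + len(ref) > len(window):
        raise ValueError(f"{variant} falls outside the supplied window")
    observed = window[offset:offset + len(ref)]
    if observed.upper() != ref.upper():
        raise ValueError(
            f"REF {ref!r} does not match window bases {observed!r} at index {offset}"
        )
    return offset

window = "ACGT" * 3072
index = check_ref_matches_window("chrSynthetic:3073:A:T", window, 0)
print(index, window[index])
```

Under this convention, position 3073 maps to index 3072 of the repeated `ACGT` window, which is an `A`, so the REF allele agrees with the window.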

Example single-variant score with a synthetic reference-matched window:

geno-lewm-score \
  --model-dir "$MODEL_DIR" \
  --backend auto \
  --variant chrSynthetic:3073:A:T \
  --window "$(python - <<'PY'
print("ACGT" * 3072)
PY
)" \
  --window-start-bp 0 \
  --receipt receipt.json

For real variants, use a FASTA-backed VCF path so the runtime builds windows from the same reference assembly as the variant coordinates.

v0.1 Training Summary

The stable v0.1 checkpoint was trained as a JEPA-style predictor over frozen Carbon-500M latent states.

| Field | Value |
| --- | --- |
| Run id | first-snv-carbon-500m-r1 |
| Config | training_config.effective.yaml |
| Commit | cd2bfccb33ec5a2df3c4707e8be8443f4682dad3 |
| Samples | 160,000 |
| Steps | 20,000 |
| Final training loss | 0.36124 |
| Status | completed |

v0.1 Evaluation

Held-out ClinVar GRCh38 chr21, binary P/LP versus B/LB labels. Scores use sigma_raw; intervals are deterministic stratified bootstrap confidence intervals from eval_metrics.json.

| Split | N | Positives | Negatives | Metric | Value | 95% CI |
| --- | --- | --- | --- | --- | --- | --- |
| eval_clinvar_chr21 | 3,000 | 494 | 2,506 | AUROC | 0.519160 | 0.491366 to 0.546846 |
| eval_clinvar_chr21 | 3,000 | 494 | 2,506 | Average precision | 0.165174 | 0.155331 to 0.177035 |
| eval_clinvar_chr21 | 3,000 | 494 | 2,506 | Balanced accuracy at 0.5 | 0.500000 | 0.500000 to 0.500000 |
| eval_clinvar_chr21 | 3,000 | 494 | 2,506 | Accuracy at 0.5 | 0.164667 | 0.164667 to 0.164667 |

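Interpretation: the AUROC confidence interval includes 0.5, so scores on this v0.1 slice are not distinguishable from chance ranking. The slice does not establish useful clinical performance, non-coding performance, multi-edit behavior, or superiority over Carbon.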
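The two threshold metrics can be cross-checked from the class counts alone. The arithmetic below shows that the reported accuracy and balanced accuracy are exactly what a classifier labelling every variant positive at the 0.5 threshold would produce. That all-positive reading is an inference from the numbers, not something the evaluation report states.

```python
n_total, n_pos, n_neg = 3000, 494, 2506

# Hypothetical confusion matrix if every score falls on the positive side of 0.5.
tp, fn = n_pos, 0
tn, fp = 0, n_neg

accuracy = (tp + tn) / n_total
balanced_accuracy = 0.5 * (tp / n_pos + tn / n_neg)
prevalence = n_pos / n_total  # also the average precision of an uninformative ranker

print(f"accuracy          = {accuracy:.6f}")
print(f"balanced accuracy = {balanced_accuracy:.6f}")
print(f"prevalence        = {prevalence:.6f}")
```

The computed accuracy equals the positive prevalence, 494 / 3,000, and rounds to the reported 0.164667. A constant prediction would also explain the zero-width bootstrap intervals on both threshold metrics, since a stratified bootstrap keeps the class counts fixed. The reported average precision of 0.165174 sits only slightly above that prevalence.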

v0.1 Efficiency

Measured by tools.release.efficiency_report on cuda:NVIDIA H200.

| Measurement | Value |
| --- | --- |
| Single-variant latency | 494.056 ms |
| Batched throughput | 2.024 variants/s |
| Peak memory | 1,152,656,384 bytes |
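A quick unit conversion helps put these figures side by side. The snippet only rearranges the reported numbers: it converts the single-variant latency into an implied serial rate and the peak memory into GiB.

```python
latency_ms = 494.056           # reported single-variant latency
batched_throughput = 2.024     # reported variants per second
peak_bytes = 1_152_656_384     # reported peak memory

implied_serial_rate = 1000.0 / latency_ms   # variants/s if scored one at a time
peak_gib = peak_bytes / 2**30

print(f"implied serial rate: {implied_serial_rate:.3f} variants/s")
print(f"reported batched throughput: {batched_throughput:.3f} variants/s")
print(f"peak memory: {peak_gib:.2f} GiB")
```

The implied serial rate is essentially the same as the reported batched throughput. One reading is that batching gave no measurable gain in this measurement, though the report itself does not say so.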

v0.2.1 Benchmark Evidence

The v0.2.1 run tree includes broader benchmark coverage and Carbon zero-shot comparisons. Results remain mixed or negative relative to the measured baselines.

| Slice | N | Metric | GenoLeWM | Baseline | Delta |
| --- | --- | --- | --- | --- | --- |
| ClinVar coding | 16 | AUROC | 0.734375 | 0.921875 | -0.187500 |
| ClinVar coding | 16 | Average precision | 0.852976 | 0.951923 | -0.098947 |
| ClinVar coding | 16 | Balanced accuracy | 0.750000 | 0.687500 | +0.062500 |
| ClinVar non-coding | 16 | AUROC | 0.562500 | 0.875000 | -0.312500 |
| ClinVar non-coding | 16 | Average precision | 0.605456 | 0.914423 | -0.308967 |
| ClinVar non-coding | 16 | Balanced accuracy | 0.437500 | 0.687500 | -0.250000 |
| BRCA2 saturation | 32 | Spearman rho | 0.149194 | 0.476906 | -0.327713 |
| TraitGym Mendelian | 32 | Spearman rho | -0.027965 | -0.083894 | +0.055929 |
| Phased-haplotype rollout | 8 | Cosine mean | 0.288861 | 0.997831 | -0.708970 |
| Synthetic edit-chain rollout | 8 | Cosine mean | 0.301608 | 0.991240 | -0.689631 |
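The Delta column reads as GenoLeWM minus Baseline. The check below recomputes it from the rounded table values and counts how many comparisons favour each side. The tolerance allows for last-digit rounding, since the published deltas were presumably computed from unrounded metrics.

```python
# (slice, metric, genolewm, baseline, reported_delta), copied from the table above
rows = [
    ("ClinVar coding", "AUROC", 0.734375, 0.921875, -0.187500),
    ("ClinVar coding", "Average precision", 0.852976, 0.951923, -0.098947),
    ("ClinVar coding", "Balanced accuracy", 0.750000, 0.687500, 0.062500),
    ("ClinVar non-coding", "AUROC", 0.562500, 0.875000, -0.312500),
    ("ClinVar non-coding", "Average precision", 0.605456, 0.914423, -0.308967),
    ("ClinVar non-coding", "Balanced accuracy", 0.437500, 0.687500, -0.250000),
    ("BRCA2 saturation", "Spearman rho", 0.149194, 0.476906, -0.327713),
    ("TraitGym Mendelian", "Spearman rho", -0.027965, -0.083894, 0.055929),
    ("Phased-haplotype rollout", "Cosine mean", 0.288861, 0.997831, -0.708970),
    ("Synthetic edit-chain rollout", "Cosine mean", 0.301608, 0.991240, -0.689631),
]

TOLERANCE = 1.5e-6  # one unit in the sixth decimal place, plus float slack

mismatches = [
    (name, metric)
    for name, metric, model, baseline, delta in rows
    if abs((model - baseline) - delta) > TOLERANCE
]
worse = sum(1 for *_, delta in rows if delta < 0)
better = sum(1 for *_, delta in rows if delta > 0)

print(f"mismatched rows: {mismatches}")
print(f"worse than baseline: {worse}, better than baseline: {better}")
```

Eight of the ten comparisons come out below the baseline. In the TraitGym row the delta is positive, but both correlations are negative and close to zero.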

The v0.2.1 readiness report is ok=true for artifact coverage and provenance. That is not a model-quality success claim. The autoregressive rollout speed report is ok=false: K=5 measured a 2.41x speedup against a 2x target and met it, while K=20 measured 2.47x against a 5x target and missed it.
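The speed-report outcome follows from a simple gate if the overall flag requires every rollout length to meet its target. That all-targets rule is an assumption about how the report aggregates, made explicit in this sketch.

```python
# rollout length K -> (measured speedup, target speedup), from the speed report
measurements = {5: (2.41, 2.0), 20: (2.47, 5.0)}

per_k = {k: speedup >= target for k, (speedup, target) in measurements.items()}
ok = all(per_k.values())  # assumed aggregation: every K must meet its target

print(per_k, ok)
```

Under that rule the K=20 miss alone is enough to make the report fail.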

The v0.2.1 efficiency report measured one sample with no warmup on cuda:NVIDIA H200: 115,262.94 ms single-variant latency, 0.3095 variants/s throughput, and 1,966,149,632 bytes peak memory. Treat that as run evidence, not a production serving benchmark.

Planning Demo Evidence

The v0.2.1 run tree includes a deterministic synthetic multi-SNV planning demo. It ran with manifest-backed runtime evaluation and recorded 384 evaluations, best_distance=23.656930390534644, and stopped_reason=patience.

This is not evidence of useful planning. It shows that the planner, checkpoint, window artifacts, and receipt path run end to end; it does not show biological or clinical planning utility.

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| WindowMismatchError: window bases do not match edit.ref_bases at locus | The variant REF allele does not match the supplied window at the requested coordinate. | Use a FASTA-backed VCF path or correct --window, --window-start-bp, and the variant REF allele so they describe the same reference sequence. |
| Space says scoring did not complete and mentions Carbon-500M | The runtime could not resolve or mount the frozen Carbon encoder, or optional ML dependencies failed to load. | Inspect the artifact panel, reload the Space after dependency startup, or run locally with Carbon-500M available. |
| AutoModel.from_pretrained("abdelstark/geno-lewm") fails | This repo is a GenoLeWM package, not a native Transformers architecture repo. | Use snapshot_download plus geno-lewm-score or the geno_lewm.deploy runtime. |

Limitations

  • Alpha research checkpoint; not a clinical, diagnostic, or deployment model.
  • v0.1 evaluation is narrow: held-out chr21 ClinVar P/LP versus B/LB labels.
  • v0.2.1 benchmark evidence is broader but mixed, with multiple negative deltas versus Carbon zero-shot and source-state rollout baselines.
  • Carbon-500M is required at runtime and is resolved separately from this model package.
  • Calibration is proof-scale and should be interpreted only within the reported artifact context.
  • Fixture outputs and UI demos are not model-quality evidence.

Citation And Reports

License

Apache-2.0 for GenoLeWM source and metadata. Upstream Carbon-500M and dataset terms apply to their respective artifacts.
