GenoLeWM checkpoint and evidence bundle
GenoLeWM is an alpha research system for action-conditioned latent world models over genomic edits. This model repository publishes the stable v0.1 checkpoint package: trainable GenoLeWM predictor and action-encoder weights, calibration data, manifest-bound hashes, training evidence, evaluation evidence, and release checksums.
This is not a standard transformers.AutoModel.from_pretrained() checkpoint.
Load it through the geno-lewm runtime. Carbon-500M is a frozen state encoder
dependency and is resolved separately from this repository.
Claim Boundary
Use this checkpoint for reproducible research, local artifact inspection, method development, and demonstration scoring. Do not use it as a diagnostic model or clinical decision-support system, and do not cite it as evidence of deployment readiness, privacy, or broad superiority over Carbon. The measured results below are narrow artifact-level evaluations, and the broader v0.2.1 run-tree results are mixed or negative against the measured baselines.
No clinical utility claim is made.
At A Glance
| Item | Value |
|---|---|
| Stable package in this repo | geno-lewm-v0.1.0-r1 |
| Newer run-tree checkpoint used by the Space | geno-lewm-v0.2.1-r1 under abdelstark/geno-lewm-runs |
| Runtime | geno-lewm package, manifest verification, local Carbon-500M encoder |
| Main task | Single-SNV surprise scoring over reference windows |
| Strongest honest conclusion | Reproducible systems artifact with measured negative and mixed model-quality findings |
| Not supported | Clinical utility, deployment readiness, privacy assurance, runtime assurance, broad model superiority |
Which Checkpoint Should I Use?
| Target | Location | Use when |
|---|---|---|
| Stable package | abdelstark/geno-lewm | You need the checksum-bound v0.1 release package with a stable manifest and package-level evidence. |
| Space/default demo checkpoint | geno-lewm-v021-strong-4f36eef-10k-r1/suite/model | You want the newer v0.2.1 checkpoint used by the Hugging Face Space and the broader benchmark/planning evidence. |
| Interactive demo | abdelstark/geno-lewm Space | You want to inspect artifacts or score a compatible single variant through the hosted UI. |
Published Artifacts
| Artifact | Location | Notes |
|---|---|---|
| Stable model package | this repository | Public package geno-lewm-v0.1.0-r1. |
| Generated package card | model_card.md | Checksum-bound output from tools.release.model_package; kept terse so it can be re-rendered exactly. |
| Manifest | manifest.json | Canonical identity for weights, calibration, config, encoder, and eval report. |
| Package checksums | SHA256SUMS | Hashes for the v0.1 release package files. |
| Training evidence | training_run_manifest.json, training_run_card.md, training_run_SHA256SUMS | Carbon-backed training-run metadata and hashes. |
| Evaluation evidence | eval_metrics.json, eval_report.md, eval_config.effective.yaml | Held-out chr21 ClinVar evaluation for v0.1. |
| Efficiency evidence | efficiency_report.json | Release efficiency measurement for v0.1. |
| Dataset package | abdelstark/geno-lewm-data | Public data snapshot and data card. |
| v0.2.1 run tree | abdelstark/geno-lewm-runs | Newer checkpoint, broader benchmark suite, planning demo, and generated paper. |
| Generated paper | paper.serious-completion.md | Manuscript-style synthesis of experiments, learnings, negative findings, and benchmark limits. |
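The package checksums can be re-checked locally after download. The following is a minimal sketch, assuming SHA256SUMS uses the common sha256sum layout (hex digest, whitespace, relative path, one entry per line). That layout is an assumption, and verify_sha256sums is a hypothetical helper written for illustration, not part of the geno-lewm package.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256sums(package_dir: Path, sums_name: str = "SHA256SUMS") -> dict[str, bool]:
    """Compare every listed file against its recorded digest.

    Assumes one '<hex digest> <relative path>' entry per line (sha256sum
    style). Hypothetical helper for illustration; not a geno-lewm API.
    """
    results: dict[str, bool] = {}
    for line in (package_dir / sums_name).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with '*'
        target = package_dir / name
        # A missing file counts as a failed check rather than an exception.
        results[name] = target.is_file() and sha256_file(target) == expected.lower()
    return results
```

Calling verify_sha256sums(Path(model_dir)) on a downloaded snapshot returns a mapping from file name to pass/fail, so any False entry points at a file that is missing or differs from the release.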
Model Identity
| Field | Value |
|---|---|
| Release id | geno-lewm-v0.1.0-r1 |
| Model version | 0.1.0 |
| Manifest id | sha256:861ec142cc87f3fac01751ef538553356dfba439e6da99064b4adb121e75c215 |
| Predictor artifact | predictor.safetensors |
| Predictor hash | sha256:6642c604a1352727969c86664f291fd6d2193c1c65bc6f9baf9b716469c52731 |
| Action encoder hash | sha256:8b2311d768855ab440b26dbbef5ddbda252cc8bb2c69509d28fa4bcf8eff025a |
| Calibration hash | sha256:d4cf4778ac8e5557d363aca43cd13723b0ed9983b83215ab164d2b642b886201 |
| Frozen encoder | Carbon-500M, mounted as /carbon in release jobs |
| Encoder revision | 5d31d59b3c845b288a13aedb1358934196852eec |
| Dataset snapshot | geno-lewm-data-v0.1.0-r1 |
The newer Space default checkpoint has model id
sha256:cddb8f3b9671090201370b9824b9da741b933ff296b651238f022df5f3ed6af4
and lives in the v0.2.1 run tree. It is run evidence, not a replacement for
the stable v0.1 package in this repository.
Method
GenoLeWM treats an edit as an action over a frozen genomic state embedding. For a reference window, Carbon-500M encodes the pre-edit state. GenoLeWM's trainable action encoder represents the edit, and the predictor estimates the post-edit latent state. Surprise scores are derived from the relationship between predicted and observed post-edit states, then calibrated by the packaged calibration artifact.
The package intentionally separates:
- frozen state encoder: Carbon-500M, loaded separately;
- trainable GenoLeWM weights: predictor.safetensors and action_encoder.safetensors;
- release identity: manifest.json and SHA256SUMS;
- evidence: training, evaluation, efficiency, and run-tree reports.
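The data flow described in this section can be sketched with toy stand-ins. Everything below is illustrative: the encoders and predictor are tiny hand-written functions, not Carbon-500M or the GenoLeWM weights, and Euclidean distance is an assumed choice for the relationship between predicted and observed post-edit states. The packaged calibration step is omitted.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]
Edit = tuple[int, str, str]  # (zero-based offset, ref base, alt base); hypothetical layout
BASES = "ACGT"
WINDOW_LEN = 8  # toy window length used by the stand-ins below


def surprise_score(
    ref_window: str,
    alt_window: str,
    edit: Edit,
    encode_state: Callable[[str], Vector],
    encode_action: Callable[[Edit], Vector],
    predict: Callable[[Vector, Vector], Vector],
) -> float:
    """Distance between predicted and observed post-edit latent states."""
    z_pre = encode_state(ref_window)    # frozen encoder: pre-edit state
    action = encode_action(edit)        # trainable action encoder
    z_pred = predict(z_pre, action)     # trainable predictor
    z_post = encode_state(alt_window)   # frozen encoder: observed post-edit state
    return math.dist(z_pred, z_post)    # assumed metric, uncalibrated


def toy_encode_state(window: str) -> list[float]:
    """Stand-in state encoder: base-composition fractions."""
    return [window.count(base) / len(window) for base in BASES]


def toy_encode_action(edit: Edit) -> list[float]:
    """Stand-in action encoder: +1 for the alt base, -1 for the ref base."""
    _, ref, alt = edit
    return [float((base == alt) - (base == ref)) for base in BASES]


def toy_predict(z_pre: Vector, action: Vector) -> list[float]:
    """Stand-in predictor that knows how one substitution shifts composition."""
    return [z + a / WINDOW_LEN for z, a in zip(z_pre, action)]


def identity_predict(z_pre: Vector, action: Vector) -> list[float]:
    """Stand-in predictor that ignores the edit entirely."""
    return list(z_pre)


ref, alt, edit = "AACCGGTT", "AACCGGTA", (7, "T", "A")
well_predicted = surprise_score(ref, alt, edit, toy_encode_state, toy_encode_action, toy_predict)
ignored_edit = surprise_score(ref, alt, edit, toy_encode_state, toy_encode_action, identity_predict)
print(well_predicted, ignored_edit)
```

In this toy setup a predictor that models the edit exactly yields zero surprise, while one that ignores the edit yields a positive score. Only the state encoder is shared between the pre-edit and post-edit paths, which mirrors the frozen/trainable split listed above.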
Install And Load
Install the package:
python -m pip install "geno-lewm[train,eval]==0.2.1"
Download the stable v0.1 model package:
from huggingface_hub import snapshot_download
model_dir = snapshot_download("abdelstark/geno-lewm")
Download the newer v0.2.1 run-tree model used by the Space:
from pathlib import Path
from huggingface_hub import snapshot_download
prefix = "geno-lewm-v021-strong-4f36eef-10k-r1/suite/model"
run_root = snapshot_download(
    "abdelstark/geno-lewm-runs",
    allow_patterns=f"{prefix}/*",
)
model_dir = Path(run_root) / prefix
Carbon-500M must also be available. The release manifests record the encoder as
/carbon because training, evaluation, and demo jobs mounted
HuggingFaceBio/Carbon-500M there at revision
5d31d59b3c845b288a13aedb1358934196852eec. The Space resolves and remaps that
encoder from the Hub before scoring.
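For a local run, the encoder can be fetched at the same pinned revision. This is a small sketch using the revision argument of huggingface_hub.snapshot_download; download_carbon is a hypothetical helper name, and the sketch does not cover how the geno-lewm runtime is pointed at the resulting directory or whether the encoder repository requires access approval.

```python
CARBON_REPO_ID = "HuggingFaceBio/Carbon-500M"
CARBON_REVISION = "5d31d59b3c845b288a13aedb1358934196852eec"


def download_carbon() -> str:
    """Download the pinned Carbon-500M snapshot and return its local path.

    Hypothetical helper for illustration; requires network access and the
    huggingface_hub package.
    """
    from huggingface_hub import snapshot_download  # imported lazily on purpose

    return snapshot_download(CARBON_REPO_ID, revision=CARBON_REVISION)


# Usage (performs a network download):
#     carbon_dir = download_carbon()
```

Pinning the revision keeps the frozen state encoder identical to the one recorded in the release manifests, rather than whatever the default branch currently holds.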
Scoring Contract
The scorer is strict about reference consistency. The REF allele in
chrom:pos:ref:alt must match the supplied reference window at the one-based
locus implied by pos and --window-start-bp. If it does not match, scoring
raises WindowMismatchError before model inference.
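The consistency rule can be illustrated with a standalone check. This is a sketch, not the runtime's implementation: it assumes the window index of a variant is pos - 1 - window_start_bp (one-based pos, zero-based window start), and the exception class is a local stand-in that only borrows the name used above.

```python
class WindowMismatchError(ValueError):
    """Local stand-in for the runtime error named in the text."""


def check_ref_matches_window(variant: str, window: str, window_start_bp: int) -> int:
    """Return the window offset of the variant, or raise on a REF mismatch.

    Assumed convention: offset = pos - 1 - window_start_bp.
    """
    chrom, pos_text, ref, alt = variant.split(":")
    offset = int(pos_text) - 1 - window_start_bp
    end = offset + len(ref)
    if offset < 0 or end > len(window):
        raise WindowMismatchError(f"{chrom}:{pos_text} falls outside the supplied window")
    observed = window[offset:end]
    if observed.upper() != ref.upper():
        raise WindowMismatchError(
            f"window has {observed!r} at offset {offset}, variant REF is {ref!r}"
        )
    return offset


window = "ACGT" * 3072
print(check_ref_matches_window("chrSynthetic:3073:A:T", window, 0))
```

Under this convention, position 3073 maps to offset 3072, where the repeating ACGT window holds an A, so the REF allele matches. Shifting the position by one base lands on a C and raises the error before any model work would start.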
Example single-variant score with a synthetic reference-matched window:
geno-lewm-score \
  --model-dir "$MODEL_DIR" \
  --backend auto \
  --variant chrSynthetic:3073:A:T \
  --window "$(python - <<'PY'
print("ACGT" * 3072)
PY
)" \
  --window-start-bp 0 \
  --receipt receipt.json
For real variants, use a FASTA-backed VCF path so the runtime builds windows from the same reference assembly as the variant coordinates.
v0.1 Training Summary
The stable v0.1 checkpoint was trained as a JEPA-style predictor over frozen Carbon-500M latent states.
| Field | Value |
|---|---|
| Run id | first-snv-carbon-500m-r1 |
| Config | training_config.effective.yaml |
| Commit | cd2bfccb33ec5a2df3c4707e8be8443f4682dad3 |
| Samples | 160,000 |
| Steps | 20,000 |
| Final training loss | 0.36124 |
| Status | completed |
v0.1 Evaluation
Held-out ClinVar GRCh38 chr21, binary P/LP versus B/LB labels. Scores use
sigma_raw; intervals are deterministic stratified bootstrap confidence
intervals from eval_metrics.json.
| Split | N | Positives | Negatives | Metric | Value | 95% CI |
|---|---|---|---|---|---|---|
| eval_clinvar_chr21 | 3,000 | 494 | 2,506 | AUROC | 0.519160 | 0.491366 to 0.546846 |
| eval_clinvar_chr21 | 3,000 | 494 | 2,506 | Average precision | 0.165174 | 0.155331 to 0.177035 |
| eval_clinvar_chr21 | 3,000 | 494 | 2,506 | Balanced accuracy at 0.5 | 0.500000 | 0.500000 to 0.500000 |
| eval_clinvar_chr21 | 3,000 | 494 | 2,506 | Accuracy at 0.5 | 0.164667 | 0.164667 to 0.164667 |
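The two thresholded rows can be cross-checked with simple arithmetic. The sketch below tests one hypothesis: that every variant scores above the 0.5 threshold and is therefore predicted positive. That hypothesis is an inference from the reported numbers, not something the evaluation files are quoted as stating here.

```python
positives, negatives = 494, 2506
total = positives + negatives

# Hypothesis: all 3,000 variants are predicted positive at threshold 0.5.
tp, fn = positives, 0
tn, fp = 0, negatives

accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
balanced_accuracy = (sensitivity + specificity) / 2
prevalence = positives / total  # chance-level average precision

print(round(accuracy, 6), balanced_accuracy, round(prevalence, 6))
```

Under this hypothesis accuracy equals the positive prevalence (494 / 3,000) and balanced accuracy is exactly 0.5, which matches both reported values. A constant prediction would also explain the zero-width intervals, since a stratified resample keeps the class counts fixed. The reported average precision of 0.165174 sits close to the same prevalence, which is the value a non-informative ranking would be expected to reach.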
Interpretation: this v0.1 slice does not establish useful clinical performance, non-coding performance, multi-edit behavior, or superiority over Carbon.
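For readers unfamiliar with the interval method named above, here is a minimal sketch of a seeded, class-stratified percentile bootstrap for AUROC. It is an assumption-laden illustration: the resample count, percentile rule, and seed are placeholders, and the actual evaluation tooling may differ in all of them.

```python
import random
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC as the chance a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_bootstrap_ci(
    scores: Sequence[float],
    labels: Sequence[int],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile interval for AUROC, resampling each class separately.

    A fixed seed makes the interval deterministic; stratification keeps
    the positive and negative counts equal to the original split.
    """
    rng = random.Random(seed)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    resampled_labels = [1] * len(pos) + [0] * len(neg)
    stats = []
    for _ in range(n_resamples):
        draw = rng.choices(pos, k=len(pos)) + rng.choices(neg, k=len(neg))
        stats.append(auroc(draw, resampled_labels))
    stats.sort()
    lower = stats[int((alpha / 2) * (n_resamples - 1))]
    upper = stats[int((1 - alpha / 2) * (n_resamples - 1))]
    return lower, upper
```

The pairwise AUROC here is quadratic in sample size and is meant for clarity; a rank-based implementation would be the practical choice at 3,000 variants.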
v0.1 Efficiency
Measured by tools.release.efficiency_report on cuda:NVIDIA H200.
| Measurement | Value |
|---|---|
| Single-variant latency | 494.056 ms |
| Batched throughput | 2.024 variants/s |
| Peak memory | 1,152,656,384 bytes |
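The three figures can be related with a short calculation. This sketch only converts units and compares the reported batched throughput with the rate implied by running single variants back to back; it makes no claim about how the report chose its batch size.

```python
latency_ms = 494.056            # reported single-variant latency
reported_throughput = 2.024     # reported batched throughput, variants/s
peak_memory_bytes = 1_152_656_384

sequential_rate = 1000.0 / latency_ms            # variants/s with no batching
batching_gain = reported_throughput / sequential_rate
peak_memory_gib = peak_memory_bytes / 2**30

print(f"{sequential_rate:.3f} variants/s sequential")
print(f"{batching_gain:.3f}x batched vs sequential")
print(f"{peak_memory_gib:.2f} GiB peak memory")
```

The batched throughput is essentially the reciprocal of the single-variant latency, which suggests that batching gave no measurable gain in this measurement.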
v0.2.1 Benchmark Evidence
The v0.2.1 run tree includes broader benchmark coverage and Carbon zero-shot comparisons. Results remain mixed or negative relative to the measured baselines.
| Slice | N | Metric | GenoLeWM | Baseline | Delta |
|---|---|---|---|---|---|
| ClinVar coding | 16 | AUROC | 0.734375 | 0.921875 | -0.187500 |
| ClinVar coding | 16 | Average precision | 0.852976 | 0.951923 | -0.098947 |
| ClinVar coding | 16 | Balanced accuracy | 0.750000 | 0.687500 | +0.062500 |
| ClinVar non-coding | 16 | AUROC | 0.562500 | 0.875000 | -0.312500 |
| ClinVar non-coding | 16 | Average precision | 0.605456 | 0.914423 | -0.308967 |
| ClinVar non-coding | 16 | Balanced accuracy | 0.437500 | 0.687500 | -0.250000 |
| BRCA2 saturation | 32 | Spearman rho | 0.149194 | 0.476906 | -0.327713 |
| TraitGym Mendelian | 32 | Spearman rho | -0.027965 | -0.083894 | +0.055929 |
| Phased-haplotype rollout | 8 | Cosine mean | 0.288861 | 0.997831 | -0.708970 |
| Synthetic edit-chain rollout | 8 | Cosine mean | 0.301608 | 0.991240 | -0.689631 |
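The Delta column can be recomputed from the other two. The sketch below assumes Delta means GenoLeWM minus Baseline and allows a small tolerance, because the published values appear to be rounded from higher-precision numbers.

```python
ROWS = [
    # (slice, metric, genolewm, baseline, reported_delta)
    ("ClinVar coding", "AUROC", 0.734375, 0.921875, -0.187500),
    ("ClinVar coding", "Average precision", 0.852976, 0.951923, -0.098947),
    ("ClinVar coding", "Balanced accuracy", 0.750000, 0.687500, 0.062500),
    ("ClinVar non-coding", "AUROC", 0.562500, 0.875000, -0.312500),
    ("ClinVar non-coding", "Average precision", 0.605456, 0.914423, -0.308967),
    ("ClinVar non-coding", "Balanced accuracy", 0.437500, 0.687500, -0.250000),
    ("BRCA2 saturation", "Spearman rho", 0.149194, 0.476906, -0.327713),
    ("TraitGym Mendelian", "Spearman rho", -0.027965, -0.083894, 0.055929),
    ("Phased-haplotype rollout", "Cosine mean", 0.288861, 0.997831, -0.708970),
    ("Synthetic edit-chain rollout", "Cosine mean", 0.301608, 0.991240, -0.689631),
]


def delta_mismatches(rows, tolerance: float = 2e-6) -> list[str]:
    """Return the slices whose reported delta differs from genolewm - baseline."""
    return [
        f"{name} / {metric}"
        for name, metric, model, baseline, reported in rows
        if abs((model - baseline) - reported) > tolerance
    ]


negative = sum(1 for row in ROWS if row[4] < 0)
print(delta_mismatches(ROWS), negative, len(ROWS) - negative)
```

All ten rows agree with the subtraction within rounding, and eight of the ten deltas are negative. Of the two positive deltas, the TraitGym one compares two negative correlations, so it reflects a value closer to zero rather than a positive association.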
The v0.2.1 readiness report is ok=true for artifact coverage and provenance.
That is not a model-quality success claim. The autoregressive rollout speed
report is ok=false: K=5 measured a 2.41x speedup and met its 2x target, but
K=20 measured 2.47x and missed its 5x target.
The v0.2.1 efficiency report measured one sample with no warmup on
cuda:NVIDIA H200: 115,262.94 ms single-variant latency, 0.3095 variants/s
throughput, and 1,966,149,632 bytes peak memory. Treat that as run evidence,
not a production serving benchmark.
Planning Demo Evidence
The v0.2.1 run tree includes a deterministic synthetic multi-SNV planning demo.
It ran with manifest-backed runtime evaluation and recorded 384 evaluations,
best_distance=23.656930390534644, and stopped_reason=patience.
This is not evidence of useful planning. It shows that the planner, checkpoint, window artifacts, and receipt path run end to end; it does not show biological or clinical planning utility.
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| WindowMismatchError: window bases do not match edit.ref_bases at locus | The variant REF allele does not match the supplied window at the requested coordinate. | Use a FASTA-backed VCF path or correct --window, --window-start-bp, and the variant REF allele so they describe the same reference sequence. |
| Space says scoring did not complete and mentions Carbon-500M | The runtime could not resolve or mount the frozen Carbon encoder, or optional ML dependencies failed to load. | Inspect the artifact panel, reload the Space after dependency startup, or run locally with Carbon-500M available. |
| AutoModel.from_pretrained("abdelstark/geno-lewm") fails | This repo is a GenoLeWM package, not a native Transformers architecture repo. | Use snapshot_download plus geno-lewm-score or the geno_lewm.deploy runtime. |
Limitations
- Alpha research checkpoint; not a clinical, diagnostic, or deployment model.
- v0.1 evaluation is narrow: held-out chr21 ClinVar P/LP versus B/LB labels.
- v0.2.1 benchmark evidence is broader but mixed, with multiple negative deltas versus Carbon zero-shot and source-state rollout baselines.
- Carbon-500M is required at runtime and is resolved separately from this model package.
- Calibration is proof-scale and should be interpreted only within the reported artifact context.
- Fixture outputs and UI demos are not model-quality evidence.
Citation And Reports
- Paper: paper.serious-completion.md
- Paper verifier: paper_package_report.json
- Project docs: https://abdelstark.github.io/GenoLeWM/
- Source repository: https://github.com/AbdelStark/GenoLeWM
License
Apache-2.0 for GenoLeWM source and metadata. Upstream Carbon-500M and dataset terms apply to their respective artifacts.