---
title: PeVe — Deterministic Variant Reasoning Engine
emoji: 🧬
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 4.20.0
app_file: app.py
pinned: false
license: mit
python_version: "3.10"
---

# PeVe v1.1 — Deterministic Variant Reasoning Engine

**PeVe** (Pathogenicity Evidence engine) is a three-layer biological mechanism framework
for genomic variant interpretation. It integrates three Hugging Face models and applies
a deterministic, non-linear evidence synthesis engine to produce structured mechanism
classifications.

---

## Architecture

PeVe does NOT:
- Average model probabilities
- Use confidence scoring or Monte Carlo uncertainty
- Perform weighted ensembling
- Use auto-updating thresholds

PeVe DOES:
- Run three biologically distinct models in parallel
- Apply fixed, versioned activation thresholds
- Use hierarchical dominance logic (not voting)
- Apply a tiered conflict taxonomy
- Generate deterministic, template-based reasoning narratives

---

## Three Biological Layers

### Layer 1 — RNA Mechanism (`mutation-predictor-splice`)
**Biological question:** Is RNA splicing disrupted by this mutation?

Inputs: 401bp sequence window, mutation encoding, splice-region flags  
Outputs: `splice_prob`, `splice_signal_strength`, `counterfactual_delta`, `saliency_map`

Activation: `splice_prob ≥ 0.8 AND splice_signal_strength ≥ 0.65`  
Dominant: `splice_prob ≥ 0.9` (High band only)
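
The two lines above can be read as a three-way banding of the Layer 1 output. The sketch below is one way to express that, not the code in `decision_engine.py`. The `RnaBand` enum and the `rna_band` function are hypothetical names. It assumes "Moderate" means active but below the High cut-off, and that the High band also requires the signal-strength condition.

```python
from enum import Enum

# Threshold values copied from this README.
SPLICE_PROB_ACTIVE = 0.80
SPLICE_PROB_HIGH = 0.90
SPLICE_SIGNAL_ACTIVE = 0.65


class RnaBand(Enum):
    INACTIVE = "inactive"
    MODERATE = "moderate"  # assumption: active, but below the High band
    HIGH = "high"


def rna_band(splice_prob: float, splice_signal_strength: float) -> RnaBand:
    """Map Layer 1 outputs to a band using fixed comparisons only."""
    active = (
        splice_prob >= SPLICE_PROB_ACTIVE
        and splice_signal_strength >= SPLICE_SIGNAL_ACTIVE
    )
    if not active:
        return RnaBand.INACTIVE
    if splice_prob >= SPLICE_PROB_HIGH:
        return RnaBand.HIGH
    return RnaBand.MODERATE
```

Because the function contains only comparisons against constants, the same inputs always yield the same band.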

---

### Layer 2 — Sequence Context (`mutation-predictor-v4`)
**Biological question:** Does the local DNA sequence context show a disruptive signal centred on the mutation?

Inputs: 401bp sequence window, mutation encoding (NO splice flags — prevents leakage)  
Outputs: `context_pathogenic_prob`, `activation_norm`, `activation_peak_position`, `importance_score`

Activation: `activation_norm ≥ 0.70`

---

### Layer 3 — Protein & Population (`mutation-pathogenicity-predictor`)
**Biological question:** Do protein biochemical impact and population rarity support pathogenicity?

Inputs: gnomAD AF, Grantham score, charge change, hydrophobicity difference, protein position, VEP IMPACT  
Outputs: `biochemical_risk_score`, `shap_feature_contributions`, `feature_pathogenic_prob`

Activation: `biochemical_risk_score ≥ 0.6 AND AF < 0.001`

---

## Dominance Hierarchy (Synthesis Rules)

```
Rule 1:  RNA High (≥0.9)                    → dominant = RNA_Splicing
Rule 1b: RNA Moderate + Protein Active      → dominant = Mechanism_Ambiguity
Rule 2:  RNA inactive + Protein Active      → dominant = Protein_Biochemical
Rule 3:  RNA inactive + Protein inactive
         + Context Active                   → dominant = Sequence_Context
Rule 4:  None active                        → Insufficient_Evidence
```

All three layers always execute. Routing modifies interpretation priority only.
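
The rule table can be sketched as a chain of ordered checks, which is what makes it a hierarchy rather than a vote. This is a minimal illustration, not the engine's actual code. The function name and the string labels for the RNA band are assumptions. The table does not say what happens when RNA is Moderate and Protein is inactive, so the sketch raises an error there instead of guessing.

```python
def dominant_mechanism(rna_band: str, protein_active: bool, context_active: bool) -> str:
    """Return the dominant mechanism for one variant.

    rna_band is one of "high", "moderate", "inactive" (hypothetical labels).
    """
    if rna_band not in {"high", "moderate", "inactive"}:
        raise ValueError(f"Unknown RNA band: {rna_band!r}")

    if rna_band == "high":
        return "RNA_Splicing"  # Rule 1
    if rna_band == "moderate":
        if protein_active:
            return "Mechanism_Ambiguity"  # Rule 1b
        # Not covered by the rule table in this README.
        raise NotImplementedError("RNA Moderate with Protein inactive is unspecified")

    # From here on RNA is inactive.
    if protein_active:
        return "Protein_Biochemical"  # Rule 2
    if context_active:
        return "Sequence_Context"  # Rule 3
    return "Insufficient_Evidence"  # Rule 4
```

Note that a High RNA band wins even when the other two layers are active: `dominant_mechanism("high", True, True)` returns `"RNA_Splicing"`.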

---

## Conflict Taxonomy

### Major Conflicts (any 1 → Manual Review)
- High `splice_prob` + `AF > 0.01` — splice disruption contradicted by population frequency
- High biochemical risk + `AF > 0.01` — protein disruption contradicted by common variant
- Canonical splice site destroyed + splice model inactive — annotation/model disagreement

### Minor Conflicts (2+ → Manual Review)
- Activation value within ±0.05 of decision threshold
- Activation peak >10% of window (40bp) from mutation centre
- High context signal + benign VEP consequence (synonymous/intronic)
- AF state is `AF_UNKNOWN` or `AF_UNCERTAIN`
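
The escalation rule (one major conflict, or two or more minor ones) and the peak-offset check can be sketched as follows. The function names are hypothetical. The sketch also assumes `activation_peak_position` is a 0-based index into the 401bp window, so the mutation centre sits at index 200.

```python
WINDOW_CENTRE_INDEX = 200  # assumption: 0-based centre of a 401bp window
MAX_PEAK_OFFSET_BP = 40    # "10% of window (40bp)" from the list above


def peak_off_centre(activation_peak_position: int) -> bool:
    """Minor conflict: activation peak lies more than 40bp from the centre."""
    return abs(activation_peak_position - WINDOW_CENTRE_INDEX) > MAX_PEAK_OFFSET_BP


def needs_manual_review(major_conflicts: list[str], minor_conflicts: list[str]) -> bool:
    """Any major conflict, or at least two minor conflicts, triggers review."""
    return len(major_conflicts) >= 1 or len(minor_conflicts) >= 2
```

A single minor conflict on its own is recorded but does not trigger manual review.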

---

## Variant Class Pre-filter

Before any model runs, variants are categorised:

| Class | L3 Biochemistry | RNA Priority |
|---|---|---|
| substitution_missense | ✓ Valid | Normal |
| substitution_synonymous | ✓ Valid | Normal |
| canonical_splice | ✗ Supportive only | ↑ Elevated |
| frameshift | ✗ Not Applicable | Normal |
| stop_gained | ✗ Not Applicable | Normal |
| start_lost | ✗ Not Applicable | Normal |
| in_frame_indel | ✗ Not Applicable | Normal |
| deep_intronic | Contextual | ↓ De-prioritised |
| utr_regulatory | ✗ Not Applicable | ✗ Out of scope in v1.1 |
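
The table above could be held as a plain lookup so that the pre-filter is a dictionary access rather than branching logic. This is a sketch only. `ClassPolicy`, `PREFILTER_POLICY`, `policy_for` and the lower-case status strings are illustrative names and may not match `prefilter.py`.

```python
from typing import NamedTuple


class ClassPolicy(NamedTuple):
    l3_biochemistry: str  # how Layer 3 biochemistry is treated
    rna_priority: str     # how Layer 1 is prioritised


# One entry per row of the pre-filter table.
PREFILTER_POLICY = {
    "substitution_missense": ClassPolicy("valid", "normal"),
    "substitution_synonymous": ClassPolicy("valid", "normal"),
    "canonical_splice": ClassPolicy("supportive_only", "elevated"),
    "frameshift": ClassPolicy("not_applicable", "normal"),
    "stop_gained": ClassPolicy("not_applicable", "normal"),
    "start_lost": ClassPolicy("not_applicable", "normal"),
    "in_frame_indel": ClassPolicy("not_applicable", "normal"),
    "deep_intronic": ClassPolicy("contextual", "deprioritised"),
    "utr_regulatory": ClassPolicy("not_applicable", "out_of_scope"),
}


def policy_for(variant_class: str) -> ClassPolicy:
    """Look up the policy for a variant class, failing loudly on unknown input."""
    try:
        return PREFILTER_POLICY[variant_class]
    except KeyError:
        raise ValueError(f"Unknown variant class: {variant_class!r}") from None
```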

---

## AF Handling

gnomAD allele frequencies are classified into four states:

- `AF_NUMERIC` — numeric value, well-covered region
- `AF_ZERO` — confirmed absent, adequate coverage → satisfies rarity
- `AF_UNCERTAIN` — AF=0 but coverage insufficient → does NOT satisfy rarity
- `AF_UNKNOWN` — variant absent from gnomAD, no coverage data → does NOT satisfy rarity
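
A minimal sketch of the four-state classification follows. It is not the code in `af_handler.py`. The inputs are assumptions: `global_af` is `None` when gnomAD has no record of the variant, and `coverage_ok` is `None` when no coverage data exists. The list above does not say how a non-zero AF from a poorly covered region is treated, so the sketch returns `AF_NUMERIC` for any non-zero value. The rarity cut-off of 0.001 is the Layer 3 threshold from this README.

```python
from enum import Enum
from typing import Optional

RARITY_MAX_AF = 0.001  # Layer 3 activation requires AF < 0.001


class AFState(Enum):
    AF_NUMERIC = "AF_NUMERIC"
    AF_ZERO = "AF_ZERO"
    AF_UNCERTAIN = "AF_UNCERTAIN"
    AF_UNKNOWN = "AF_UNKNOWN"


def classify_af(global_af: Optional[float], coverage_ok: Optional[bool]) -> AFState:
    """Assign one of the four AF states from a lookup result."""
    if global_af is None and coverage_ok is None:
        return AFState.AF_UNKNOWN
    if global_af is None or global_af == 0:
        # Absent or zero: only adequate coverage confirms true absence.
        return AFState.AF_ZERO if coverage_ok else AFState.AF_UNCERTAIN
    return AFState.AF_NUMERIC


def satisfies_rarity(state: AFState, global_af: Optional[float]) -> bool:
    """Only AF_ZERO, or a numeric AF below the cut-off, counts as rare."""
    if state is AFState.AF_ZERO:
        return True
    if state is AFState.AF_NUMERIC and global_af is not None:
        return global_af < RARITY_MAX_AF
    return False
```

Treating missing data as "not rare" is the conservative choice: an unobserved variant in an uncovered region cannot activate the protein layer on rarity alone.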

Founder variant detection: if any subpopulation AF is both >10× the global AF and >0.005,
a stratification warning is raised. In that case the global rarity threshold is NOT applied.
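
The founder check is two comparisons per subpopulation, sketched below. The function name and the population labels in the usage example are illustrative, and the input is assumed to be a mapping from population label to AF.

```python
FOUNDER_FOLD_ENRICHMENT = 10   # subpopulation AF must exceed 10x the global AF
FOUNDER_MIN_SUBPOP_AF = 0.005  # and must itself exceed 0.005


def founder_populations(global_af: float, subpop_afs: dict[str, float]) -> list[str]:
    """Return the subpopulations that trigger a stratification warning."""
    return sorted(
        pop
        for pop, af in subpop_afs.items()
        if af > FOUNDER_FOLD_ENRICHMENT * global_af and af > FOUNDER_MIN_SUBPOP_AF
    )
```

For example, `founder_populations(0.0004, {"asj": 0.008, "nfe": 0.0003})` returns `["asj"]`: 0.008 is more than ten times 0.0004 and also above 0.005. Both conditions must hold, so an enriched subpopulation at AF 0.004 would not be flagged.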

---

## Thresholds (Frozen, Versioned)

| Parameter | Threshold | Version |
|---|---|---|
| RNA High (dominant) | splice_prob ≥ 0.90 | 2024-01 |
| RNA Active | splice_prob ≥ 0.80 AND signal ≥ 0.65 | 2024-01 |
| Context Active | activation_norm ≥ 0.70 | 2024-01 |
| Protein Active | biochemical_risk ≥ 0.60 AND AF < 0.001 | 2024-01 |
| High AF conflict | AF > 0.01 | 2024-01 |
| Boundary flag | ±0.05 of any threshold | 2024-01 |
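
One way to keep these values frozen in code is an immutable dataclass, sketched below together with the boundary flag. The class and field names are hypothetical and may differ from `config.py`. Whether the ±0.05 band is inclusive at its edges is not stated above, so the sketch assumes it is. `near_boundary` is meant for activation values, not allele frequencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Threshold values from the table above; frozen so they cannot be reassigned."""
    version: str = "2024-01"
    rna_high_prob: float = 0.90
    rna_active_prob: float = 0.80
    rna_active_signal: float = 0.65
    context_active_norm: float = 0.70
    protein_active_risk: float = 0.60
    protein_max_af: float = 0.001
    high_af_conflict: float = 0.01
    boundary_margin: float = 0.05


THRESHOLDS = Thresholds()


def near_boundary(value: float, threshold: float,
                  margin: float = THRESHOLDS.boundary_margin) -> bool:
    """Flag an activation value that lies within the margin of a threshold."""
    return abs(value - threshold) <= margin
```

Attempting `THRESHOLDS.rna_high_prob = 0.85` raises `dataclasses.FrozenInstanceError`, so a threshold cannot drift silently at runtime.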

---

## Known Limitations

1. **Tissue specificity** — models trained on general cell line data; tissue-specific splice effects not captured
2. **Compound heterozygosity** — single-variant assessment only; trans effects not evaluated
3. **UTR/regulatory variants** — categorised but no mechanism pathway in v1.1
4. **MNV (multi-nucleotide variants)** — flagged as out-of-scope; component SNVs should be assessed individually
5. **Penetrance/expressivity** — not modelled
6. **gnomAD versioning** — results pinned to gnomAD v4.0; re-query if using other releases

---

## Output Structure

```json
{
  "peve_version": "1.1.0",
  "threshold_version": "2024-01",
  "dominant_mechanism": "RNA_Splicing | Protein_Biochemical | Sequence_Context | ...",
  "final_classification": "Pathogenic — RNA Splice Mechanism",
  "activation_levels": { ... },
  "layer_outputs": { "RNA": {}, "context": {}, "protein": {} },
  "af": { "state": "AF_NUMERIC", "global_af": 0.00002, ... },
  "conflict_report": { "major_conflicts": [], "minor_conflicts": [], ... },
  "reasoning_steps": ["RULE 1: RNA mechanism is HIGH ..."],
  "prefilter_flags": []
}
```
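
A consumer of this output might first check that the documented top-level keys are present. The helper below is a hypothetical sketch. It checks key presence only, because the nested fields are elided in the example above.

```python
# Top-level keys taken from the example output in this README.
REQUIRED_KEYS = frozenset({
    "peve_version",
    "threshold_version",
    "dominant_mechanism",
    "final_classification",
    "activation_levels",
    "layer_outputs",
    "af",
    "conflict_report",
    "reasoning_steps",
    "prefilter_flags",
})


def missing_keys(result: dict) -> list[str]:
    """Return the documented top-level keys absent from a result, sorted."""
    return sorted(REQUIRED_KEYS - set(result))
```

An empty list from `missing_keys(result)` means every documented top-level key is present. It says nothing about the validity of the values.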

---

## Example Test Variants

| Variant | Expected Mechanism | Notes |
|---|---|---|
| chr17:43092176 G>T | RNA_Splicing | BRCA1 splice donor region |
| chr17:7675088 C>T | Protein_Biochemical | TP53 R175H missense |
| chr1:69270 A>G | Insufficient_Evidence | Common benign synonymous |

---

## Repository Structure

```
app.py                    — Gradio UI + pipeline orchestration
config.py                 — Frozen thresholds and version constants
prefilter.py              — Variant class categorisation
af_handler.py             — gnomAD AF retrieval and null handling
model_loader.py           — HF Hub model loading with fallback
decision_engine.py        — Deterministic synthesis engine + narrative generator
explainability_renderer.py — All visualisations (matplotlib)
requirements.txt
README.md
```

---

> ⚠ **Research tool only.** PeVe v1.1 has not been validated for clinical diagnostic use.
> All outputs must be interpreted by qualified professionals in full clinical context.
> Results are deterministic but bounded by underlying model calibration quality.