# Methodology Report — OpenADMET PXR Induction Blind Challenge

**Team:** BioInfo (RyeCatcher)  
**Track:** Activity Prediction  
**Best Submission:** v43 (LB RAE = 0.586, rank ~40/211 as of 2026-05-10)  
**Contact:** justin@rundatarun.io  
**Code, models, OOF predictions, and submission:** https://huggingface.co/RyeCatcher/openadmet-pxr-challenge-2026  
**Report version:** 1.1 (2026-05-19)

---

## 1. Overview

Our approach is an ensemble built on three core model families: (1) a Chemprop D-MPNN multitask model pretrained on external ADMET data, (2) a gradient-boosted decision tree ensemble on rich molecular descriptors, and (3) an AutoGluon-tabular model on CheMeleon foundation-model embeddings. Final predictions come from a weighted blend with isotonic calibration.

---

## 2. Data Used

### 2.1 OpenADMET Provided Data

We used all provided data sources:

| Source | Size | Usage |
|---|---|---|
| DRC pEC50 (train) | 4,139 compounds | Primary regression target |
| Counter-screen pEC50 | 2,647 compounds (subset of train) | Multitask Head 2 |
| Single-concentration log2fc | 21,003 rows (10,870 unique compounds) | Exploratory multitask (not in final ensemble) |
| Test (blinded) | 513 compounds | Prediction target |

### 2.2 External Data

We incorporated the following public datasets for pretraining and auxiliary multitask learning:

| Dataset | Source | Size | Usage |
|---|---|---|---|
| NCATS qHTS PXR (AID 1346982) | PubChem | 9,671 compounds | Binary activity classification head |
| NCATS qHTS LogAC50 (AID 1346985) | PubChem | 2,458 compounds | Regression head |
| Tox21 SR-ARE | MoleculeNet/DeepChem | ~8,000 compounds | Binary classification head (SR-PXR was unavailable in this release) |
| ChEMBL NR1I2 (PXR) | ChEMBL | 907 compounds with pchembl_value | Multitask Head 5 in T1v5 |
| BindingDB PXR (UniProt O75469) | BindingDB REST API | 364 novel compounds | Exploratory pretrain data |

**No proprietary data was used.** All external data was deduplicated against the OpenADMET train and test sets using canonical isomeric SMILES and InChIKey.

---

## 3. Feature Engineering

### 3.1 Molecular Fingerprints

- **FCFP4-count-1024**: RDKit Morgan generator (`rdFingerprintGenerator.GetMorganGenerator`) with feature atom invariants (`GetMorganFeatureAtomInvGen`), radius=2, fpSize=1024, count fingerprint. This outperformed binary ECFP4 and MACCS in our coverage analysis (90.6% test-to-train coverage at Tanimoto ≥ 0.4).
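
The coverage figure above can be illustrated with a minimal sketch. It assumes that coverage means the fraction of test compounds whose nearest train neighbor reaches the similarity threshold, and that Tanimoto on count fingerprints is computed as sum(min) / sum(max). Neither detail is stated in this report. The toy `Counter` fingerprints stand in for RDKit count vectors and are purely hypothetical.

```python
from collections import Counter

def count_tanimoto(a: Counter, b: Counter) -> float:
    """Tanimoto generalized to count vectors: sum(min) / sum(max)."""
    keys = set(a) | set(b)
    num = sum(min(a[k], b[k]) for k in keys)
    den = sum(max(a[k], b[k]) for k in keys)
    return num / den if den else 0.0

def coverage(test_fps, train_fps, threshold=0.4):
    """Fraction of test fingerprints whose best train match is >= threshold."""
    hits = 0
    for t in test_fps:
        best = max(count_tanimoto(t, tr) for tr in train_fps)
        if best >= threshold:
            hits += 1
    return hits / len(test_fps)

# Hypothetical toy data: keys mimic hashed feature ids, values mimic counts.
train = [Counter({1: 2, 5: 1, 9: 3}), Counter({2: 1, 5: 2})]
test = [Counter({1: 2, 5: 1, 9: 1}), Counter({7: 4})]
print(coverage(test, train))
```

In this toy case the first test fingerprint has a train neighbor at 4/6 similarity and the second has none, so half of the test set is covered.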

### 3.2 Molecular Descriptors

- **RDKit descriptors**: 217 descriptors from `useful_rdkit_utils.get_rdkit_desc_names()` including physicochemical properties, topological indices, and fragment counts.
- **Mordred**: 1,613 2D descriptors computed with `mordredcommunity`.
- **CheMeleon embeddings**: 2,048-dimension frozen embeddings from the CheMeleon foundation model (Recursion Pharma), extracted via the OpenADMET toolkit.

### 3.3 Learned Embeddings

- **Chemprop D-MPNN fingerprints**: 300-dimension aggregation-layer embeddings extracted from the pretrained multitask model.

---

## 4. Models

### 4.1 Model Family A: Chemprop D-MPNN Multitask (T1)

A directed message-passing neural network (D-MPNN) implemented in Chemprop v2 with multitask pretraining:

- **Architecture**: BondMessagePassing (d_h=300, depth=3) → MeanAggregation → RegressionFFN (4 tasks, 3 hidden layers, dropout=0.1)
- **Heads**:
  1. pEC50 regression (OpenADMET DRC)
  2. NCATS qHTS binary active/inactive
  3. NCATS qHTS LogAC50 regression
  4. Tox21 SR-ARE binary classification
- **Pretraining**: 30 epochs on combined dataset (train + external) with equal head weights
- **Finetuning**: 50 epochs with Head 1 weighted 4× on OpenADMET train only
- **Loss**: MSE (binary tasks treated as regression to 0/1 targets)
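
The head weighting described above can be sketched in plain Python. This is an illustration only, not the Chemprop training loop: it assumes that compounds without a label for a given head are masked out of that head's MSE, and that the weighted head losses are normalized by the sum of the weights. Both are assumptions not confirmed by this report, and the function name and data are hypothetical.

```python
import math

def multitask_mse(preds, targets, head_weights):
    """Weighted mean of per-head MSEs; missing labels (None/NaN) are skipped."""
    total, weight_sum = 0.0, 0.0
    for h, w in enumerate(head_weights):
        errs = [
            (p[h] - t[h]) ** 2
            for p, t in zip(preds, targets)
            if t[h] is not None and not math.isnan(t[h])
        ]
        if not errs:
            continue  # no labeled compounds for this head in the batch
        total += w * sum(errs) / len(errs)
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0

# Two compounds, four heads (pEC50, qHTS active, LogAC50, SR-ARE); toy values.
preds = [[5.0, 0.8, -5.0, 0.1], [4.0, 0.2, -6.0, 0.9]]
targets = [[5.5, 1.0, None, 0.0], [4.0, None, -6.5, 1.0]]

pretrain_loss = multitask_mse(preds, targets, [1.0, 1.0, 1.0, 1.0])
finetune_loss = multitask_mse(preds, targets, [4.0, 1.0, 1.0, 1.0])
print(pretrain_loss, finetune_loss)
```

Binary heads enter the same squared-error term with 0/1 targets, matching the loss description in the list above.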

### 4.2 Model Family B: LightGBM Kitchen-Sink (v4)

- **Features**: 4,902-dimension vector = RDKit 217 + FCFP4-count-1024 + Mordred 1,613 + CheMeleon 2,048
- **Model**: LightGBM with MSE loss, 5-fold CV, early stopping
- **Hyperparameters**: the full-train model uses the median of the per-fold best iterations (from early stopping) as its number of boosting rounds
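
A minimal sketch of the median best-iteration heuristic, assuming it means taking the median early-stopping iteration across the five folds as the round count for the full-train fit. The fold values and the function name are hypothetical.

```python
from statistics import median

def full_train_rounds(fold_best_iterations):
    """Number of boosting rounds for the full-train model."""
    return int(round(median(fold_best_iterations)))

# Hypothetical early-stopping results from a 5-fold run.
fold_best = [412, 388, 455, 401, 430]
n_rounds = full_train_rounds(fold_best)
print(n_rounds)
```

The median is less sensitive than the mean to a single fold that stops unusually early or late.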

### 4.3 Model Family C: AutoGluon on CheMeleon (T2)

- **Features**: CheMeleon 2,048-dimension frozen embeddings
- **Model**: AutoGluon TabularPredictor with `best_quality` preset
- **Runtime**: ~3.5 hours for 5-fold OOF generation

### 4.4 Model Family D: Chemprop + ChEMBL Multitask (T1v5)

Extends T1 with a fifth head trained on ChEMBL NR1I2 (PXR) `pchembl_value` labels (907 compounds). The architecture is otherwise the same as T1.

---

## 5. Ensemble Strategy

Our final submission (v43) is a **cascade-weighted blend** built iteratively:

```
v7   = 0.50 × T1 + 0.38 × v4 + 0.12 × v3  (isotonic calibrated)
v22  = 0.975 × v7 + 0.025 × KERMT
v26  = 0.90 × v22 + 0.10 × CheMeleon-FT
S6   = 0.90 × v26 + 0.10 × TabPFN
v28  = S6 + T1v2 (CYP3A4 multitask) at w=0.12
v29  = v28 + T16 (cliff-weighted Chemprop) at w=0.10
v31  = 0.83 × v29 + 0.17 × T2 (AutoGluon-CheMeleon)
v43  = 0.78 × v31 + 0.22 × T1v5 (ChEMBL multitask)
```

All weights were chosen by grid search to minimize honest OOF RAE. A blend step was kept only if it lowered honest OOF RAE by at least 0.003 relative to the previous step.
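
One step of the cascade can be sketched as a one-dimensional grid search with an acceptance gate. This is a simplified illustration: it omits the per-fold isotonic calibration used for the honest OOF score, and the grid step of 0.01, the function names, and the toy data are assumptions.

```python
def rae(y_true, y_pred):
    """Relative absolute error: sum|y - p| / sum|y - mean(y)|."""
    mean_y = sum(y_true) / len(y_true)
    num = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    den = sum(abs(t - mean_y) for t in y_true)
    return num / den

def best_blend_weight(y_true, base, candidate, step=0.01, min_delta=0.003):
    """Return (weight, rae) for the candidate, or (0.0, base_rae) if the gate fails."""
    base_rae = rae(y_true, base)
    best_w, best_rae = 0.0, base_rae
    for i in range(1, int(round(1.0 / step)) + 1):
        w = i * step
        blend = [(1 - w) * b + w * c for b, c in zip(base, candidate)]
        score = rae(y_true, blend)
        if score < best_rae:
            best_w, best_rae = w, score
    if base_rae - best_rae < min_delta:
        return 0.0, base_rae  # improvement too small: keep the previous blend
    return best_w, best_rae

# Toy OOF vectors: the base over-predicts and the candidate under-predicts.
y_true = [4.0, 5.0, 6.0, 7.0]
base = [4.5, 5.5, 6.5, 7.5]
helpful = [3.5, 4.5, 5.5, 6.5]
harmful = [5.0, 6.0, 7.0, 8.0]

print(best_blend_weight(y_true, base, helpful))
print(best_blend_weight(y_true, base, harmful))
```

The helpful candidate is accepted at a weight near 0.5 because its errors cancel those of the base, while the harmful candidate never beats the base and is rejected by the gate.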

**Calibration**: IsotonicRegression (out_of_bounds='clip') applied to blend outputs. For honest CV evaluation, per-fold IsotonicRegression is fitted on each fold's OOF separately to avoid leakage.
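
To make the calibration step concrete, here is a dependency-free sketch of isotonic regression (pool-adjacent-violators) with clipping outside the fitted range, mirroring the behavior of `out_of_bounds='clip'`. It is not the scikit-learn implementation, and it assumes distinct input values. The `calibrate` helper and its `fit_idx` / `apply_idx` arguments are hypothetical: they let the caller choose which rows fit the map and which rows it is applied to, since this report does not spell out the exact row assignment.

```python
from bisect import bisect_right

def fit_isotonic(x, y):
    """Pool-adjacent-violators; returns knots (xs, ys) of a non-decreasing map."""
    blocks = []  # each block: [sum_y, count, x_min, x_max]
    for xi, yi in sorted(zip(x, y)):
        blocks.append([yi, 1, xi, xi])
        # Merge backwards while adjacent block means decrease.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c, _, hi = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
            blocks[-1][3] = hi
    xs, ys = [], []
    for s, c, lo, hi in blocks:
        for knot in ([lo] if lo == hi else [lo, hi]):
            xs.append(knot)
            ys.append(s / c)
    return xs, ys

def predict_isotonic(model, x_new):
    """Linear interpolation between knots, clipped outside the fitted range."""
    xs, ys = model
    out = []
    for v in x_new:
        if v <= xs[0]:
            out.append(ys[0])
        elif v >= xs[-1]:
            out.append(ys[-1])
        else:
            j = bisect_right(xs, v)
            x0, x1, y0, y1 = xs[j - 1], xs[j], ys[j - 1], ys[j]
            out.append(y0 + (y1 - y0) * (v - x0) / (x1 - x0))
    return out

def calibrate(preds, y, fit_idx, apply_idx):
    """Fit on rows fit_idx, then map the predictions of rows apply_idx."""
    model = fit_isotonic([preds[i] for i in fit_idx], [y[i] for i in fit_idx])
    return predict_isotonic(model, [preds[i] for i in apply_idx])

model = fit_isotonic([1, 2, 3, 4], [1, 3, 2, 4])
print(predict_isotonic(model, [0, 2.5, 3.5, 10]))
```

Passing every row as both `fit_idx` and `apply_idx` corresponds to an in-sample fit; a per-fold protocol would call `calibrate` once per fold with that fold's index sets.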

---

## 6. Validation Strategy

### 6.1 Cross-Validation

- **Scheme**: Butina clustering at Tanimoto cutoff 0.4 on ECFP4 2048-bit fingerprints, followed by GroupKFold (5 folds)
- **Rationale**: Groups chemically similar compounds together to simulate the challenge's analog-set test construction
- **Primary metric**: RAE (Relative Absolute Error) = Σ|y_true - y_pred| / Σ|y_true - mean(y_true)|
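
The metric and the grouped split can be sketched without external libraries. The `rae` function follows the formula above. The fold assignment is a greedy size-balancing scheme, similar in spirit to a group k-fold but not guaranteed to reproduce any library's exact split. The cluster ids are assumed to come from an upstream Butina step, and all names and data here are hypothetical.

```python
from collections import Counter

def rae(y_true, y_pred):
    """Relative absolute error: sum|y - p| / sum|y - mean(y)|."""
    mean_y = sum(y_true) / len(y_true)
    num = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    den = sum(abs(t - mean_y) for t in y_true)
    return num / den

def assign_group_folds(cluster_ids, n_splits=5):
    """Give every cluster a single fold, balancing fold sizes greedily."""
    sizes = Counter(cluster_ids)
    fold_load = [0] * n_splits
    cluster_to_fold = {}
    # Largest clusters first, each placed in the currently lightest fold.
    for cid, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        fold = fold_load.index(min(fold_load))
        cluster_to_fold[cid] = fold
        fold_load[fold] += size
    return [cluster_to_fold[c] for c in cluster_ids]

# Ten toy compounds in six clusters, split into three folds.
cluster_ids = [0, 0, 0, 1, 1, 2, 3, 3, 4, 5]
folds = assign_group_folds(cluster_ids, n_splits=3)
print(folds)

# A constant mean predictor scores exactly 1.0 by construction.
y = [4.0, 5.0, 6.0]
print(rae(y, [5.0, 5.0, 5.0]))
```

An RAE below 1.0 therefore means the model beats the mean-of-labels baseline, and no cluster is ever split between the training and validation side of a fold.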

### 6.2 Honest Calibration Protocol

We distinguish two calibration protocols:
- **In-sample isotonic**: Single IsotonicRegression fit on full-train OOF (used for test submission; slightly optimistic)
- **Honest per-fold isotonic**: Separate IsotonicRegression fit per CV fold on that fold's OOF only (used for candidate gating; conservative)

All candidates must pass the honest per-fold isotonic gate (an improvement of at least 0.003 RAE over the v43 honest OOF RAE of 0.4798) before being queued for submission.

### 6.3 Statistical Rigor

- **Cluster-bootstrap 95% CI**: Bootstrapped at the Butina-cluster level (not compound level) to account for chemical similarity structure
- **Bootstrap iterations**: 1,000 with replacement
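
A minimal sketch of a cluster-level bootstrap for RAE, assuming a percentile interval and resampling of whole clusters with replacement (a cluster drawn twice contributes its compounds twice). The percentile method, the seed handling, and the toy data are assumptions, not details taken from this report.

```python
import random

def rae(y_true, y_pred):
    """Relative absolute error; None when all targets are identical."""
    mean_y = sum(y_true) / len(y_true)
    den = sum(abs(t - mean_y) for t in y_true)
    if den == 0:
        return None
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / den

def cluster_bootstrap_ci(y_true, y_pred, cluster_ids, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for RAE, resampling clusters rather than compounds."""
    rng = random.Random(seed)
    members = {}
    for i, c in enumerate(cluster_ids):
        members.setdefault(c, []).append(i)
    clusters = list(members)
    stats = []
    for _ in range(n_boot):
        sampled = [rng.choice(clusters) for _ in clusters]
        idx = [i for c in sampled for i in members[c]]
        score = rae([y_true[i] for i in idx], [y_pred[i] for i in idx])
        if score is not None:  # skip degenerate resamples
            stats.append(score)
    stats.sort()
    lo = stats[int((alpha / 2) * (len(stats) - 1))]
    hi = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lo, hi

# Toy data: eight compounds in five clusters.
y_true = [4.0, 4.2, 5.1, 5.3, 6.0, 6.4, 4.8, 5.9]
y_pred = [4.3, 4.1, 5.0, 5.6, 5.7, 6.0, 5.0, 5.5]
clusters = [0, 0, 1, 1, 2, 2, 3, 4]
lo, hi = cluster_bootstrap_ci(y_true, y_pred, clusters)
print(lo, hi)
```

Resampling at the cluster level keeps near-duplicate analogs together, so the interval reflects the number of independent chemotypes rather than the raw compound count.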

---

## 7. Submission History & Performance

| Tag | Date | LB RAE | LB Rank | Key Change |
|---|---|---|---|---|
| v2-baseline-xgb | 2026-05-06 | 0.7412 | 145/199 | XGBoost MAE on FCFP4 + RDKit |
| v3-baseline-lgbm | 2026-05-06 | 0.7249 | 131/199 | LightGBM MSE, Butina CV |
| v4-kitchen-sink | 2026-05-07 | 0.6889 | 112/201 | +Mordred +CheMeleon (4902d) |
| v7-ensemble | 2026-05-07 | 0.6039 | 42/202 | +Chemprop D-MPNN multitask |
| v31-blend | 2026-05-07 | 0.5966 | 42/207 | +AutoGluon-CheMeleon + KERMT + CheMeleon-FT |
| v43-final | 2026-05-08 | 0.586 | ~40/211 | +T1v5 ChEMBL multitask, clean hierarchy |
| v43-defensive | 2026-05-11 | 0.586 | ~40/211 | Defensive re-submit; no improved candidate found |

**CV-to-LB shift trend**: 0.170 → 0.158 → 0.134 → 0.107. The shift narrows as model quality improves.

**Final rank as of 2026-05-11**: 40-42 / 211 (top 19%). Gap to top-25: ~0.013 RAE. Gap to #1 (Yan): ~0.090 RAE.

---

## 8. Exploratory Work (Not in Final Ensemble)

We tested but did not include the following approaches due to honest-CV gate failure:

| Approach | Honest OOF RAE | Reason for Exclusion |
|---|---|---|
| Single-conc multitask (T1v7) | 0.6076 | Diluted primary task; log2fc scale mismatch |
| BindingDB + ChEMBL broad pretrain (T1v9) | 0.5538 | Distribution mismatch; no improvement over T1 |
| MaskMol / MAE ViT-Base | 0.7122 | Too weak solo; zero blend utility |
| TabPFN on CheMeleon | 0.5551 | Spearman vs v43 = 0.931 (too correlated) |
| GIN + ACtriplet (T11) | 0.5839 | Spearman vs v43 = 0.890 (too correlated) |
| Uni-MolV2 310M fine-tune | 0.630 | Poor convergence; high correlation |
| Boltz-2 structural confidence | 0.845 | Confidence scores ≠ binding affinity |
| Differentiable ensemble (14 OOFs) | 0.4966 | Worse than v43 alone; correlated errors |
| ADMET-AI features + LGBM | 0.6206 | Too weak solo; minimal blend utility |
| Tail-weighted LGBM (α=5) | 0.5619 | Marginal improvement on weak baseline |
| Precision-weighted LGBM | 0.5534 | Best non-v43 LGBM; still far from ceiling |
| KERMT higher-weight blend | 0.4939 | Spearman vs v43 = 0.910; no improvement |
| Artifact removal (pEC50 < 2) | 0.5745 | Removing "artifacts" hurt generalization |
| Test-time SMILES augmentation | 0.0376 MAE | No benefit over canonical SMILES |
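
Several exclusions above rest on the Spearman correlation between a candidate's predictions and those of v43. A dependency-free sketch of that check follows, using average ranks for ties. The function names and the toy vectors are hypothetical, and the sketch is not taken from the repository.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Pearson correlation computed on the ranks of a and b."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

# Toy prediction vectors for a reference blend and a candidate model.
reference = [4.1, 4.8, 5.2, 5.9, 6.3]
candidate = [4.0, 5.0, 4.9, 6.1, 6.0]
rho = spearman(reference, candidate)
print(rho)
```

A candidate whose ranking nearly matches the reference has little to add to a blend, which is the reasoning behind the "too correlated" entries in the table.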

---

## 9. Key Learnings

1. **Chemprop D-MPNN is the dominant diversity contributor.** All attempts to replace or augment it with transformer- or GNN-based models (ChemBERTa, GIN, Uni-Mol) converged to Spearman > 0.88 vs the D-MPNN representation, indicating equivalent 2D-graph learning on this dataset.

2. **External data quality > quantity.** ChEMBL broad (2,122 compounds) and BindingDB (364 compounds) did not improve pretrain quality because their pActivity distributions (median ~5.0–6.6) differed from OpenADMET (median ~4.3). Distribution mismatch outweighed volume.

3. **Honest per-fold calibration is essential.** In-sample isotonic calibration overestimates improvement by ~0.01 RAE. Our v50 catastrophic regression (CV 0.454 → LB 0.658) was caught only by the honest-CV gate.

4. **Tail compression is the structural bottleneck.** Our predictions max out at ~6.3, while the train maximum is 7.55. Only 11 train compounds (0.3%) have pEC50 ≥ 6.5. No loss-function tweak or ensemble method can extrapolate from 11 examples. Closing the gap to top-25 (Δ = 0.013 LB RAE) would require proprietary data, 3D structural features, or a genuinely new architecture, each of which needs days of setup beyond our time budget.

5. **The ceiling is real.** After 35+ experiments (27 in headless loops + 8 in interactive sessions), every model class converged to Spearman > 0.87 vs v43. The 2D-graph + descriptor space is fully explored. v43 is the best achievable with public data and current infrastructure.

---

## 10. Code and Reproducibility

All code, environment specifications, and trained model checkpoints are published at https://huggingface.co/RyeCatcher/openadmet-pxr-challenge-2026 under Apache 2.0. The repository includes:

- `code/baseline/` — LGBM/XGBoost baselines (v2, v3, v4)
- `code/featurization/` — Mordred, CheMeleon, FCFP4 wrappers
- `code/multitask/T1_chemprop_external_pretrain.py` — Chemprop multitask training
- `code/multitask/T1v5_chemprop_chembl_nr1i2.py` — 5-head Chemprop with ChEMBL NR1I2
- `code/multitask/T2_autogluon_chemeleon.py` — AutoGluon-tabular on CheMeleon
- `code/ensemble/v43_final_blend.py` — Final blend reproducing the submission
- `code/submit/submit_v43.py` — Gradio API submission script
- `models/` — Trained Chemprop and AutoGluon checkpoints (v43 lineage)
- `data/oof_predictions/` — Per-track out-of-fold and test predictions
- `submission/v43_final.parquet` — The 513-row submission
- `code/requirements.txt` — Pinned dependency list

**Dependencies**: Python 3.12, RDKit 2026.03.1, Chemprop 2.2.3, PyTorch 2.11.0+cu130, LightGBM 4.6.0, XGBoost 2.1.4, scikit-learn 1.8.0, pandas 3.0.2

---

## 11. Hardware

Training was performed on an NVIDIA DGX Spark (GB10 Blackwell, 128 GB unified memory). Typical wall-clock times:
- Chemprop pretrain (30 epochs): ~3 min
- Chemprop 5-fold finetune: ~8 min
- LightGBM kitchen-sink: ~15 sec
- AutoGluon-CheMeleon: ~3.5 hours

---

*Report finalized 2026-05-11. Best submission: v43 (RAE 0.586 LB, rank ~40/211). Phase 1 deadline: 2026-05-25.*