# Methodology Report: OpenADMET PXR Induction Blind Challenge
**Team:** BioInfo (RyeCatcher)
**Track:** Activity Prediction
**Best Submission:** v43 (LB RAE = 0.586, rank ~40/211 as of 2026-05-10)
**Contact:** justin@rundatarun.io
**Code, models, OOF predictions, and submission:** https://huggingface.co/RyeCatcher/openadmet-pxr-challenge-2026
**Report version:** 1.1 (2026-05-19)
---
## 1. Overview
Our approach is an ensemble built on three core model families: (1) a Chemprop D-MPNN multitask model pretrained on external ADMET data, (2) a gradient-boosted decision tree ensemble on rich molecular descriptors, and (3) an AutoGluon-tabular model on CheMeleon foundation-model embeddings. Final predictions are produced via a cascade-weighted blend with isotonic calibration; the blend also includes several lower-weight components listed in Section 5.
---
## 2. Data Used
### 2.1 OpenADMET Provided Data
We used all provided data sources:
| Source | Size | Usage |
|---|---|---|
| DRC pEC50 (train) | 4,139 compounds | Primary regression target |
| Counter-screen pEC50 | 2,647 compounds (subset of train) | Multitask Head 2 |
| Single-concentration log2fc | 21,003 rows (10,870 unique compounds) | Exploratory multitask (not in final ensemble) |
| Test (blinded) | 513 compounds | Prediction target |
### 2.2 External Data
We incorporated the following public datasets for pretraining and auxiliary multitask learning:
| Dataset | Source | Size | Usage |
|---|---|---|---|
| NCATS qHTS PXR (AID 1346982) | PubChem | 9,671 compounds | Binary activity classification head |
| NCATS qHTS LogAC50 (AID 1346985) | PubChem | 2,458 compounds | Regression head |
| Tox21 SR-ARE | MoleculeNet/DeepChem | ~8,000 compounds | Binary classification head (SR-PXR was unavailable in this release) |
| ChEMBL NR1I2 (PXR) | ChEMBL | 907 compounds with pchembl_value | Multitask Head 5 in T1v5 |
| BindingDB PXR (UniProt O75469) | BindingDB REST API | 364 novel compounds | Exploratory pretrain data |
**No proprietary data was used.** All external data was deduplicated against the OpenADMET train and test sets using canonical isomeric SMILES and InChIKey.
---
## 3. Feature Engineering
### 3.1 Molecular Fingerprints
- **FCFP4-count-1024**: RDKit `MorganGenerator` with `MorganFeatureAtomInvGen`, radius=2, fpSize=1024, count fingerprint. This outperformed binary ECFP4 and MACCS in our coverage analysis (90.6% test-to-train Tanimoto ≥ 0.4 coverage).
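
The coverage figure above can be illustrated with a small, dependency-free sketch. It assumes that "coverage" means the fraction of test compounds whose nearest training neighbor reaches the similarity cutoff, and that similarity on count fingerprints uses the min/max generalization of Tanimoto; both are assumptions, and the toy fingerprints and helper names (`count_tanimoto`, `coverage`) are hypothetical rather than taken from the repository.

```python
from typing import Dict, List

# Sparse count fingerprint: bit index -> count (toy stand-in for FCFP4 counts).
CountFP = Dict[int, int]


def count_tanimoto(a: CountFP, b: CountFP) -> float:
    """Min/max (generalized) Tanimoto similarity between two count vectors."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return num / den if den else 0.0


def coverage(test: List[CountFP], train: List[CountFP], cutoff: float = 0.4) -> float:
    """Fraction of test fingerprints whose best train similarity is >= cutoff."""
    hits = sum(
        1
        for t in test
        if max(count_tanimoto(t, tr) for tr in train) >= cutoff
    )
    return hits / len(test)


# Hypothetical toy data: the first test compound has a close train analog,
# the second shares no features with any train compound.
train_fps = [{1: 2, 5: 1, 9: 3}, {2: 1, 5: 2}]
test_fps = [{1: 2, 5: 1, 9: 2}, {7: 4}]

cov = coverage(test_fps, train_fps)
print(f"coverage at 0.4: {cov:.1%}")
```

The pairwise loop is quadratic in the number of compounds, which is acceptable at the scale of this challenge (513 test vs 4,139 train) but would need a bulk similarity routine for larger sets.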
### 3.2 Molecular Descriptors
- **RDKit descriptors**: 217 descriptors from `useful_rdkit_utils.get_rdkit_desc_names()` including physicochemical properties, topological indices, and fragment counts.
- **Mordred**: 1,613 2D/3D descriptors computed with `mordredcommunity`.
- **CheMeleon embeddings**: 2,048-dimension frozen embeddings from the CheMeleon foundation model (Recursion Pharma), extracted via the OpenADMET toolkit.
### 3.3 Learned Embeddings
- **Chemprop D-MPNN fingerprints**: 300-dimension aggregation-layer embeddings extracted from the pretrained multitask model.
---
## 4. Models
### 4.1 Model Family A: Chemprop D-MPNN Multitask (T1)
A directed message-passing neural network (D-MPNN) implemented in Chemprop v2 with multitask pretraining:
- **Architecture**: BondMessagePassing (d_h=300, depth=3) → MeanAggregation → RegressionFFN (4 tasks, 3 hidden layers, dropout=0.1)
- **Heads**:
1. pEC50 regression (OpenADMET DRC)
2. NCATS qHTS binary active/inactive
3. NCATS qHTS LogAC50 regression
4. Tox21 SR-ARE binary classification
- **Pretraining**: 30 epochs on combined dataset (train + external) with equal head weights
- **Finetuning**: 50 epochs with Head 1 weighted 4× on OpenADMET train only
- **Loss**: MSE (binary tasks treated as regression to 0/1 targets)
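
The head weighting and MSE-on-binary-targets setup described above can be sketched in plain Python. This is a minimal illustration, not the Chemprop training code: the function name `multitask_mse`, the masking of missing labels with `None`, and the normalization by the sum of active head weights are all assumptions made for the example.

```python
from typing import Optional, Sequence


def multitask_mse(
    preds: Sequence[Sequence[float]],
    targets: Sequence[Sequence[Optional[float]]],
    head_weights: Sequence[float],
) -> float:
    """Weighted mean of per-head MSEs; targets set to None are skipped."""
    total, weight_sum = 0.0, 0.0
    for h, w in enumerate(head_weights):
        errs = [
            (p[h] - t[h]) ** 2
            for p, t in zip(preds, targets)
            if t[h] is not None
        ]
        if errs:  # a head with no labels in this batch contributes nothing
            total += w * sum(errs) / len(errs)
            weight_sum += w
    return total / weight_sum if weight_sum else 0.0


# Two heads: pEC50 regression (weighted 4x, as in finetuning) and a binary
# head regressed to 0/1. The second compound has no pEC50 label.
preds = [[5.0, 0.8], [4.0, 0.1]]
targets = [[5.5, 1.0], [None, 0.0]]
loss = multitask_mse(preds, targets, head_weights=[4.0, 1.0])
print(round(loss, 4))
```

Setting all head weights to 1.0 corresponds to the equal-weight pretraining stage; raising the first weight to 4.0 shifts the gradient budget toward the primary pEC50 task.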
### 4.2 Model Family B: LightGBM Kitchen-Sink (v4)
- **Features**: 4,902-dimension vector = RDKit 217 + FCFP4-count-1024 + Mordred 1,613 + CheMeleon 2,048
- **Model**: LightGBM with MSE loss, 5-fold CV, early stopping
- **Hyperparameters**: tuned via median fold-best-iteration heuristic for full-train model
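
As a rough illustration of the median fold-best-iteration heuristic: each CV fold's early-stopping round is recorded, and the full-train model is then fit for the median of those rounds, since no validation fold is left to stop on. The iteration counts and the helper name `full_train_rounds` below are hypothetical.

```python
import statistics
from typing import Sequence


def full_train_rounds(best_iterations: Sequence[int]) -> int:
    """Boosting rounds for the full-train refit: median of per-fold best rounds."""
    return int(statistics.median(best_iterations))


# Hypothetical early-stopping rounds from the five CV folds.
fold_best_iterations = [412, 388, 455, 397, 430]
n_rounds = full_train_rounds(fold_best_iterations)
print(n_rounds)
```

The median is less sensitive than the mean to a single fold that stops unusually early or late.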
### 4.3 Model Family C: AutoGluon on CheMeleon (T2)
- **Features**: CheMeleon 2,048-dimension frozen embeddings
- **Model**: AutoGluon TabularPredictor with `best_quality` preset
- **Runtime**: ~3.5 hours for 5-fold OOF generation
### 4.4 Model Family D: Chemprop + ChEMBL Multitask (T1v5)
T1v5 extends T1 with a fifth head for ChEMBL NR1I2 (PXR) pchembl_value (907 compounds). The architecture is otherwise the same as T1.
---
## 5. Ensemble Strategy
Our final submission (v43) is a **cascade-weighted blend** built iteratively:
```
v7 = 0.50 × T1 + 0.38 × v4 + 0.12 × v3 (isotonic calibrated)
v22 = 0.975 × v7 + 0.025 × KERMT
v26 = 0.90 × v22 + 0.10 × CheMeleon-FT
S6 = 0.90 × v26 + 0.10 × TabPFN
v28 = S6 + T1v2 (CYP3A4 multitask) at w=0.12
v29 = v28 + T16 (cliff-weighted Chemprop) at w=0.10
v31 = 0.83 × v29 + 0.17 × T2 (AutoGluon-CheMeleon)
v43 = 0.78 × v31 + 0.22 × T1v5 (ChEMBL multitask)
```
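
To see how much each base model contributes to v43, the cascade can be flattened into approximate effective weights. The sketch below makes two assumptions: the v28 and v29 steps are convex mixes, `(1 - w) × previous + w × new`, and the isotonic calibration step is ignored because it is nonlinear. The resulting numbers are therefore indicative only.

```python
from typing import Dict

# Each blend maps component -> weight, following the cascade listing.
# v28 and v29 are written as (1 - w, w) mixes, which is an assumption.
RECIPES: Dict[str, Dict[str, float]] = {
    "v43": {"v31": 0.78, "T1v5": 0.22},
    "v31": {"v29": 0.83, "T2": 0.17},
    "v29": {"v28": 0.90, "T16": 0.10},
    "v28": {"S6": 0.88, "T1v2": 0.12},
    "S6": {"v26": 0.90, "TabPFN": 0.10},
    "v26": {"v22": 0.90, "CheMeleon-FT": 0.10},
    "v22": {"v7": 0.975, "KERMT": 0.025},
    "v7": {"T1": 0.50, "v4": 0.38, "v3": 0.12},
}


def effective_weights(name: str, scale: float = 1.0) -> Dict[str, float]:
    """Recursively expand a blend into weights on its base (leaf) models."""
    if name not in RECIPES:
        return {name: scale}
    out: Dict[str, float] = {}
    for part, w in RECIPES[name].items():
        for leaf, leaf_w in effective_weights(part, scale * w).items():
            out[leaf] = out.get(leaf, 0.0) + leaf_w
    return out


weights = effective_weights("v43")
for leaf, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{leaf:13s} {w:.4f}")
```

Under these assumptions the later additions keep their stated weight (for example, T1v5 at 0.22 and T2 at 0.78 × 0.17 ≈ 0.133), while the earliest components are progressively diluted by every subsequent step.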
All weights were determined by grid search over the blend weight, keeping a step only if it improved honest OOF RAE by at least Δ = +0.003 vs the previous step.
**Calibration**: IsotonicRegression (out_of_bounds='clip') applied to blend outputs. For honest CV evaluation, per-fold IsotonicRegression is fitted on each fold's OOF separately to avoid leakage.
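
The gated weight search can be sketched as follows. This is a simplified, assumption-laden illustration: it blends two prediction vectors with a single weight on a 0.01 grid, scores with plain (uncalibrated) RAE, and accepts the candidate only if RAE drops by at least 0.003. The function names, grid step, and toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def rae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Relative Absolute Error: sum|y - yhat| / sum|y - mean(y)|."""
    mean = sum(y_true) / len(y_true)
    num = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    den = sum(abs(t - mean) for t in y_true)
    return num / den


def gated_blend_weight(
    y_true: Sequence[float],
    base: Sequence[float],
    candidate: Sequence[float],
    step: float = 0.01,
    min_gain: float = 0.003,
) -> Tuple[float, float]:
    """Grid-search w in (0, 1]; return (w, RAE), or (0.0, base RAE) if gated out."""
    base_rae = rae(y_true, base)
    best_w, best_rae = 0.0, base_rae
    for i in range(1, round(1 / step) + 1):
        w = i * step
        blend: List[float] = [(1 - w) * b + w * c for b, c in zip(base, candidate)]
        score = rae(y_true, blend)
        if score < best_rae:
            best_w, best_rae = w, score
    if base_rae - best_rae >= min_gain:  # lower RAE is better
        return best_w, best_rae
    return 0.0, base_rae


# Toy OOF data: base over-predicts by 0.4, candidate under-predicts by 0.4.
y = [4.0, 5.0, 6.0, 4.5, 5.5]
base_oof = [v + 0.4 for v in y]
cand_oof = [v - 0.4 for v in y]

w_accept, rae_accept = gated_blend_weight(y, base_oof, cand_oof)
w_reject, rae_reject = gated_blend_weight(y, base_oof, base_oof)
print(w_accept, round(rae_accept, 4), w_reject, round(rae_reject, 4))
```

In the toy case the two models have opposite errors, so an even blend cancels them; blending a model with a copy of itself yields no gain and is rejected by the gate.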
---
## 6. Validation Strategy
### 6.1 Cross-Validation
- **Scheme**: Butina clustering at Tanimoto cutoff 0.4 on ECFP4 2048-bit fingerprints, followed by GroupKFold (5 folds)
- **Rationale**: Groups chemically similar compounds together to simulate the challenge's analog-set test construction
- **Primary metric**: RAE (Relative Absolute Error) = Σ|y_true - y_pred| / Σ|y_true - mean(y_true)|
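
The grouping step of this scheme can be illustrated without any chemistry dependencies. The sketch below takes precomputed cluster IDs (hypothetical here; in the report they come from Butina clustering) and assigns whole clusters to folds greedily, largest first, always into the currently lightest fold. This mirrors the general idea behind grouped K-fold splitting and is not claimed to reproduce scikit-learn's `GroupKFold` assignments exactly.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def assign_group_folds(groups: Sequence[Hashable], n_folds: int = 5) -> List[int]:
    """Return one fold index per sample so that no group spans two folds."""
    sizes = Counter(groups)
    fold_load = [0] * n_folds
    group_to_fold: Dict[Hashable, int] = {}
    # Largest clusters first, each placed in the currently lightest fold.
    for group, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
        fold = fold_load.index(min(fold_load))
        group_to_fold[group] = fold
        fold_load[fold] += size
    return [group_to_fold[g] for g in groups]


# Hypothetical cluster labels for 13 compounds in 7 clusters.
cluster_ids = [0, 0, 0, 1, 1, 2, 3, 3, 4, 5, 5, 5, 6]
folds = assign_group_folds(cluster_ids, n_folds=3)
print(folds)
```

Because every member of a cluster lands in the same fold, a held-out fold never contains close analogs of its training compounds, which is the property the rationale above relies on.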
### 6.2 Honest Calibration Protocol
We distinguish two calibration protocols:
- **In-sample isotonic**: Single IsotonicRegression fit on full-train OOF (used for test submission; slightly optimistic)
- **Honest per-fold isotonic**: Separate IsotonicRegression fit per CV fold on that fold's OOF only (used for candidate gating; conservative)
All candidates must pass the honest per-fold iso gate (Δ ≥ +0.003 vs v43 honest OOF 0.4798) before queueing for submission.
### 6.3 Statistical Rigor
- **Cluster-bootstrap 95% CI**: Bootstrapped at the Butina-cluster level (not compound level) to account for chemical similarity structure
- **Bootstrap iterations**: 1,000 with replacement
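
A cluster-level bootstrap along these lines might look like the sketch below. It resamples cluster IDs with replacement, pools all compounds from the drawn clusters, and takes percentile bounds of the resulting RAE values. The percentile method, the fixed seed, and the toy data are assumptions for illustration; the report does not state which interval construction was used.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def cluster_bootstrap_rae_ci(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    clusters: Sequence[Hashable],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile CI for RAE, resampling whole clusters with replacement."""
    members: Dict[Hashable, List[int]] = defaultdict(list)
    for i, c in enumerate(clusters):
        members[c].append(i)
    ids = list(members)
    rng = random.Random(seed)
    stats: List[float] = []
    for _ in range(n_boot):
        drawn = rng.choices(ids, k=len(ids))
        idx = [i for c in drawn for i in members[c]]
        mean = sum(y_true[i] for i in idx) / len(idx)
        den = sum(abs(y_true[i] - mean) for i in idx)
        if den == 0:  # degenerate replicate: RAE undefined, skip it
            continue
        num = sum(abs(y_true[i] - y_pred[i]) for i in idx)
        stats.append(num / den)
    stats.sort()
    lo = stats[round((alpha / 2) * (len(stats) - 1))]
    hi = stats[round((1 - alpha / 2) * (len(stats) - 1))]
    return lo, hi


# Toy data: 12 compounds in 6 clusters of two.
y = [4.0, 4.4, 5.0, 5.3, 6.0, 6.2, 4.5, 4.9, 5.5, 5.1, 3.8, 4.2]
yhat = [4.2, 4.3, 5.4, 5.0, 5.6, 6.0, 4.6, 5.2, 5.3, 5.3, 4.1, 4.0]
cl = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]

ci = cluster_bootstrap_rae_ci(y, yhat, cl)
print(f"95% CI for RAE: [{ci[0]:.3f}, {ci[1]:.3f}]")
```

Resampling clusters rather than individual compounds keeps correlated analogs together, so the interval is typically wider, and more realistic, than a compound-level bootstrap would give.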
---
## 7. Submission History & Performance
| Tag | Date | LB RAE | LB Rank | Key Change |
|---|---|---|---|---|
| v2-baseline-xgb | 2026-05-06 | 0.7412 | 145/199 | XGBoost MAE on FCFP4 + RDKit |
| v3-baseline-lgbm | 2026-05-06 | 0.7249 | 131/199 | LightGBM MSE, Butina CV |
| v4-kitchen-sink | 2026-05-07 | 0.6889 | 112/201 | +Mordred +CheMeleon (4902d) |
| v7-ensemble | 2026-05-07 | 0.6039 | 42/202 | +Chemprop D-MPNN multitask |
| v31-blend | 2026-05-07 | 0.5966 | 42/207 | +AutoGluon-CheMeleon + KERMT + CheMeleon-FT |
| v43-final | 2026-05-08 | 0.586 | ~40/211 | +T1v5 ChEMBL multitask, clean hierarchy |
| v43-defensive | 2026-05-11 | 0.586 | ~40/211 | Defensive re-submit; no improved candidate found |
**CV-to-LB shift trend**: 0.170 → 0.158 → 0.134 → 0.107. The shift narrows as model quality improves.
**Final rank as of 2026-05-11**: 40-42 / 211 (top 19%). Gap to top-25: ~0.013 RAE. Gap to #1 (Yan): ~0.090 RAE.
---
## 8. Exploratory Work (Not in Final Ensemble)
We tested but did not include the following approaches due to honest-CV gate failure:
| Approach | Honest OOF RAE | Reason for Exclusion |
|---|---|---|
| Single-conc multitask (T1v7) | 0.6076 | Diluted primary task; log2fc scale mismatch |
| BindingDB + ChEMBL broad pretrain (T1v9) | 0.5538 | Distribution mismatch; no improvement over T1 |
| MaskMol / MAE ViT-Base | 0.7122 | Too weak solo; zero blend utility |
| TabPFN on CheMeleon | 0.5551 | Spearman vs v43 = 0.931 (too correlated) |
| GIN + ACtriplet (T11) | 0.5839 | Spearman vs v43 = 0.890 (too correlated) |
| Uni-MolV2 310M fine-tune | 0.630 | Poor convergence; high correlation |
| Boltz-2 structural confidence | 0.845 | Confidence scores ≠ binding affinity |
| Differentiable ensemble (14 OOFs) | 0.4966 | Worse than v43 alone; correlated errors |
| ADMET-AI features + LGBM | 0.6206 | Too weak solo; minimal blend utility |
| Tail-weighted LGBM (α=5) | 0.5619 | Marginal improvement on weak baseline |
| Precision-weighted LGBM | 0.5534 | Best non-v43 LGBM; still far from ceiling |
| KERMT higher-weight blend | 0.4939 | Spearman vs v43 = 0.910; no improvement |
| Artifact removal (pEC50 < 2) | 0.5745 | Removing "artifacts" hurt generalization |
| Test-time SMILES augmentation | 0.0376 MAE | No benefit over canonical SMILES |
---
## 9. Key Learnings
1. **Chemprop D-MPNN is the dominant diversity contributor.** All attempts to replace or augment it with other learned representations, both transformer-based and alternative graph models (ChemBERTa, GIN, Uni-Mol), converged to Spearman > 0.88 vs the D-MPNN representation, suggesting that they capture equivalent 2D-graph information on this dataset.
2. **External data quality > quantity.** ChEMBL broad (2,122 compounds) and BindingDB (364 compounds) did not improve pretrain quality because their pActivity distributions (median ~5.0–6.6) differed from OpenADMET (median ~4.3). Distribution mismatch outweighed volume.
3. **Honest per-fold calibration is essential.** In-sample isotonic calibration overestimates improvement by ~0.01 RAE. Our v50 catastrophic regression (CV 0.454 → LB 0.658) was caught only by the honest-CV gate.
4. **Tail compression is the structural bottleneck.** Our predictions top out at ~6.3, whereas the train maximum is 7.55. Only 11 train compounds (0.3%) have pEC50 ≥ 6.5. No loss-function tweak or ensemble method can extrapolate from 11 examples. Closing the gap to top-25 (Δ = 0.013 LB RAE) would require proprietary data, 3D structural features, or a genuinely new architecture, each of which needs days of setup beyond our time budget.
5. **The ceiling is real.** After 35+ experiments (27 in headless loops + 8 in interactive sessions), every model class converged to Spearman > 0.87 vs v43. The 2D-graph + descriptor space is fully explored. v43 is the best achievable with public data and current infrastructure.
---
## 10. Code and Reproducibility
All code, environment specifications, and trained model checkpoints are published at https://huggingface.co/RyeCatcher/openadmet-pxr-challenge-2026 under Apache 2.0. The repository includes:
- `code/baseline/`: LGBM/XGBoost baselines (v2, v3, v4)
- `code/featurization/`: Mordred, CheMeleon, FCFP4 wrappers
- `code/multitask/T1_chemprop_external_pretrain.py`: Chemprop multitask training
- `code/multitask/T1v5_chemprop_chembl_nr1i2.py`: 5-head Chemprop with ChEMBL NR1I2
- `code/multitask/T2_autogluon_chemeleon.py`: AutoGluon-tabular on CheMeleon
- `code/ensemble/v43_final_blend.py`: Final blend reproducing the submission
- `code/submit/submit_v43.py`: Gradio API submission script
- `models/`: Trained Chemprop and AutoGluon checkpoints (v43 lineage)
- `data/oof_predictions/`: Per-track out-of-fold and test predictions
- `submission/v43_final.parquet`: The 513-row submission
- `code/requirements.txt`: Pinned dependency list
**Dependencies**: Python 3.12, RDKit 2026.03.1, Chemprop 2.2.3, PyTorch 2.11.0+cu130, LightGBM 4.6.0, XGBoost 2.1.4, scikit-learn 1.8.0, pandas 3.0.2
---
## 11. Hardware
Training was performed on an NVIDIA DGX Spark (GB10 Blackwell, 128GB unified memory). Typical wall-clock times:
- Chemprop pretrain (30 epochs): ~3 min
- Chemprop 5-fold finetune: ~8 min
- LightGBM kitchen-sink: ~15 sec
- AutoGluon-CheMeleon: ~3.5 hours
---
*Report finalized 2026-05-11. Best submission: v43 (RAE 0.586 LB, rank ~40/211). Phase 1 deadline: 2026-05-25.*