# Methodology Report: OpenADMET PXR Induction Blind Challenge
**Team:** BioInfo (RyeCatcher)
**Track:** Activity Prediction
**Best Submission:** v43 (LB RAE = 0.586, rank ~40/211 as of 2026-05-10)
**Contact:** justin@rundatarun.io
**Code, models, OOF predictions, and submission:** https://huggingface.co/RyeCatcher/openadmet-pxr-challenge-2026
**Report version:** 1.1 (2026-05-19)
---
## 1. Overview
Our approach is an ensemble of three model families: (1) a Chemprop D-MPNN multitask model pretrained on external ADMET data, (2) a gradient-boosted decision tree ensemble on rich molecular descriptors, and (3) an AutoGluon-tabular model on CheMeleon foundation-model embeddings. Final predictions are produced via a weighted blend with isotonic calibration.
---
## 2. Data Used
### 2.1 OpenADMET Provided Data
We used all provided data sources:
| Source | Size | Usage |
|---|---|---|
| DRC pEC50 (train) | 4,139 compounds | Primary regression target |
| Counter-screen pEC50 | 2,647 compounds (subset of train) | Multitask Head 2 |
| Single-concentration log2fc | 21,003 rows (10,870 unique compounds) | Exploratory multitask (not in final ensemble) |
| Test (blinded) | 513 compounds | Prediction target |
### 2.2 External Data
We incorporated the following public datasets for pretraining and auxiliary multitask learning:
| Dataset | Source | Size | Usage |
|---|---|---|---|
| NCATS qHTS PXR (AID 1346982) | PubChem | 9,671 compounds | Binary activity classification head |
| NCATS qHTS LogAC50 (AID 1346985) | PubChem | 2,458 compounds | Regression head |
| Tox21 SR-ARE | MoleculeNet/DeepChem | ~8,000 compounds | Binary classification head (SR-PXR was unavailable in this release) |
| ChEMBL NR1I2 (PXR) | ChEMBL | 907 compounds with pchembl_value | Multitask Head 5 in T1v5 |
| BindingDB PXR (UniProt O75469) | BindingDB REST API | 364 novel compounds | Exploratory pretrain data |
**No proprietary data was used.** All external data was deduplicated against the OpenADMET train and test sets using canonical isomeric SMILES and InChIKey.
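The deduplication step can be sketched as a simple key filter. This is a minimal illustration, not the repository's code: it assumes canonical isomeric SMILES and InChIKeys were already computed for every record (for example with RDKit), and the function name, record layout, and key strings below are hypothetical placeholders.

```python
def drop_overlapping(external, challenge_smiles, challenge_inchikeys):
    """Keep external records that match no challenge compound on either key.

    Assumes each record already carries a canonical isomeric SMILES and an
    InChIKey; computing those keys is outside this sketch.
    """
    return [
        rec for rec in external
        if rec["smiles"] not in challenge_smiles
        and rec["inchikey"] not in challenge_inchikeys
    ]

# Made-up placeholder keys, for illustration only (not real structures).
challenge_smiles = {"SMILES_TRAIN_1", "SMILES_TEST_1"}
challenge_inchikeys = {"KEY_TRAIN_1", "KEY_TEST_1"}

external = [
    {"smiles": "SMILES_TRAIN_1", "inchikey": "KEY_TRAIN_1", "source": "ChEMBL"},
    {"smiles": "SMILES_ALT_FORM", "inchikey": "KEY_TEST_1", "source": "BindingDB"},
    {"smiles": "SMILES_NEW_1", "inchikey": "KEY_NEW_1", "source": "BindingDB"},
]

novel = drop_overlapping(external, challenge_smiles, challenge_inchikeys)
print([rec["inchikey"] for rec in novel])
```

Checking both keys means a record is dropped even when only one representation matches, as with the second record above.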
---
## 3. Feature Engineering
### 3.1 Molecular Fingerprints
- **FCFP4-count-1024**: RDKit `MorganGenerator` with `MorganFeatureAtomInvGen`, radius=2, fpSize=1024, count fingerprint. This outperformed binary ECFP4 and MACCS in our coverage analysis (90.6% test-to-train Tanimoto ≥ 0.4 coverage).
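A coverage figure of this kind can be computed as follows. The sketch assumes that "coverage" means the fraction of test compounds whose nearest training neighbour reaches the Tanimoto threshold, and it represents count fingerprints as plain `{feature_id: count}` dictionaries rather than RDKit objects; both are assumptions made for illustration.

```python
def count_tanimoto(a, b):
    """Tanimoto similarity for count fingerprints: sum(min) / sum(max)."""
    keys = set(a) | set(b)
    shared = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    total = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return shared / total if total else 0.0

def coverage(test_fps, train_fps, threshold=0.4):
    """Fraction of test fingerprints with at least one train neighbour >= threshold."""
    covered = sum(
        1 for t in test_fps
        if max(count_tanimoto(t, r) for r in train_fps) >= threshold
    )
    return covered / len(test_fps)

# Toy fingerprints (feature id -> count), not derived from real molecules.
train_fps = [{1: 2, 2: 1}, {3: 4}]
test_fps = [{1: 2, 2: 1, 5: 1}, {7: 1}]

print(count_tanimoto(test_fps[0], train_fps[0]))  # 3 shared counts out of 4
print(coverage(test_fps, train_fps))
```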
### 3.2 Molecular Descriptors
- **RDKit descriptors**: 217 descriptors from `useful_rdkit_utils.get_rdkit_desc_names()` including physicochemical properties, topological indices, and fragment counts.
- **Mordred**: 1,613 2D descriptors computed with `mordredcommunity`.
- **CheMeleon embeddings**: 2,048-dimension frozen embeddings from the CheMeleon foundation model (Recursion Pharma), extracted via the OpenADMET toolkit.
### 3.3 Learned Embeddings
- **Chemprop D-MPNN fingerprints**: 300-dimension aggregation-layer embeddings extracted from the pretrained multitask model.
---
## 4. Models
### 4.1 Model Family A: Chemprop D-MPNN Multitask (T1)
A directed message-passing neural network (D-MPNN) implemented in Chemprop v2 with multitask pretraining:
- **Architecture**: BondMessagePassing (d_h=300, depth=3) → MeanAggregation → RegressionFFN (4 tasks, 3 hidden layers, dropout=0.1)
- **Heads**:
1. pEC50 regression (OpenADMET DRC)
2. NCATS qHTS binary active/inactive
3. NCATS qHTS LogAC50 regression
4. Tox21 SR-ARE binary classification
- **Pretraining**: 30 epochs on combined dataset (train + external) with equal head weights
- **Finetuning**: 50 epochs with Head 1 weighted 4× on OpenADMET train only
- **Loss**: MSE (binary tasks treated as regression to 0/1 targets)
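The head weighting described above can be illustrated with a weighted, masked MSE. This is a sketch under stated assumptions, not Chemprop's internal loss: missing labels are encoded as `None`, each head's MSE is averaged over its labelled rows, and the result is normalised by the weights of the heads that had labels. The function name and toy numbers are hypothetical.

```python
def multitask_mse(preds, targets, head_weights):
    """Weighted mean of per-head MSEs, skipping missing labels (None)."""
    weighted_sum, weight_sum = 0.0, 0.0
    for h, w in enumerate(head_weights):
        errs = [
            (p[h] - t[h]) ** 2
            for p, t in zip(preds, targets)
            if t[h] is not None
        ]
        if errs:  # a head with no labels in this batch contributes nothing
            weighted_sum += w * sum(errs) / len(errs)
            weight_sum += w
    if weight_sum == 0.0:
        raise ValueError("no labelled targets in batch")
    return weighted_sum / weight_sum

# Two toy compounds, four heads: pEC50, qHTS active, LogAC50, SR-ARE.
preds = [[5.0, 0.5, -5.5, 0.2], [6.0, 0.1, -5.0, 0.5]]
targets = [[4.0, 1.0, None, None], [6.0, None, -5.0, 0.0]]

pretrain_loss = multitask_mse(preds, targets, [1, 1, 1, 1])  # equal head weights
finetune_loss = multitask_mse(preds, targets, [4, 1, 1, 1])  # Head 1 weighted 4x
print(pretrain_loss, finetune_loss)
```

Binary heads enter exactly like the regression heads, with 0/1 targets, which matches the "binary tasks treated as regression" note.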
### 4.2 Model Family B: LightGBM Kitchen-Sink (v4)
- **Features**: 4,902-dimension vector = RDKit 217 + FCFP4-count-1024 + Mordred 1,613 + CheMeleon 2,048
- **Model**: LightGBM with MSE loss, 5-fold CV, early stopping
- **Hyperparameters**: number of boosting rounds for the full-train model set via a median fold-best-iteration heuristic
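One reading of the median fold-best-iteration heuristic, together with a check of the feature-vector arithmetic, is sketched below. The per-fold iteration counts are made up, and interpreting the heuristic as "use the median early-stopping round as the full-train round count" is an assumption.

```python
from statistics import median

# Feature block widths as listed above; their sum should be the 4,902-d vector.
FEATURE_BLOCKS = {
    "rdkit_descriptors": 217,
    "fcfp4_count": 1024,
    "mordred": 1613,
    "chemeleon": 2048,
}

def full_train_rounds(fold_best_iterations):
    """Pick the boosting-round count for the full-train model.

    Assumed heuristic: take the median of the early-stopping rounds observed
    across CV folds, since the full-train fit has no held-out fold to stop on.
    """
    return int(round(median(fold_best_iterations)))

fold_best = [412, 388, 455, 401, 430]  # hypothetical per-fold best iterations
n_rounds = full_train_rounds(fold_best)

print(sum(FEATURE_BLOCKS.values()), n_rounds)
```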
### 4.3 Model Family C: AutoGluon on CheMeleon (T2)
- **Features**: CheMeleon 2,048-dimension frozen embeddings
- **Model**: AutoGluon TabularPredictor with `best_quality` preset
- **Runtime**: ~3.5 hours for 5-fold OOF generation
### 4.4 Model Family D: Chemprop + ChEMBL Multitask (T1v5)
T1v5 extends T1 with an additional head for ChEMBL NR1I2 (PXR) `pchembl_value` (907 compounds), giving a 5-head Chemprop model with otherwise the same architecture as T1.
---
## 5. Ensemble Strategy
Our final submission (v43) is a **cascade-weighted blend** built iteratively:
```
v7 = 0.50 × T1 + 0.38 × v4 + 0.12 × v3 (isotonic calibrated)
v22 = 0.975 × v7 + 0.025 × KERMT
v26 = 0.90 × v22 + 0.10 × CheMeleon-FT
S6 = 0.90 × v26 + 0.10 × TabPFN
v28 = S6 + T1v2 (CYP3A4 multitask) at w=0.12
v29 = v28 + T16 (cliff-weighted Chemprop) at w=0.10
v31 = 0.83 × v29 + 0.17 × T2 (AutoGluon-CheMeleon)
v43 = 0.78 × v31 + 0.22 × T1v5 (ChEMBL multitask)
```
All weights were determined by grid search maximizing honest OOF RAE improvement with a minimum Δ threshold of +0.003 vs the previous step.
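Because each step reweights everything before it, it can help to expand the cascade into effective per-component weights. The sketch below assumes every step is a convex blend, `new = (1 - w) * previous + w * component`, including the v28 and v29 steps that are written only as "at w=...". It also ignores the isotonic calibration applied to v7, which is nonlinear, so the numbers are approximate. The helper name is hypothetical.

```python
def expand_cascade(base_weights, steps):
    """Return each component's effective weight after a chain of convex blends."""
    weights = dict(base_weights)
    for name, w in steps:
        # new_blend = (1 - w) * previous_blend + w * component
        weights = {k: v * (1.0 - w) for k, v in weights.items()}
        weights[name] = weights.get(name, 0.0) + w
    return weights

V7_BASE = {"T1": 0.50, "v4": 0.38, "v3": 0.12}
STEPS = [
    ("KERMT", 0.025),        # v22
    ("CheMeleon-FT", 0.10),  # v26
    ("TabPFN", 0.10),        # S6
    ("T1v2", 0.12),          # v28 (convex form assumed)
    ("T16", 0.10),           # v29 (convex form assumed)
    ("T2", 0.17),            # v31
    ("T1v5", 0.22),          # v43
]

effective = expand_cascade(V7_BASE, STEPS)
for name, w in sorted(effective.items(), key=lambda kv: -kv[1]):
    print(f"{name:>13s}  {w:.4f}")
```

Under these assumptions T1v5 keeps its nominal 0.22, while earlier components are scaled down by every later step; T2, for example, ends at 0.17 × 0.78 ≈ 0.133.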
**Calibration**: IsotonicRegression (out_of_bounds='clip') applied to blend outputs. For honest CV evaluation, per-fold IsotonicRegression is fitted on each fold's OOF separately to avoid leakage.
---
## 6. Validation Strategy
### 6.1 Cross-Validation
- **Scheme**: Butina clustering at Tanimoto cutoff 0.4 on ECFP4 2048-bit fingerprints, followed by GroupKFold (5 folds)
- **Rationale**: Groups chemically similar compounds together to simulate the challenge's analog-set test construction
- **Primary metric**: RAE (Relative Absolute Error) = Σ|y_true - y_pred| / Σ|y_true - mean(y_true)|
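The metric definition translates directly into a few lines of Python. This is a plain-Python sketch of the formula as written above, not the challenge's official scorer.

```python
def rae(y_true, y_pred):
    """Relative Absolute Error: sum|y - y_hat| / sum|y - mean(y)|."""
    mean_true = sum(y_true) / len(y_true)
    numerator = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    denominator = sum(abs(t - mean_true) for t in y_true)
    return numerator / denominator

y_true = [4.0, 5.0, 6.0]
print(rae(y_true, [4.5, 5.0, 5.5]))  # half the error of the mean predictor
print(rae(y_true, [5.0, 5.0, 5.0]))  # predicting the mean scores exactly 1.0
```

An RAE of 1.0 corresponds to always predicting the mean of the true values, so lower is better and values below 1.0 beat that trivial baseline.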
### 6.2 Honest Calibration Protocol
We distinguish two calibration protocols:
- **In-sample isotonic**: Single IsotonicRegression fit on full-train OOF (used for test submission; slightly optimistic)
- **Honest per-fold isotonic**: Separate IsotonicRegression fit per CV fold on that fold's OOF only (used for candidate gating; conservative)
All candidates must pass the honest per-fold iso gate (Δ ≥ +0.003 vs v43 honest OOF 0.4798) before queueing for submission.
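The per-fold protocol and the gate can be sketched as below. To stay dependency-free, the example swaps scikit-learn's `IsotonicRegression` for a small pool-adjacent-violators routine that returns fitted values at the training points only; unlike scikit-learn it does not pre-average tied inputs. It also assumes Δ means baseline RAE minus candidate RAE (lower RAE is better). All function names are hypothetical.

```python
def pava_fit_transform(x, y):
    """Non-decreasing least-squares fit of y on x, evaluated at the inputs."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    blocks = []  # each block is [sum_of_y, count]
    for i in order:
        blocks.append([y[i], 1])
        # Merge backwards while the previous block mean exceeds the last one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted_sorted = []
    for s, n in blocks:
        fitted_sorted.extend([s / n] * n)
    fitted = [0.0] * len(x)
    for rank, i in enumerate(order):
        fitted[i] = fitted_sorted[rank]
    return fitted

def rae(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    num = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    return num / sum(abs(t - mean_true) for t in y_true)

def honest_per_fold_rae(oof_pred, y_true, fold_ids):
    """Calibrate each fold's OOF predictions separately, then score all rows."""
    calibrated = list(oof_pred)
    for fold in sorted(set(fold_ids)):
        idx = [i for i, f in enumerate(fold_ids) if f == fold]
        fit = pava_fit_transform([oof_pred[i] for i in idx],
                                 [y_true[i] for i in idx])
        for i, value in zip(idx, fit):
            calibrated[i] = value
    return rae(y_true, calibrated)

def passes_gate(candidate_rae, baseline_rae=0.4798, min_delta=0.003):
    """Assumed gate: the candidate must improve on the baseline by min_delta."""
    return baseline_rae - candidate_rae >= min_delta

toy_rae = honest_per_fold_rae(
    oof_pred=[1.0, 2.0, 3.0, 4.0, 1.0, 2.0],
    y_true=[1.0, 3.0, 2.0, 4.0, 2.0, 1.0],
    fold_ids=[0, 0, 0, 0, 1, 1],
)
print(toy_rae, passes_gate(0.4760), passes_gate(0.4780))
```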
### 6.3 Statistical Rigor
- **Cluster-bootstrap 95% CI**: Bootstrapped at the Butina-cluster level (not compound level) to account for chemical similarity structure
- **Bootstrap iterations**: 1,000 with replacement
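A cluster-level bootstrap resamples whole Butina clusters rather than individual compounds, so similar molecules enter or leave a replicate together. The sketch below uses a simple percentile interval and skips degenerate replicates where RAE is undefined; both choices, the function names, and the toy data are assumptions rather than the repository's implementation.

```python
import random

def rae(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    num = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    return num / sum(abs(t - mean_true) for t in y_true)

def cluster_bootstrap_ci(y_true, y_pred, cluster_ids,
                         n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for RAE, resampling clusters with replacement."""
    rng = random.Random(seed)
    members = {}
    for i, c in enumerate(cluster_ids):
        members.setdefault(c, []).append(i)
    clusters = sorted(members)
    stats = []
    for _ in range(n_boot):
        sampled = rng.choices(clusters, k=len(clusters))
        idx = [i for c in sampled for i in members[c]]
        yt = [y_true[i] for i in idx]
        if len(set(yt)) < 2:
            continue  # RAE is undefined when all resampled targets are equal
        stats.append(rae(yt, [y_pred[i] for i in idx]))
    stats.sort()
    n = len(stats)
    lo = stats[int((alpha / 2) * n)]
    hi = stats[min(n - 1, int((1 - alpha / 2) * n))]
    return lo, hi

# Toy data: three clusters of two compounds each.
y_true = [4.0, 5.0, 6.0, 4.5, 5.5, 6.5]
y_pred = [4.2, 5.1, 5.5, 4.4, 5.9, 6.0]
cluster_ids = ["a", "a", "b", "b", "c", "c"]

lo, hi = cluster_bootstrap_ci(y_true, y_pred, cluster_ids)
print(f"95% CI for RAE: [{lo:.3f}, {hi:.3f}]")
```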
---
## 7. Submission History & Performance
| Tag | Date | LB RAE | LB Rank | Key Change |
|---|---|---|---|---|
| v2-baseline-xgb | 2026-05-06 | 0.7412 | 145/199 | XGBoost MAE on FCFP4 + RDKit |
| v3-baseline-lgbm | 2026-05-06 | 0.7249 | 131/199 | LightGBM MSE, Butina CV |
| v4-kitchen-sink | 2026-05-07 | 0.6889 | 112/201 | +Mordred +CheMeleon (4902d) |
| v7-ensemble | 2026-05-07 | 0.6039 | 42/202 | +Chemprop D-MPNN multitask |
| v31-blend | 2026-05-07 | 0.5966 | 42/207 | +AutoGluon-CheMeleon + KERMT + CheMeleon-FT |
| v43-final | 2026-05-08 | 0.586 | ~40/211 | +T1v5 ChEMBL multitask, clean hierarchy |
| v43-defensive | 2026-05-11 | 0.586 | ~40/211 | Defensive re-submit; no improved candidate found |
**CV-to-LB shift trend**: 0.170 → 0.158 → 0.134 → 0.107. The shift narrows as model quality improves.
**Final rank as of 2026-05-11**: 40-42 / 211 (top 19%). Gap to top-25: ~0.013 RAE. Gap to #1 (Yan): ~0.090 RAE.
---
## 8. Exploratory Work (Not in Final Ensemble)
We tested but did not include the following approaches due to honest-CV gate failure:
| Approach | Honest OOF RAE | Reason for Exclusion |
|---|---|---|
| Single-conc multitask (T1v7) | 0.6076 | Diluted primary task; log2fc scale mismatch |
| BindingDB + ChEMBL broad pretrain (T1v9) | 0.5538 | Distribution mismatch; no improvement over T1 |
| MaskMol / MAE ViT-Base | 0.7122 | Too weak solo; zero blend utility |
| TabPFN on CheMeleon | 0.5551 | Spearman vs v43 = 0.931 (too correlated) |
| GIN + ACtriplet (T11) | 0.5839 | Spearman vs v43 = 0.890 (too correlated) |
| Uni-MolV2 310M fine-tune | 0.630 | Poor convergence; high correlation |
| Boltz-2 structural confidence | 0.845 | Confidence scores ≠ binding affinity |
| Differentiable ensemble (14 OOFs) | 0.4966 | Worse than v43 alone; correlated errors |
| ADMET-AI features + LGBM | 0.6206 | Too weak solo; minimal blend utility |
| Tail-weighted LGBM (α=5) | 0.5619 | Marginal improvement on weak baseline |
| Precision-weighted LGBM | 0.5534 | Best non-v43 LGBM; still far from ceiling |
| KERMT higher-weight blend | 0.4939 | Spearman vs v43 = 0.910; no improvement |
| Artifact removal (pEC50 < 2) | 0.5745 | Removing "artifacts" hurt generalization |
| Test-time SMILES augmentation | 0.0376 MAE | No benefit over canonical SMILES |
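Several exclusions above rest on the Spearman correlation between a candidate's predictions and v43's. A dependency-free sketch of that statistic, using average ranks for ties, is shown below; the prediction vectors are toy values and no cutoff is implied beyond what the table reports.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation; assumes neither input is constant."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def spearman(a, b):
    """Spearman rank correlation = Pearson correlation of the average ranks."""
    return pearson(average_ranks(a), average_ranks(b))

candidate = [4.1, 4.8, 5.2, 6.0, 4.4]  # toy candidate predictions
reference = [4.0, 5.0, 5.1, 5.9, 4.6]  # toy reference-blend predictions
print(round(spearman(candidate, reference), 3))
```

A candidate whose ranking of compounds is nearly identical to the reference blend has little to add to a weighted average, even if its standalone error is competitive.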
---
## 9. Key Learnings
1. **Chemprop D-MPNN is the dominant diversity contributor.** All attempts to replace or augment it with transformer-based models (ChemBERTa, GIN, Uni-Mol) converged to Spearman > 0.88 vs the D-MPNN representation, indicating equivalent 2D-graph learning on this dataset.
2. **External data quality > quantity.** ChEMBL broad (2,122 compounds) and BindingDB (364 compounds) did not improve pretrain quality because their pActivity distributions (median ~5.0–6.6) differed from OpenADMET (median ~4.3). Distribution mismatch outweighed volume.
3. **Honest per-fold calibration is essential.** In-sample isotonic calibration overestimates improvement by ~0.01 RAE. Our v50 catastrophic regression (CV 0.454 → LB 0.658) was caught only by the honest-CV gate.
4. **Tail compression is the structural bottleneck.** Our predictions max at ~6.3 vs train max 7.55. Only 11 train compounds (0.3%) have pEC50 ≥ 6.5. No loss-function tweak or ensemble method can extrapolate from 11 examples. Closing the gap to top-25 (Δ = 0.013 LB RAE) would require either proprietary data, 3D structural features, or a genuinely new architecture, each requiring days of setup beyond our time budget.
5. **The ceiling is real.** After 35+ experiments (27 in headless loops + 8 in interactive sessions), every model class converged to Spearman > 0.87 vs v43. The 2D-graph + descriptor space is fully explored. v43 is the best achievable with public data and current infrastructure.
---
## 10. Code and Reproducibility
All code, environment specifications, and trained model checkpoints are published at https://huggingface.co/RyeCatcher/openadmet-pxr-challenge-2026 under Apache 2.0. The repository includes:
- `code/baseline/`: LGBM/XGBoost baselines (v2, v3, v4)
- `code/featurization/`: Mordred, CheMeleon, FCFP4 wrappers
- `code/multitask/T1_chemprop_external_pretrain.py`: Chemprop multitask training
- `code/multitask/T1v5_chemprop_chembl_nr1i2.py`: 5-head Chemprop with ChEMBL NR1I2
- `code/multitask/T2_autogluon_chemeleon.py`: AutoGluon-tabular on CheMeleon
- `code/ensemble/v43_final_blend.py`: Final blend reproducing the submission
- `code/submit/submit_v43.py`: Gradio API submission script
- `models/`: Trained Chemprop and AutoGluon checkpoints (v43 lineage)
- `data/oof_predictions/`: Per-track out-of-fold and test predictions
- `submission/v43_final.parquet`: The 513-row submission
- `code/requirements.txt`: Pinned dependency list
**Dependencies**: Python 3.12, RDKit 2026.03.1, Chemprop 2.2.3, PyTorch 2.11.0+cu130, LightGBM 4.6.0, XGBoost 2.1.4, scikit-learn 1.8.0, pandas 3.0.2
---
## 11. Hardware
Training was performed on an NVIDIA DGX Spark (GB10 Blackwell, 128GB unified memory). Typical wall-clock times:
- Chemprop pretrain (30 epochs): ~3 min
- Chemprop 5-fold finetune: ~8 min
- LightGBM kitchen-sink: ~15 sec
- AutoGluon-CheMeleon: ~3.5 hours
---
*Report finalized 2026-05-11. Best submission: v43 (RAE 0.586 LB, rank ~40/211). Phase 1 deadline: 2026-05-25.*