# Methodology Report: OpenADMET PXR Induction Blind Challenge

**Team:** BioInfo (RyeCatcher)

**Track:** Activity Prediction

**Best Submission:** v43 (LB RAE = 0.586, rank ~40/211 as of 2026-05-10)

**Contact:** justin@rundatarun.io

**Code, models, OOF predictions, and submission:** https://huggingface.co/RyeCatcher/openadmet-pxr-challenge-2026

**Report version:** 1.1 (2026-05-19)

---

## 1. Overview

Our approach is an ensemble built on three core model families: (1) a Chemprop D-MPNN multitask model pretrained on external ADMET data, (2) a gradient-boosted decision tree ensemble on rich molecular descriptors, and (3) an AutoGluon-tabular model on CheMeleon foundation-model embeddings. Final predictions are produced via a weighted blend with isotonic calibration. The blend also includes a fourth Chemprop variant with a ChEMBL head (Section 4.4) and several low-weight components (Section 5).

---

## 2. Data Used

### 2.1 OpenADMET Provided Data

We used all provided data sources:

| Source | Size | Usage |
|---|---|---|
| DRC pEC50 (train) | 4,139 compounds | Primary regression target |
| Counter-screen pEC50 | 2,647 compounds (subset of train) | Multitask Head 2 |
| Single-concentration log2fc | 21,003 rows (10,870 unique compounds) | Exploratory multitask (not in final ensemble) |
| Test (blinded) | 513 compounds | Prediction target |

### 2.2 External Data

We incorporated the following public datasets for pretraining and auxiliary multitask learning:

| Dataset | Source | Size | Usage |
|---|---|---|---|
| NCATS qHTS PXR (AID 1346982) | PubChem | 9,671 compounds | Binary activity classification head |
| NCATS qHTS LogAC50 (AID 1346985) | PubChem | 2,458 compounds | Regression head |
| Tox21 SR-ARE | MoleculeNet/DeepChem | ~8,000 compounds | Binary classification head (SR-PXR was unavailable in this release) |
| ChEMBL NR1I2 (PXR) | ChEMBL | 907 compounds with pchembl_value | Multitask Head 5 in T1v5 |
| BindingDB PXR (UniProt O75469) | BindingDB REST API | 364 novel compounds | Exploratory pretrain data |

**No proprietary data was used.** All external data was deduplicated against the OpenADMET train and test
sets using canonical isomeric SMILES and InChIKey.

---

## 3. Feature Engineering

### 3.1 Molecular Fingerprints

- **FCFP4-count-1024**: RDKit `rdFingerprintGenerator.GetMorganGenerator` with `GetMorganFeatureAtomInvGen()`, radius=2, fpSize=1024, count fingerprint. In our coverage analysis, 90.6% of test compounds had a nearest-train Tanimoto similarity ≥ 0.4 under this fingerprint, which was better coverage than binary ECFP4 or MACCS.

### 3.2 Molecular Descriptors

- **RDKit descriptors**: 217 descriptors from `useful_rdkit_utils.get_rdkit_desc_names()`, including physicochemical properties, topological indices, and fragment counts.
- **Mordred**: 1,613 2D descriptors computed with `mordredcommunity`.
- **CheMeleon embeddings**: 2,048-dimension frozen embeddings from the CheMeleon foundation model (Recursion Pharma), extracted via the OpenADMET toolkit.

### 3.3 Learned Embeddings

- **Chemprop D-MPNN fingerprints**: 300-dimension aggregation-layer embeddings extracted from the pretrained multitask model.

---

## 4. Models

### 4.1 Model Family A: Chemprop D-MPNN Multitask (T1)

A directed message-passing neural network (D-MPNN) implemented in Chemprop v2 with multitask pretraining:

- **Architecture**: BondMessagePassing (d_h=300, depth=3) → MeanAggregation → RegressionFFN (4 tasks, 3 hidden layers, dropout=0.1)
- **Heads**:
  1. pEC50 regression (OpenADMET DRC)
  2. NCATS qHTS binary active/inactive
  3. NCATS qHTS LogAC50 regression
  4. Tox21 SR-ARE binary classification
- **Pretraining**: 30 epochs on the combined dataset (train + external) with equal head weights
- **Finetuning**: 50 epochs on OpenADMET train only, with Head 1 weighted 4×
- **Loss**: MSE (binary tasks treated as regression to 0/1 targets)

### 4.2 Model Family B: LightGBM Kitchen-Sink (v4)

- **Features**: 4,902-dimension vector = RDKit 217 + FCFP4-count-1024 + Mordred 1,613 + CheMeleon 2,048
- **Model**: LightGBM with MSE loss, 5-fold CV, early stopping
- **Hyperparameters**: the full-train model uses the median of the per-fold best iterations as its number of boosting rounds

### 4.3 Model Family C: AutoGluon on CheMeleon (T2)

- **Features**: CheMeleon 2,048-dimension frozen embeddings
- **Model**: AutoGluon TabularPredictor with the `best_quality` preset
- **Runtime**: ~3.5 hours for 5-fold OOF generation

### 4.4 Model Family D: Chemprop + ChEMBL Multitask (T1v5)

Extends T1 with an additional head for ChEMBL NR1I2 (PXR) pchembl_value (907 compounds), giving a 5-head Chemprop model with the same architecture as T1.

---

## 5. Ensemble Strategy

Our final submission (v43) is a **cascade-weighted blend** built iteratively:

```
v7  = 0.50 × T1 + 0.38 × v4 + 0.12 × v3  (isotonic calibrated)
v22 = 0.975 × v7 + 0.025 × KERMT
v26 = 0.90 × v22 + 0.10 × CheMeleon-FT
S6  = 0.90 × v26 + 0.10 × TabPFN
v28 = S6 + T1v2 (CYP3A4 multitask) at w=0.12
v29 = v28 + T16 (cliff-weighted Chemprop) at w=0.10
v31 = 0.83 × v29 + 0.17 × T2 (AutoGluon-CheMeleon)
v43 = 0.78 × v31 + 0.22 × T1v5 (ChEMBL multitask)
```

Each weight was chosen by grid search on honest OOF RAE. A component was kept only if it improved (lowered) honest OOF RAE by at least 0.003 relative to the previous step.

**Calibration**: IsotonicRegression (out_of_bounds='clip') applied to blend outputs. For honest CV evaluation, a separate IsotonicRegression is fitted on each fold's OOF predictions to avoid leakage.

---

## 6. Validation Strategy

### 6.1 Cross-Validation

- **Scheme**: Butina clustering at Tanimoto cutoff 0.4 on ECFP4 2048-bit fingerprints, followed by GroupKFold (5 folds)
- **Rationale**: Keeps chemically similar compounds in the same fold to simulate the challenge's analog-set test construction
- **Primary metric**: RAE (Relative Absolute Error) = Σ|y_true - y_pred| / Σ|y_true - mean(y_true)|; lower is better, and a constant mean predictor scores 1.0

### 6.2 Honest Calibration Protocol

We distinguish two calibration protocols:

- **In-sample isotonic**: Single IsotonicRegression fit on full-train OOF (used for test submission; slightly optimistic)
- **Honest per-fold isotonic**: Separate IsotonicRegression fit per CV fold on that fold's OOF only (used for candidate gating; conservative)

All candidates must pass the honest per-fold iso gate (Δ ≥ +0.003 vs v43 honest OOF 0.4798) before queueing for submission.

### 6.3 Statistical Rigor

- **Cluster-bootstrap 95% CI**: Bootstrapped at the Butina-cluster level (not compound level) to account for chemical similarity structure
- **Bootstrap iterations**: 1,000 with replacement

---

## 7. Submission History & Performance

| Tag | Date | LB RAE | LB Rank | Key Change |
|---|---|---|---|---|
| v2-baseline-xgb | 2026-05-06 | 0.7412 | 145/199 | XGBoost MAE on FCFP4 + RDKit |
| v3-baseline-lgbm | 2026-05-06 | 0.7249 | 131/199 | LightGBM MSE, Butina CV |
| v4-kitchen-sink | 2026-05-07 | 0.6889 | 112/201 | +Mordred +CheMeleon (4902d) |
| v7-ensemble | 2026-05-07 | 0.6039 | 42/202 | +Chemprop D-MPNN multitask |
| v31-blend | 2026-05-07 | 0.5966 | 42/207 | +AutoGluon-CheMeleon + KERMT + CheMeleon-FT |
| v43-final | 2026-05-08 | 0.586 | ~40/211 | +T1v5 ChEMBL multitask, clean hierarchy |
| v43-defensive | 2026-05-11 | 0.586 | ~40/211 | Defensive re-submit; no improved candidate found |

**CV-to-LB shift trend**: 0.170 → 0.158 → 0.134 → 0.107. The shift narrows as model quality improves.

**Final rank as of 2026-05-11**: 40-42 / 211 (top 19%). Gap to top-25: ~0.013 RAE. Gap to #1 (Yan): ~0.090 RAE.

---

## 8. Exploratory Work (Not in Final Ensemble)

We tested the following approaches but excluded them because they failed the honest-CV gate:

| Approach | Honest OOF RAE | Reason for Exclusion |
|---|---|---|
| Single-conc multitask (T1v7) | 0.6076 | Diluted primary task; log2fc scale mismatch |
| BindingDB + ChEMBL broad pretrain (T1v9) | 0.5538 | Distribution mismatch; no improvement over T1 |
| MaskMol / MAE ViT-Base | 0.7122 | Too weak solo; zero blend utility |
| TabPFN on CheMeleon | 0.5551 | Spearman vs v43 = 0.931 (too correlated) |
| GIN + ACtriplet (T11) | 0.5839 | Spearman vs v43 = 0.890 (too correlated) |
| Uni-MolV2 310M fine-tune | 0.630 | Poor convergence; high correlation |
| Boltz-2 structural confidence | 0.845 | Confidence scores ≠ binding affinity |
| Differentiable ensemble (14 OOFs) | 0.4966 | Worse than v43 alone; correlated errors |
| ADMET-AI features + LGBM | 0.6206 | Too weak solo; minimal blend utility |
| Tail-weighted LGBM (α=5) | 0.5619 | Marginal improvement on weak baseline |
| Precision-weighted LGBM | 0.5534 | Best non-v43 LGBM; still far from ceiling |
| KERMT higher-weight blend | 0.4939 | Spearman vs v43 = 0.910; no improvement |
| Artifact removal (pEC50 < 2) | 0.5745 | Removing "artifacts" hurt generalization |
| Test-time SMILES augmentation | 0.0376 MAE | No benefit over canonical SMILES |

---

## 9. Key Learnings

1. **Chemprop D-MPNN is the dominant diversity contributor.** All attempts to replace or augment it with other learned representations (ChemBERTa, GIN, Uni-Mol) converged to Spearman > 0.88 vs the D-MPNN representation, suggesting that they learn essentially the same 2D-graph information on this dataset.
2. **External data quality > quantity.** ChEMBL broad (2,122 compounds) and BindingDB (364 compounds) did not improve pretrain quality because their pActivity distributions (median ~5.0–6.6) differed from OpenADMET (median ~4.3). Distribution mismatch outweighed volume.
3. **Honest per-fold calibration is essential.** In-sample isotonic calibration overestimates improvement by ~0.01 RAE. Our v50 catastrophic regression (CV 0.454 → LB 0.658) was caught only by the honest-CV gate.
4. **Tail compression is the structural bottleneck.** Our predictions max out at ~6.3, whereas the train maximum is 7.55. Only 11 train compounds (0.3%) have pEC50 ≥ 6.5. No loss-function tweak or ensemble method can extrapolate from 11 examples. Closing the gap to top-25 (Δ = 0.013 LB RAE) would require proprietary data, 3D structural features, or a genuinely new architecture, each of which would take days of setup beyond our time budget.
5. **The ceiling is real.** After 35+ experiments (27 in headless loops + 8 in interactive sessions), every model class we tried converged to Spearman > 0.87 vs v43. We consider the 2D-graph + descriptor space effectively exhausted for our setup, and v43 is the best we could achieve with public data and our current infrastructure.

---

## 10. Code and Reproducibility

All code, environment specifications, and trained model checkpoints are published at https://huggingface.co/RyeCatcher/openadmet-pxr-challenge-2026 under Apache 2.0.
The repository includes:

- `code/baseline/`: LGBM/XGBoost baselines (v2, v3, v4)
- `code/featurization/`: Mordred, CheMeleon, FCFP4 wrappers
- `code/multitask/T1_chemprop_external_pretrain.py`: Chemprop multitask training
- `code/multitask/T1v5_chemprop_chembl_nr1i2.py`: 5-head Chemprop with ChEMBL NR1I2
- `code/multitask/T2_autogluon_chemeleon.py`: AutoGluon-tabular on CheMeleon
- `code/ensemble/v43_final_blend.py`: Final blend reproducing the submission
- `code/submit/submit_v43.py`: Gradio API submission script
- `models/`: Trained Chemprop and AutoGluon checkpoints (v43 lineage)
- `data/oof_predictions/`: Per-track out-of-fold and test predictions
- `submission/v43_final.parquet`: The 513-row submission
- `code/requirements.txt`: Pinned dependency list

**Dependencies**: Python 3.12, RDKit 2026.03.1, Chemprop 2.2.3, PyTorch 2.11.0+cu130, LightGBM 4.6.0, XGBoost 2.1.4, scikit-learn 1.8.0, pandas 3.0.2

---

## 11. Hardware

Training was performed on an NVIDIA DGX Spark (GB10 Blackwell, 128GB unified memory). Typical wall-clock times:

- Chemprop pretrain (30 epochs): ~3 min
- Chemprop 5-fold finetune: ~8 min
- LightGBM kitchen-sink: ~15 sec
- AutoGluon-CheMeleon: ~3.5 hours

---

*Report finalized 2026-05-11. Best submission: v43 (RAE 0.586 LB, rank ~40/211). Phase 1 deadline: 2026-05-25.*
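
As a minimal sketch of the RAE metric from Section 6.1 and the cluster-level bootstrap from Section 6.3, the following standard-library Python computes RAE and a percentile 95% CI by resampling whole clusters with replacement. The function names `rae` and `cluster_bootstrap_ci`, the percentile interval, the seed handling, and the toy data are illustrative assumptions; this is not the code in the repository.

```python
import random
from collections import defaultdict


def rae(y_true, y_pred):
    """Relative Absolute Error: sum|y - y_hat| / sum|y - mean(y)|."""
    mean_y = sum(y_true) / len(y_true)
    num = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    den = sum(abs(t - mean_y) for t in y_true)
    return num / den


def cluster_bootstrap_ci(y_true, y_pred, clusters, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for RAE, resampling cluster IDs (not compounds)."""
    members = defaultdict(list)
    for i, c in enumerate(clusters):
        members[c].append(i)
    ids = sorted(members)
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Draw as many clusters as exist, with replacement, and keep
        # every compound of each drawn cluster together.
        idx = [i for c in rng.choices(ids, k=len(ids)) for i in members[c]]
        yt = [y_true[i] for i in idx]
        yp = [y_pred[i] for i in idx]
        if len(set(yt)) < 2:
            continue  # RAE is undefined when all targets are identical
        stats.append(rae(yt, yp))
    if not stats:
        raise ValueError("no valid bootstrap resamples")
    stats.sort()
    lo = stats[int((alpha / 2) * (len(stats) - 1))]
    hi = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lo, hi


# Toy data (hypothetical): six compounds in three clusters.
y_true = [4.0, 4.5, 5.0, 6.0, 4.2, 5.5]
y_pred = [4.2, 4.4, 5.3, 5.5, 4.3, 5.1]
clusters = [0, 0, 1, 1, 2, 2]
point = rae(y_true, y_pred)
ci_low, ci_high = cluster_bootstrap_ci(y_true, y_pred, clusters)
print(f"RAE = {point:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

Resampling cluster IDs rather than individual compounds keeps analog series intact in each replicate, so the interval reflects uncertainty across chemotypes instead of across near-duplicate molecules.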