Methodology Report: OpenADMET PXR Induction Blind Challenge

Team: BioInfo (RyeCatcher)
Track: Activity Prediction
Best Submission: v43 (LB RAE = 0.586, rank ~40/211 as of 2026-05-10)
Contact: justin@rundatarun.io
Code, models, OOF predictions, and submission: https://huggingface.co/RyeCatcher/openadmet-pxr-challenge-2026
Report version: 1.1 (2026-05-19)


1. Overview

Our approach is an ensemble built on three core model families: (1) a Chemprop D-MPNN multitask model pretrained on external ADMET data, (2) a gradient-boosted decision tree ensemble on rich molecular descriptors, and (3) an AutoGluon-tabular model on CheMeleon foundation-model embeddings. A ChEMBL multitask variant of the Chemprop model (Section 4.4) and several smaller components (Section 5) are blended in on top. Final predictions are produced via a cascade of weighted blends with isotonic calibration.


2. Data Used

2.1 OpenADMET Provided Data

We used all provided data sources:

| Source | Size | Usage |
|---|---|---|
| DRC pEC50 (train) | 4,139 compounds | Primary regression target |
| Counter-screen pEC50 | 2,647 compounds (subset of train) | Multitask Head 2 |
| Single-concentration log2fc | 21,003 rows (10,870 unique compounds) | Exploratory multitask (not in final ensemble) |
| Test (blinded) | 513 compounds | Prediction target |

2.2 External Data

We incorporated the following public datasets for pretraining and auxiliary multitask learning:

| Dataset | Source | Size | Usage |
|---|---|---|---|
| NCATS qHTS PXR (AID 1346982) | PubChem | 9,671 compounds | Binary activity classification head |
| NCATS qHTS LogAC50 (AID 1346985) | PubChem | 2,458 compounds | Regression head |
| Tox21 SR-ARE | MoleculeNet/DeepChem | ~8,000 compounds | Binary classification head (SR-PXR was unavailable in this release) |
| ChEMBL NR1I2 (PXR) | ChEMBL | 907 compounds with pchembl_value | Multitask Head 5 in T1v5 |
| BindingDB PXR (UniProt O75469) | BindingDB REST API | 364 novel compounds | Exploratory pretrain data |

No proprietary data was used. All external data was deduplicated against the OpenADMET train and test sets using canonical isomeric SMILES and InChIKey.
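
A minimal sketch of the overlap filter described above, assuming the canonical isomeric SMILES and InChIKey of every compound have already been computed (for example with RDKit). The record layout, the function name `drop_overlap`, and the identifier strings are illustrative placeholders and are not taken from the repository.

```python
def drop_overlap(external, challenge):
    """Keep external records whose identifiers match no challenge compound.

    Each record is a dict with precomputed "smiles" (canonical isomeric
    SMILES) and "inchikey" fields; this layout is an assumption.
    """
    seen_smiles = {rec["smiles"] for rec in challenge}
    seen_keys = {rec["inchikey"] for rec in challenge}
    return [
        rec for rec in external
        if rec["smiles"] not in seen_smiles and rec["inchikey"] not in seen_keys
    ]


# Toy identifiers (placeholders, not real InChIKeys).
challenge = [
    {"smiles": "CCO", "inchikey": "KEY-A"},
    {"smiles": "c1ccccc1", "inchikey": "KEY-B"},
]
external = [
    {"smiles": "CCO", "inchikey": "KEY-A"},          # matches on both keys
    {"smiles": "C1=CC=CC=C1", "inchikey": "KEY-B"},  # matches on InChIKey only
    {"smiles": "CCN", "inchikey": "KEY-C"},          # no match, kept
]
kept = drop_overlap(external, challenge)
print([rec["smiles"] for rec in kept])
```

Checking both identifiers means a compound is dropped if either one matches, which is the stricter of the two possible readings of "using canonical isomeric SMILES and InChIKey".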


3. Feature Engineering

3.1 Molecular Fingerprints

  • FCFP4-count-1024: RDKit MorganGenerator with MorganFeatureAtomInvGen, radius=2, fpSize=1024, count fingerprint. This outperformed binary ECFP4 and MACCS in our coverage analysis (90.6% of test compounds have a train-set neighbor at Tanimoto ≥ 0.4).

3.2 Molecular Descriptors

  • RDKit descriptors: 217 descriptors from useful_rdkit_utils.get_rdkit_desc_names() including physicochemical properties, topological indices, and fragment counts.
  • Mordred: 1,613 2D descriptors computed with mordredcommunity.
  • CheMeleon embeddings: 2,048-dimension frozen embeddings from the CheMeleon foundation model (Recursion Pharma), extracted via the OpenADMET toolkit.

3.3 Learned Embeddings

  • Chemprop D-MPNN fingerprints: 300-dimension aggregation-layer embeddings extracted from the pretrained multitask model.

4. Models

4.1 Model Family A: Chemprop D-MPNN Multitask (T1)

A directed message-passing neural network (D-MPNN) implemented in Chemprop v2 with multitask pretraining:

  • Architecture: BondMessagePassing (d_h=300, depth=3) → MeanAggregation → RegressionFFN (4 tasks, 3 hidden layers, dropout=0.1)
  • Heads:
    1. pEC50 regression (OpenADMET DRC)
    2. NCATS qHTS binary active/inactive
    3. NCATS qHTS LogAC50 regression
    4. Tox21 SR-ARE binary classification
  • Pretraining: 30 epochs on combined dataset (train + external) with equal head weights
  • Finetuning: 50 epochs on OpenADMET train only, with Head 1 weighted 4×
  • Loss: MSE (binary tasks treated as regression to 0/1 targets)
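
The head weighting above can be sketched as a masked, weighted MSE. This is an illustration in plain Python, not Chemprop's implementation: the function name `multitask_mse`, the use of `None` for a missing label, and the normalization by the summed weights of observed labels are all assumptions made for the example.

```python
def multitask_mse(preds, targets, head_weights):
    """Weighted MSE over all observed (compound, head) labels.

    preds, targets: one row per compound, one column per head.
    A target of None marks a missing label and is skipped (assumption).
    """
    total, weight_sum = 0.0, 0.0
    for pred_row, target_row in zip(preds, targets):
        for pred, target, weight in zip(pred_row, target_row, head_weights):
            if target is None:
                continue
            total += weight * (pred - target) ** 2
            weight_sum += weight
    return total / weight_sum


# Toy batch: columns are pEC50, qHTS active (0/1), LogAC50, SR-ARE (0/1).
targets = [[5.0, 1.0, None, None],
           [None, 0.0, 4.0, 1.0]]
preds = [[4.0, 0.5, 3.0, 0.2],
         [6.0, 0.5, 5.0, 0.5]]

pretrain_loss = multitask_mse(preds, targets, [1, 1, 1, 1])  # equal head weights
finetune_loss = multitask_mse(preds, targets, [4, 1, 1, 1])  # Head 1 weighted 4x
print(pretrain_loss, finetune_loss)
```

Binary heads enter the same squared-error term with 0/1 targets, which matches the "binary tasks treated as regression" note above.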

4.2 Model Family B: LightGBM Kitchen-Sink (v4)

  • Features: 4,902-dimension vector = RDKit 217 + FCFP4-count-1024 + Mordred 1,613 + CheMeleon 2,048
  • Model: LightGBM with MSE loss, 5-fold CV, early stopping
  • Hyperparameters: the number of boosting rounds for the full-train model is set by a median fold-best-iteration heuristic (the median of the per-fold early-stopping best iterations)
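
A small sketch of that heuristic, reading it as: record the early-stopped best iteration in each CV fold, then train the full-train model for the median of those counts. The iteration numbers below are made up for illustration.

```python
from statistics import median

# Hypothetical early-stopping best iterations from the 5 CV folds.
fold_best_iterations = [412, 388, 455, 401, 430]

# The full-train model has no validation fold to stop on, so the number
# of boosting rounds is fixed in advance from the fold results.
n_rounds = int(median(fold_best_iterations))
print(n_rounds)
```

The median is less sensitive than the mean to a single fold that stops unusually early or late.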

4.3 Model Family C: AutoGluon on CheMeleon (T2)

  • Features: CheMeleon 2,048-dimension frozen embeddings
  • Model: AutoGluon TabularPredictor with best_quality preset
  • Runtime: ~3.5 hours for 5-fold OOF generation

4.4 Model Family D: Chemprop + ChEMBL Multitask (T1v5)

Extends T1 with a fifth head for ChEMBL NR1I2 (PXR) pchembl_value (907 compounds). The architecture is otherwise the same as T1.


5. Ensemble Strategy

Our final submission (v43) is a cascade-weighted blend built iteratively:

v7   = 0.50 × T1 + 0.38 × v4 + 0.12 × v3  (isotonic calibrated)
v22  = 0.975 × v7 + 0.025 × KERMT
v26  = 0.90 × v22 + 0.10 × CheMeleon-FT
S6   = 0.90 × v26 + 0.10 × TabPFN
v28  = S6 + T1v2 (CYP3A4 multitask) at w=0.12
v29  = v28 + T16 (cliff-weighted Chemprop) at w=0.10
v31  = 0.83 × v29 + 0.17 × T2 (AutoGluon-CheMeleon)
v43  = 0.78 × v31 + 0.22 × T1v5 (ChEMBL multitask)

All weights were determined by grid search on honest OOF RAE. A blend step was kept only if it lowered honest OOF RAE by at least 0.003 relative to the previous step.
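
One cascade step can be sketched as follows. This is a simplified illustration: it scores raw OOF blends without the per-fold isotonic calibration, and the weight grid, the function names, and the toy arrays are assumptions rather than values from the repository.

```python
def rae(y_true, y_pred):
    """Relative absolute error; lower is better."""
    mean_y = sum(y_true) / len(y_true)
    num = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    den = sum(abs(t - mean_y) for t in y_true)
    return num / den


def blend(base, cand, w):
    """Convex blend: (1 - w) * base + w * cand."""
    return [(1 - w) * b + w * c for b, c in zip(base, cand)]


def search_weight(y, base, cand, grid, min_delta=0.003):
    """Best blend weight, or 0.0 if the RAE gain is below min_delta."""
    base_rae = rae(y, base)
    best_w, best_rae = 0.0, base_rae
    for w in grid:
        score = rae(y, blend(base, cand, w))
        if score < best_rae:
            best_w, best_rae = w, score
    if base_rae - best_rae >= min_delta:
        return best_w, best_rae
    return 0.0, base_rae  # gate failed: keep the previous blend


# Toy OOF predictions for the current blend and a candidate model.
y = [4.0, 4.5, 5.0, 5.5, 6.0]
base = [4.4, 4.4, 5.0, 5.2, 5.6]
cand = [3.8, 4.8, 5.2, 5.9, 6.3]
grid = [i / 100 for i in range(0, 51)]  # hypothetical grid: 0.00 to 0.50

w, score = search_weight(y, base, cand, grid)
print(w, score, rae(y, base))
```

Returning a weight of 0.0 when the gate fails keeps the cascade unchanged, so a candidate can never make the accepted blend worse on the gating metric.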

Calibration: IsotonicRegression (out_of_bounds='clip') applied to blend outputs. For honest CV evaluation, per-fold IsotonicRegression is fitted on each fold's OOF separately to avoid leakage.


6. Validation Strategy

6.1 Cross-Validation

  • Scheme: Butina clustering at Tanimoto cutoff 0.4 on ECFP4 2048-bit fingerprints, followed by GroupKFold (5 folds)
  • Rationale: Groups chemically similar compounds together to simulate the challenge's analog-set test construction
  • Primary metric: RAE (Relative Absolute Error) = Σ|y_true - y_pred| / Σ|y_true - mean(y_true)|; lower is better
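
The split scheme above can be sketched in plain Python. This is a toy illustration, not the repository code: fingerprints are represented as sets of on-bit indices instead of 2048-bit ECFP4 vectors, the 0.4 cutoff is read as a Tanimoto distance threshold (the convention RDKit's Butina implementation uses), and the greedy fold assignment only imitates the balancing idea behind scikit-learn's GroupKFold.

```python
from collections import Counter


def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def butina_clusters(fps, dist_cutoff=0.4):
    """Taylor-Butina style clustering; returns a cluster id per molecule."""
    n = len(fps)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i):
            if 1.0 - tanimoto(fps[i], fps[j]) <= dist_cutoff:
                neighbors[i].add(j)
                neighbors[j].add(i)
    # Visit molecules with the most neighbors first; each unassigned one
    # becomes a centroid and claims its still-unassigned neighbors.
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    cluster_of = [-1] * n
    next_id = 0
    for i in order:
        if cluster_of[i] != -1:
            continue
        cluster_of[i] = next_id
        for j in neighbors[i]:
            if cluster_of[j] == -1:
                cluster_of[j] = next_id
        next_id += 1
    return cluster_of


def group_folds(cluster_of, n_folds=5):
    """Assign whole clusters to folds, largest cluster to smallest fold."""
    fold_sizes = [0] * n_folds
    fold_of_cluster = {}
    sizes = Counter(cluster_of)
    for cid, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        fold = fold_sizes.index(min(fold_sizes))
        fold_of_cluster[cid] = fold
        fold_sizes[fold] += size
    return [fold_of_cluster[c] for c in cluster_of]


# Toy fingerprints: two analog series and one singleton.
fps = [{1, 2, 3, 4, 6}, {1, 2, 3, 4, 7}, {1, 2, 3, 4, 6, 7},
       {10, 11, 12, 13, 14}, {10, 11, 12, 13, 15}, {20, 21}]
clusters = butina_clusters(fps)
folds = group_folds(clusters, n_folds=2)
print(clusters, folds)
```

Because folds are assigned per cluster, close analogs never straddle the train/validation boundary, which is the property the rationale above relies on.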

6.2 Honest Calibration Protocol

We distinguish two calibration protocols:

  • In-sample isotonic: Single IsotonicRegression fit on full-train OOF (used for test submission; slightly optimistic)
  • Honest per-fold isotonic: Separate IsotonicRegression fit per CV fold on that fold's OOF only (used for candidate gating; conservative)

All candidates must pass the honest per-fold isotonic gate, improving on the v43 honest OOF RAE of 0.4798 by at least 0.003, before being queued for submission.
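
The two protocols can be contrasted with a short sketch. Two assumptions are made here and should be read as such. First, the calibrator is a small pure-Python pool-adjacent-violators routine standing in for scikit-learn's IsotonicRegression with out_of_bounds='clip'; it is not guaranteed to match it exactly. Second, the report does not spell out which rows each per-fold calibrator sees, so the honest variant below uses cross-fitting: the calibrator applied to a fold is fitted on the OOF predictions of the other folds.

```python
from bisect import bisect_right


def fit_isotonic(x, y):
    """Pool-adjacent-violators fit; returns (x, value) knots sorted by x."""
    blocks = []  # each block: [sum_y, count, min_x, max_x]
    for xi, yi in sorted(zip(x, y)):
        blocks.append([yi, 1, xi, xi])
        # Merge backwards while the block means violate monotonicity.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            s, c, _, hi = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
            blocks[-1][3] = hi
    knots = []
    for s, c, lo, hi in blocks:
        knots.append((lo, s / c))
        if hi > lo:
            knots.append((hi, s / c))
    return knots


def predict_isotonic(knots, x):
    """Linear interpolation between knots, clipped outside their range."""
    xs = [k[0] for k in knots]
    out = []
    for xi in x:
        if xi <= xs[0]:
            out.append(knots[0][1])
        elif xi >= xs[-1]:
            out.append(knots[-1][1])
        else:
            j = bisect_right(xs, xi)  # xs[j-1] <= xi < xs[j]
            (x0, v0), (x1, v1) = knots[j - 1], knots[j]
            out.append(v0 + (v1 - v0) * (xi - x0) / (x1 - x0))
    return out


def honest_calibrate(pred, y, fold):
    """Cross-fitted calibration: fold k is calibrated by the other folds."""
    out = [0.0] * len(pred)
    for k in sorted(set(fold)):
        train = [i for i in range(len(pred)) if fold[i] != k]
        test = [i for i in range(len(pred)) if fold[i] == k]
        knots = fit_isotonic([pred[i] for i in train], [y[i] for i in train])
        for i, v in zip(test, predict_isotonic(knots, [pred[i] for i in test])):
            out[i] = v
    return out


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


# Toy OOF predictions, labels, and CV fold ids.
pred = [4.0, 4.2, 4.5, 4.9, 5.1, 5.6, 5.8, 6.0]
y = [4.1, 4.0, 4.8, 4.6, 5.5, 5.4, 6.3, 6.1]
fold = [0, 1, 2, 0, 1, 2, 0, 1]

in_sample = predict_isotonic(fit_isotonic(pred, y), pred)
honest = honest_calibrate(pred, y, fold)
print(mse(y, pred), mse(y, in_sample), mse(y, honest))
```

The in-sample calibrator has seen every label it is scored on, so its error can only look better than or equal to the uncalibrated error in squared-error terms; the cross-fitted version has no such guarantee, which is what makes it a conservative gate.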

6.3 Statistical Rigor

  • Cluster-bootstrap 95% CI: Bootstrapped at the Butina-cluster level (not compound level) to account for chemical similarity structure
  • Bootstrap iterations: 1,000 with replacement
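
A minimal sketch of a cluster-level bootstrap for the RAE confidence interval. The percentile method, the function names, the seed, and the toy data are assumptions for illustration; the report states only the resampling unit and the iteration count.

```python
import random


def rae(y_true, y_pred):
    """Relative absolute error; lower is better."""
    mean_y = sum(y_true) / len(y_true)
    num = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    den = sum(abs(t - mean_y) for t in y_true)
    return num / den


def cluster_bootstrap_ci(y_true, y_pred, clusters, n_boot=1000, seed=0):
    """95% percentile CI for RAE, resampling whole clusters with replacement."""
    members = {}
    for i, c in enumerate(clusters):
        members.setdefault(c, []).append(i)
    ids = sorted(members)
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Draw as many clusters as exist, then take all of their compounds.
        rows = [i for c in rng.choices(ids, k=len(ids)) for i in members[c]]
        yt = [y_true[i] for i in rows]
        if max(yt) == min(yt):
            continue  # RAE is undefined when all resampled labels are equal
        stats.append(rae(yt, [y_pred[i] for i in rows]))
    stats.sort()
    lo = stats[int(0.025 * len(stats))]
    hi = stats[min(len(stats) - 1, int(0.975 * len(stats)))]
    return lo, hi


# Toy data: 10 compounds in 4 clusters.
y_true = [4.0, 4.6, 5.1, 4.2, 5.8, 4.9, 5.5, 6.2, 4.4, 5.0]
y_pred = [4.3, 4.5, 4.9, 4.6, 5.3, 5.0, 5.2, 5.7, 4.5, 5.2]
clusters = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3]

lo, hi = cluster_bootstrap_ci(y_true, y_pred, clusters)
print(lo, hi)
```

Resampling clusters instead of individual compounds keeps correlated analogs together, so the interval is not artificially narrowed by near-duplicate molecules.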

7. Submission History & Performance

| Tag | Date | LB RAE | LB Rank | Key Change |
|---|---|---|---|---|
| v2-baseline-xgb | 2026-05-06 | 0.7412 | 145/199 | XGBoost MAE on FCFP4 + RDKit |
| v3-baseline-lgbm | 2026-05-06 | 0.7249 | 131/199 | LightGBM MSE, Butina CV |
| v4-kitchen-sink | 2026-05-07 | 0.6889 | 112/201 | +Mordred +CheMeleon (4902d) |
| v7-ensemble | 2026-05-07 | 0.6039 | 42/202 | +Chemprop D-MPNN multitask |
| v31-blend | 2026-05-07 | 0.5966 | 42/207 | +AutoGluon-CheMeleon + KERMT + CheMeleon-FT |
| v43-final | 2026-05-08 | 0.586 | ~40/211 | +T1v5 ChEMBL multitask, clean hierarchy |
| v43-defensive | 2026-05-11 | 0.586 | ~40/211 | Defensive re-submit; no improved candidate found |

CV-to-LB shift trend: 0.170 → 0.158 → 0.134 → 0.107. The shift narrows as model quality improves.

Final rank as of 2026-05-11: 40-42 / 211 (top 19%). Gap to top-25: ~0.013 RAE. Gap to #1 (Yan): ~0.090 RAE.


8. Exploratory Work (Not in Final Ensemble)

We tested but did not include the following approaches due to honest-CV gate failure:

| Approach | Honest OOF RAE | Reason for Exclusion |
|---|---|---|
| Single-conc multitask (T1v7) | 0.6076 | Diluted primary task; log2fc scale mismatch |
| BindingDB + ChEMBL broad pretrain (T1v9) | 0.5538 | Distribution mismatch; no improvement over T1 |
| MaskMol / MAE ViT-Base | 0.7122 | Too weak solo; zero blend utility |
| TabPFN on CheMeleon | 0.5551 | Spearman vs v43 = 0.931 (too correlated) |
| GIN + ACtriplet (T11) | 0.5839 | Spearman vs v43 = 0.890 (too correlated) |
| Uni-MolV2 310M fine-tune | 0.630 | Poor convergence; high correlation |
| Boltz-2 structural confidence | 0.845 | Confidence scores ≠ binding affinity |
| Differentiable ensemble (14 OOFs) | 0.4966 | Worse than v43 alone; correlated errors |
| ADMET-AI features + LGBM | 0.6206 | Too weak solo; minimal blend utility |
| Tail-weighted LGBM (α=5) | 0.5619 | Marginal improvement on weak baseline |
| Precision-weighted LGBM | 0.5534 | Best non-v43 LGBM; still far from ceiling |
| KERMT higher-weight blend | 0.4939 | Spearman vs v43 = 0.910; no improvement |
| Artifact removal (pEC50 < 2) | 0.5745 | Removing "artifacts" hurt generalization |
| Test-time SMILES augmentation | 0.0376 MAE | No benefit over canonical SMILES |
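
Several exclusions above cite the Spearman correlation between a candidate's OOF predictions and v43's. A minimal pure-Python sketch of that statistic, with average ranks for ties, is shown here; the function names and toy values are illustrative, and no particular cutoff is implied beyond the values reported in the table.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Toy OOF predictions from the reference blend and a candidate model.
reference = [4.1, 4.8, 5.2, 5.9, 6.1, 4.4]
candidate = [4.3, 4.6, 5.5, 5.7, 6.3, 4.2]
rho = spearman(reference, candidate)
print(rho)
```

A candidate whose ranking of compounds nearly matches the existing blend has little new signal to contribute, even if its standalone RAE is respectable.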

9. Key Learnings

  1. Chemprop D-MPNN is the dominant diversity contributor. All attempts to replace or augment it with other learned models, whether transformer-based or graph-based (ChemBERTa, GIN, Uni-Mol), converged to Spearman > 0.88 vs the D-MPNN, suggesting they learn an equivalent 2D-graph representation on this dataset.

  2. External data quality > quantity. ChEMBL broad (2,122 compounds) and BindingDB (364 compounds) did not improve pretrain quality because their pActivity distributions (median ~5.0–6.6) differed from OpenADMET (median ~4.3). Distribution mismatch outweighed volume.

  3. Honest per-fold calibration is essential. In-sample isotonic calibration overestimates improvement by ~0.01 RAE. Our v50 catastrophic regression (CV 0.454 → LB 0.658) was caught only by the honest-CV gate.

  4. Tail compression is the structural bottleneck. Our predictions max out at ~6.3 vs a train max of 7.55. Only 11 train compounds (0.3%) have pEC50 ≥ 6.5, and none of the loss-function tweaks or ensemble methods we tried could extrapolate from so few examples. Closing the gap to top-25 (Δ = 0.013 LB RAE) would require proprietary data, 3D structural features, or a genuinely new architecture, each requiring days of setup beyond our time budget.

  5. The ceiling is real. After 35+ experiments (27 in headless loops + 8 in interactive sessions), every model class converged to Spearman > 0.87 vs v43. We consider the 2D-graph + descriptor space effectively exhausted for this dataset, and v43 the best we could achieve with public data and our current infrastructure.


10. Code and Reproducibility

All code, environment specifications, and trained model checkpoints are published at https://huggingface.co/RyeCatcher/openadmet-pxr-challenge-2026 under Apache 2.0. The repository includes:

  • code/baseline/: LGBM/XGBoost baselines (v2, v3, v4)
  • code/featurization/: Mordred, CheMeleon, FCFP4 wrappers
  • code/multitask/T1_chemprop_external_pretrain.py: Chemprop multitask training
  • code/multitask/T1v5_chemprop_chembl_nr1i2.py: 5-head Chemprop with ChEMBL NR1I2
  • code/multitask/T2_autogluon_chemeleon.py: AutoGluon-tabular on CheMeleon
  • code/ensemble/v43_final_blend.py: Final blend reproducing the submission
  • code/submit/submit_v43.py: Gradio API submission script
  • models/: Trained Chemprop and AutoGluon checkpoints (v43 lineage)
  • data/oof_predictions/: Per-track out-of-fold and test predictions
  • submission/v43_final.parquet: The 513-row submission
  • code/requirements.txt: Pinned dependency list

Dependencies: Python 3.12, RDKit 2026.03.1, Chemprop 2.2.3, PyTorch 2.11.0+cu130, LightGBM 4.6.0, XGBoost 2.1.4, scikit-learn 1.8.0, pandas 3.0.2


11. Hardware

Training was performed on an NVIDIA DGX Spark (GB10 Blackwell, 128 GB unified memory). Typical wall-clock times:

  • Chemprop pretrain (30 epochs): ~3 min
  • Chemprop 5-fold finetune: ~8 min
  • LightGBM kitchen-sink: ~15 sec
  • AutoGluon-CheMeleon: ~3.5 hours

Report finalized 2026-05-11. Best submission: v43 (RAE 0.586 LB, rank ~40/211). Phase 1 deadline: 2026-05-25.