Methodology Report — OpenADMET PXR Induction Blind Challenge

Team: BioInfo (RyeCatcher)
Track: Activity Prediction
Best Submission: v43 (LB RAE = 0.586, rank ~40/211 as of 2026-05-10)
Contact: justin@rundatarun.io
Code, models, OOF predictions, and submission: https://huggingface.co/RyeCatcher/openadmet-pxr-challenge-2026
Report version: 1.1 (2026-05-19)

1. Overview

Our approach is an ensemble built on three core model families: (1) a Chemprop D-MPNN multitask model pretrained on external ADMET data, (2) a gradient-boosted decision tree ensemble on rich molecular descriptors, and (3) an AutoGluon-tabular model on CheMeleon foundation-model embeddings. Section 4.4 adds a ChEMBL-augmented variant of the first family, and Section 5 lists the smaller components that enter the final blend. Final predictions are produced via a weighted blend with isotonic calibration.

2. Data Used

2.1 OpenADMET Provided Data

We used all provided data sources:

| Source | Size | Usage |
| --- | --- | --- |
| DRC pEC50 (train) | 4,139 compounds | Primary regression target |
| Counter-screen pEC50 | 2,647 compounds (subset of train) | Multitask Head 2 |
| Single-concentration log2fc | 21,003 rows (10,870 unique compounds) | Exploratory multitask (not in final ensemble) |
| Test (blinded) | 513 compounds | Prediction target |

2.2 External Data

We incorporated the following public datasets for pretraining and auxiliary multitask learning:

| Dataset | Source | Size | Usage |
| --- | --- | --- | --- |
| NCATS qHTS PXR (AID 1346982) | PubChem | 9,671 compounds | Binary activity classification head |
| NCATS qHTS LogAC50 (AID 1346985) | PubChem | 2,458 compounds | Regression head |
| Tox21 SR-ARE | MoleculeNet/DeepChem | ~8,000 compounds | Binary classification head (SR-PXR was unavailable in this release) |
| ChEMBL NR1I2 (PXR) | ChEMBL | 907 compounds with pchembl_value | Multitask Head 5 in T1v5 |
| BindingDB PXR (UniProt O75469) | BindingDB REST API | 364 novel compounds | Exploratory pretrain data |

No proprietary data was used. All external data was deduplicated against the OpenADMET train and test sets using canonical isomeric SMILES and InChIKey.
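The deduplication step can be sketched as a key-based filter. This is a minimal illustration rather than the repository's code: it assumes the identifiers (InChIKey here) were computed upstream, and `drop_overlap`, the record layout, and the placeholder keys are all hypothetical.

```python
def drop_overlap(external, reference_keys):
    """Keep external records whose key is absent from the reference set.

    Repeated keys inside the external set are also collapsed to the
    first occurrence.
    """
    seen = set(reference_keys)
    kept = []
    for record in external:
        key = record["inchikey"]
        if key in seen:
            continue  # overlaps train/test, or already kept once
        seen.add(key)
        kept.append(record)
    return kept


# Placeholder keys, not real InChIKeys.
challenge_keys = {"KEY-TRAIN-1", "KEY-TEST-1"}
external = [
    {"inchikey": "KEY-EXT-1", "label": 1},
    {"inchikey": "KEY-TRAIN-1", "label": 0},  # overlaps the train set
    {"inchikey": "KEY-EXT-1", "label": 1},    # duplicate within external data
    {"inchikey": "KEY-TEST-1", "label": 1},   # overlaps the blinded test set
]
clean = drop_overlap(external, challenge_keys)
```

Filtering on both canonical isomeric SMILES and InChIKey, as described above, would amount to running the same filter once per identifier type.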

3. Feature Engineering

3.1 Molecular Fingerprints

FCFP4-count-1024: RDKit Morgan generator (rdFingerprintGenerator.GetMorganGenerator) with feature atom invariants (GetMorganFeatureAtomInvGen), radius=2, fpSize=1024, count fingerprint. This outperformed binary ECFP4 and MACCS in our coverage analysis (90.6% test-to-train coverage at Tanimoto ≥ 0.4).
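One way to read the coverage figure is as the fraction of test compounds whose nearest training neighbor reaches the similarity threshold. The sketch below assumes that reading and uses the min/max form of Tanimoto for count vectors. `count_tanimoto`, `coverage`, and the toy vectors are illustrative; in practice the vectors would come from the fingerprint generator described above.

```python
def count_tanimoto(a, b):
    """Tanimoto for count vectors: sum of minima over sum of maxima."""
    num = sum(min(x, y) for x, y in zip(a, b))
    den = sum(max(x, y) for x, y in zip(a, b))
    return num / den if den else 0.0


def coverage(test_fps, train_fps, threshold=0.4):
    """Fraction of test vectors whose best train similarity meets the threshold."""
    hits = 0
    for t in test_fps:
        best = max(count_tanimoto(t, r) for r in train_fps)
        if best >= threshold:
            hits += 1
    return hits / len(test_fps)


# Toy 4-bin count vectors standing in for 1024-bin fingerprints.
train = [[2, 1, 0, 0], [0, 0, 3, 1]]
test = [[2, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]]
cov = coverage(test, train)  # two of the three test vectors are covered
```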

3.2 Molecular Descriptors

RDKit descriptors: 217 descriptors from useful_rdkit_utils.get_rdkit_desc_names() including physicochemical properties, topological indices, and fragment counts.
Mordred: 1,613 2D descriptors computed with mordredcommunity.
CheMeleon embeddings: 2,048-dimension frozen embeddings from the CheMeleon foundation model (Recursion Pharma), extracted via the OpenADMET toolkit.

3.3 Learned Embeddings

Chemprop D-MPNN fingerprints: 300-dimension aggregation-layer embeddings extracted from the pretrained multitask model.

4. Models

4.1 Model Family A: Chemprop D-MPNN Multitask (T1)

A directed message-passing neural network (D-MPNN) implemented in Chemprop v2 with multitask pretraining:

Architecture: BondMessagePassing (d_h=300, depth=3) → MeanAggregation → RegressionFFN (4 tasks, 3 hidden layers, dropout=0.1)
Heads:
1. pEC50 regression (OpenADMET DRC)
2. NCATS qHTS binary active/inactive
3. NCATS qHTS LogAC50 regression
4. Tox21 SR-ARE binary classification
Pretraining: 30 epochs on combined dataset (train + external) with equal head weights
Finetuning: 50 epochs with Head 1 weighted 4× on OpenADMET train only
Loss: MSE (binary tasks treated as regression to 0/1 targets)
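The head weighting can be illustrated with a small masked, weighted MSE. This is a sketch of the idea, not Chemprop's internal loss: masking missing labels with NaN and normalizing by the sum of weights are our assumptions for the example, and `multitask_mse` is a hypothetical helper.

```python
import numpy as np


def multitask_mse(pred, target, head_weights):
    """Weighted mean of per-head MSE; NaN targets (missing labels) are skipped."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    weights = np.asarray(head_weights, dtype=float)

    observed = ~np.isnan(target)                 # True where a label exists
    sq_err = np.where(observed, (pred - np.nan_to_num(target)) ** 2, 0.0)
    n_obs = observed.sum(axis=0)                 # labels available per head
    active = n_obs > 0                           # heads with at least one label
    per_head = sq_err.sum(axis=0)[active] / n_obs[active]
    return float((weights[active] * per_head).sum() / weights[active].sum())


# Columns: pEC50, NCATS binary, NCATS LogAC50, Tox21 binary (toy values).
pred = [[5.0, 0.8, 4.0, 0.1],
        [4.0, 0.2, 5.0, 0.9]]
target = [[5.5, 1.0, np.nan, np.nan],
          [np.nan, 0.0, 5.0, 1.0]]

pretrain_loss = multitask_mse(pred, target, [1, 1, 1, 1])  # equal head weights
finetune_loss = multitask_mse(pred, target, [4, 1, 1, 1])  # Head 1 weighted 4x
```

With the 4× weight, the pEC50 error dominates the total, which is the intended effect of the finetuning stage.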

4.2 Model Family B: LightGBM Kitchen-Sink (v4)

Features: 4,902-dimension vector = RDKit 217 + FCFP4-count-1024 + Mordred 1,613 + CheMeleon 2,048
Model: LightGBM with MSE loss, 5-fold CV, early stopping
Hyperparameters: the boosting-round count for the full-train model is set to the median of the per-fold best iterations found by early stopping
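The median fold-best-iteration heuristic fits in a few lines. The sketch assumes each CV fold reports the iteration at which early stopping selected its best model. The fold values and the helper name `full_train_rounds` are hypothetical.

```python
from statistics import median


def full_train_rounds(fold_best_iterations):
    """Boosting rounds for the full-train model: median of per-fold best iterations."""
    return max(1, int(round(median(fold_best_iterations))))


# Hypothetical best iterations from five early-stopped folds.
fold_best = [412, 388, 530, 441, 467]
n_rounds = full_train_rounds(fold_best)
```

The full-train model has no validation fold to stop on, so a fixed round count taken from CV replaces early stopping. The median is less sensitive than the mean to one unusually long or short fold.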

4.3 Model Family C: AutoGluon on CheMeleon (T2)

Features: CheMeleon 2,048-dimension frozen embeddings
Model: AutoGluon TabularPredictor with best_quality preset
Runtime: ~3.5 hours for 5-fold OOF generation

4.4 Model Family D: Chemprop + ChEMBL Multitask (T1v5)

T1v5 extends T1 with an additional head for ChEMBL NR1I2 (PXR) pchembl_value (907 compounds). The result is a 5-head Chemprop model with otherwise the same architecture as T1.

5. Ensemble Strategy

Our final submission (v43) is a cascade-weighted blend built iteratively:

v7   = 0.50 × T1 + 0.38 × v4 + 0.12 × v3  (isotonic calibrated)
v22  = 0.975 × v7 + 0.025 × KERMT
v26  = 0.90 × v22 + 0.10 × CheMeleon-FT
S6   = 0.90 × v26 + 0.10 × TabPFN
v28  = S6 + T1v2 (CYP3A4 multitask) at w=0.12
v29  = v28 + T16 (cliff-weighted Chemprop) at w=0.10
v31  = 0.83 × v29 + 0.17 × T2 (AutoGluon-CheMeleon)
v43  = 0.78 × v31 + 0.22 × T1v5 (ChEMBL multitask)

All weights were determined by grid search to minimize honest OOF RAE. A new component was kept only if it lowered honest OOF RAE by at least 0.003 relative to the previous step.
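A single step of the cascade can be sketched as a one-dimensional grid search with an acceptance gate. This is an illustration under assumptions: a convex two-way blend, a 0.01 weight grid, and synthetic predictions. `gated_blend_weight` is a hypothetical helper, not the code in `code/ensemble/`.

```python
import numpy as np


def rae(y_true, y_pred):
    """Relative absolute error; lower is better, 1.0 matches a mean predictor."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.abs(y_true - y_pred).sum() / np.abs(y_true - y_true.mean()).sum()


def gated_blend_weight(y, base, candidate, grid, min_gain=0.003):
    """Best candidate weight on the grid, or 0.0 if the RAE gain is below min_gain."""
    base_rae = rae(y, base)
    best_w, best_rae = 0.0, base_rae
    for w in grid:
        score = rae(y, (1 - w) * base + w * candidate)
        if score < best_rae:
            best_w, best_rae = float(w), score
    if base_rae - best_rae < min_gain:
        return 0.0, base_rae  # gate failed: keep the previous blend
    return best_w, best_rae


# Synthetic OOF predictions with independent errors, so blending helps.
rng = np.random.default_rng(0)
y = rng.normal(4.5, 1.0, size=500)
base = y + rng.normal(0.0, 0.5, size=500)
candidate = y + rng.normal(0.0, 0.5, size=500)
grid = np.linspace(0.0, 0.5, 51)
w, blended_rae = gated_blend_weight(y, base, candidate, grid)
```

A candidate whose errors mirror the base model's gains nothing from blending and is rejected by the gate.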

Calibration: IsotonicRegression (out_of_bounds='clip') applied to blend outputs. For honest CV evaluation, per-fold IsotonicRegression is fitted on each fold's OOF separately to avoid leakage.

6. Validation Strategy

6.1 Cross-Validation

Scheme: Butina clustering at Tanimoto cutoff 0.4 on ECFP4 2048-bit fingerprints, followed by GroupKFold (5 folds)
Rationale: Groups chemically similar compounds together to simulate the challenge's analog-set test construction
Primary metric: RAE (Relative Absolute Error) = Σ|y_true - y_pred| / Σ|y_true - mean(y_true)|
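The split and the metric can be sketched together. The example assumes cluster labels are already available (for instance from Butina clustering) and uses random labels as a stand-in. It checks that no cluster straddles a fold boundary and scores a train-mean baseline, which should land near RAE = 1.

```python
import numpy as np
from sklearn.model_selection import GroupKFold


def rae(y_true, y_pred):
    """RAE = sum|y_true - y_pred| / sum|y_true - mean(y_true)|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.abs(y_true - y_pred).sum() / np.abs(y_true - y_true.mean()).sum()


rng = np.random.default_rng(0)
n = 200
cluster_ids = rng.integers(0, 40, size=n)  # stand-in for Butina cluster labels
y = rng.normal(4.5, 1.0, size=n)           # synthetic pEC50-like targets

folds = list(GroupKFold(n_splits=5).split(np.zeros((n, 1)), y, groups=cluster_ids))

fold_rae = []
for train_idx, valid_idx in folds:
    # Baseline: predict the training-fold mean for every held-out compound.
    baseline = np.full(len(valid_idx), y[train_idx].mean())
    fold_rae.append(rae(y[valid_idx], baseline))
```

Because whole clusters are held out together, a model cannot score well by memorizing near-duplicates of validation compounds.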

6.2 Honest Calibration Protocol

We distinguish two calibration protocols:

In-sample isotonic: Single IsotonicRegression fit on full-train OOF (used for test submission; slightly optimistic)
Honest per-fold isotonic: Separate IsotonicRegression fit per CV fold on that fold's OOF only (used for candidate gating; conservative)

All candidates must pass the honest per-fold isotonic gate (an RAE reduction of at least 0.003 relative to the v43 honest OOF RAE of 0.4798) before queueing for submission.
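The two protocols can be contrasted in code. The report does not spell out which rows each per-fold calibrator is applied to, so the sketch assumes one leakage-free reading: for every held-out fold, the calibrator is fitted on the OOF predictions of the remaining folds. The function names and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression


def in_sample_isotonic(oof_pred, y):
    """Single calibrator fitted and applied on the same OOF rows (optimistic)."""
    iso = IsotonicRegression(out_of_bounds="clip")
    return iso.fit_transform(oof_pred, y)


def cross_fitted_isotonic(oof_pred, y, fold_ids):
    """Calibrate each fold with a calibrator fitted on the other folds only."""
    calibrated = np.empty_like(oof_pred, dtype=float)
    for fold in np.unique(fold_ids):
        held_out = fold_ids == fold
        iso = IsotonicRegression(out_of_bounds="clip")
        iso.fit(oof_pred[~held_out], y[~held_out])
        calibrated[held_out] = iso.predict(oof_pred[held_out])
    return calibrated


# Synthetic OOF predictions that compress the target range.
rng = np.random.default_rng(0)
n = 300
y = rng.normal(4.5, 1.0, size=n)
oof_pred = 4.5 + 0.6 * (y - 4.5) + rng.normal(0.0, 0.3, size=n)
fold_ids = np.arange(n) % 5

optimistic = in_sample_isotonic(oof_pred, y)
honest = cross_fitted_isotonic(oof_pred, y, fold_ids)
```

Scoring `honest` rather than `optimistic` keeps the calibrator from seeing the labels it is evaluated on.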

6.3 Statistical Rigor

Cluster-bootstrap 95% CI: Bootstrapped at the Butina-cluster level (not compound level) to account for chemical similarity structure
Bootstrap iterations: 1,000 with replacement
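A cluster-level bootstrap resamples whole clusters, then pools their compounds before scoring. The sketch below uses a percentile interval, which is our assumption since the interval type is not stated above. `cluster_bootstrap_ci` and the synthetic data are hypothetical.

```python
import numpy as np


def rae(y_true, y_pred):
    """Relative absolute error against a mean predictor."""
    return np.abs(y_true - y_pred).sum() / np.abs(y_true - y_true.mean()).sum()


def cluster_bootstrap_ci(y, pred, cluster_ids, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for RAE, resampling clusters (not compounds) with replacement."""
    rng = np.random.default_rng(seed)
    clusters = np.unique(cluster_ids)
    members = {c: np.flatnonzero(cluster_ids == c) for c in clusters}
    stats = np.empty(n_boot)
    for b in range(n_boot):
        drawn = rng.choice(clusters, size=len(clusters), replace=True)
        idx = np.concatenate([members[c] for c in drawn])  # all compounds of drawn clusters
        stats[b] = rae(y[idx], pred[idx])
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)


# Synthetic data: 300 compounds spread over up to 60 clusters.
rng = np.random.default_rng(1)
cluster_ids = rng.integers(0, 60, size=300)
y = rng.normal(4.5, 1.0, size=300)
pred = y + rng.normal(0.0, 0.5, size=300)

point = rae(y, pred)
lo, hi = cluster_bootstrap_ci(y, pred, cluster_ids)
```

Resampling at the cluster level widens the interval when errors are correlated within a chemical series, which compound-level resampling would understate.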

7. Submission History & Performance

| Tag | Date | LB RAE | LB Rank | Key Change |
| --- | --- | --- | --- | --- |
| v2-baseline-xgb | 2026-05-06 | 0.7412 | 145/199 | XGBoost MAE on FCFP4 + RDKit |
| v3-baseline-lgbm | 2026-05-06 | 0.7249 | 131/199 | LightGBM MSE, Butina CV |
| v4-kitchen-sink | 2026-05-07 | 0.6889 | 112/201 | +Mordred +CheMeleon (4902d) |
| v7-ensemble | 2026-05-07 | 0.6039 | 42/202 | +Chemprop D-MPNN multitask |
| v31-blend | 2026-05-07 | 0.5966 | 42/207 | +AutoGluon-CheMeleon + KERMT + CheMeleon-FT |
| v43-final | 2026-05-08 | 0.586 | ~40/211 | +T1v5 ChEMBL multitask, clean hierarchy |
| v43-defensive | 2026-05-11 | 0.586 | ~40/211 | Defensive re-submit; no improved candidate found |

CV-to-LB shift trend: 0.170 → 0.158 → 0.134 → 0.107. The shift narrows as model quality improves.

Final rank as of 2026-05-11: 40-42 / 211 (roughly the top 20%). Gap to top-25: ~0.013 RAE. Gap to #1 (Yan): ~0.090 RAE.

8. Exploratory Work (Not in Final Ensemble)

We tested but did not include the following approaches due to honest-CV gate failure:

| Approach | Honest OOF RAE | Reason for Exclusion |
| --- | --- | --- |
| Single-conc multitask (T1v7) | 0.6076 | Diluted primary task; log2fc scale mismatch |
| BindingDB + ChEMBL broad pretrain (T1v9) | 0.5538 | Distribution mismatch; no improvement over T1 |
| MaskMol / MAE ViT-Base | 0.7122 | Too weak solo; zero blend utility |
| TabPFN on CheMeleon | 0.5551 | Spearman vs v43 = 0.931 (too correlated) |
| GIN + ACtriplet (T11) | 0.5839 | Spearman vs v43 = 0.890 (too correlated) |
| Uni-MolV2 310M fine-tune | 0.630 | Poor convergence; high correlation |
| Boltz-2 structural confidence | 0.845 | Confidence scores ≠ binding affinity |
| Differentiable ensemble (14 OOFs) | 0.4966 | Worse than v43 alone; correlated errors |
| ADMET-AI features + LGBM | 0.6206 | Too weak solo; minimal blend utility |
| Tail-weighted LGBM (α=5) | 0.5619 | Marginal improvement on weak baseline |
| Precision-weighted LGBM | 0.5534 | Best non-v43 LGBM; still far from ceiling |
| KERMT higher-weight blend | 0.4939 | Spearman vs v43 = 0.910; no improvement |
| Artifact removal (pEC50 < 2) | 0.5745 | Removing "artifacts" hurt generalization |
| Test-time SMILES augmentation | 0.0376 MAE | No benefit over canonical SMILES |
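The "too correlated" screen in the table can be sketched as a rank-correlation check between a candidate's OOF predictions and the reference blend's. The 0.88 cutoff below is illustrative only, since the table reports correlations but no fixed threshold. `too_correlated` and the synthetic vectors are hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr


def too_correlated(candidate_oof, reference_oof, max_rho=0.88):
    """Return (rho, flag); flag is True when the candidate adds little rank diversity."""
    rho, _ = spearmanr(candidate_oof, reference_oof)
    return float(rho), bool(rho > max_rho)


rng = np.random.default_rng(0)
reference = rng.normal(4.5, 0.8, size=500)               # stand-in for v43 OOF predictions
near_copy = reference + rng.normal(0.0, 0.08, size=500)  # almost the same ranking
unrelated = rng.normal(4.5, 0.8, size=500)               # independent ranking

rho_copy, flag_copy = too_correlated(near_copy, reference)
rho_other, flag_other = too_correlated(unrelated, reference)
```

A low correlation alone does not justify inclusion: the candidate still has to be accurate enough to pass the honest-CV gate.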

9. Key Learnings

Chemprop D-MPNN is the dominant diversity contributor. All attempts to replace or augment it with other learned models (the ChemBERTa transformer, the GIN graph network, and Uni-Mol) converged to Spearman > 0.88 against the D-MPNN, suggesting that they learn largely equivalent 2D-graph information on this dataset.
External data quality > quantity. ChEMBL broad (2,122 compounds) and BindingDB (364 compounds) did not improve pretrain quality because their pActivity distributions (median ~5.0–6.6) differed from OpenADMET (median ~4.3). Distribution mismatch outweighed volume.
Honest per-fold calibration is essential. In-sample isotonic calibration overestimates improvement by ~0.01 RAE. Our v50 catastrophic regression (CV 0.454 → LB 0.658) was caught only by the honest-CV gate.
Tail compression is the structural bottleneck. Our predictions max out at ~6.3, whereas the train maximum is 7.55. Only 11 train compounds (0.3%) have pEC50 ≥ 6.5, and none of the loss-function tweaks or ensemble methods we tried could extrapolate from so few examples. Closing the gap to top-25 (Δ = 0.013 LB RAE) would likely require proprietary data, 3D structural features, or a genuinely new architecture, each of which needs days of setup beyond our time budget.
The ceiling is real. After 35+ experiments (27 in headless loops + 8 in interactive sessions), every model class converged to Spearman > 0.87 vs v43. We consider the 2D-graph + descriptor space effectively exhausted for our setup, and v43 the best we could achieve with public data and our current infrastructure.

10. Code and Reproducibility

All code, environment specifications, and trained model checkpoints are published at https://huggingface.co/RyeCatcher/openadmet-pxr-challenge-2026 under Apache 2.0. The repository includes:

code/baseline/ — LGBM/XGBoost baselines (v2, v3, v4)
code/featurization/ — Mordred, CheMeleon, FCFP4 wrappers
code/multitask/T1_chemprop_external_pretrain.py — Chemprop multitask training
code/multitask/T1v5_chemprop_chembl_nr1i2.py — 5-head Chemprop with ChEMBL NR1I2
code/multitask/T2_autogluon_chemeleon.py — AutoGluon-tabular on CheMeleon
code/ensemble/v43_final_blend.py — Final blend reproducing the submission
code/submit/submit_v43.py — Gradio API submission script
models/ — Trained Chemprop and AutoGluon checkpoints (v43 lineage)
data/oof_predictions/ — Per-track out-of-fold and test predictions
submission/v43_final.parquet — The 513-row submission
code/requirements.txt — Pinned dependency list

Dependencies: Python 3.12, RDKit 2026.03.1, Chemprop 2.2.3, PyTorch 2.11.0+cu130, LightGBM 4.6.0, XGBoost 2.1.4, scikit-learn 1.8.0, pandas 3.0.2

11. Hardware

Training was performed on an NVIDIA DGX Spark (GB10 Blackwell, 128GB unified memory). Typical wall-clock times:

Chemprop pretrain (30 epochs): ~3 min
Chemprop 5-fold finetune: ~8 min
LightGBM kitchen-sink: ~15 sec
AutoGluon-CheMeleon: ~3.5 hours

Report finalized 2026-05-11. Best submission: v43 (RAE 0.586 LB, rank ~40/211). Phase 1 deadline: 2026-05-25.