	# Methodology Report — OpenADMET PXR Induction Blind Challenge

	Team: BioInfo (RyeCatcher)
	Track: Activity Prediction
	Best Submission: v43 (LB RAE = 0.586, rank ~40/211 as of 2026-05-10)
	Contact: justin@rundatarun.io
	Code, models, OOF predictions, and submission: https://huggingface.co/RyeCatcher/openadmet-pxr-challenge-2026
	Report version: 1.1 (2026-05-19)

	---

	## 1. Overview

	Our approach is an ensemble built around three core model families: (1) a Chemprop D-MPNN multitask model pretrained on external ADMET data, (2) a gradient-boosted decision tree ensemble on rich molecular descriptors, and (3) an AutoGluon-tabular model on CheMeleon foundation-model embeddings. Final predictions are a weighted blend with isotonic calibration; the full blend (Section 5) also folds in the additional components listed there.

	---

	## 2. Data Used

	### 2.1 OpenADMET Provided Data

	We used all provided data sources:

	| Source | Size | Usage |
	|---|---|---|
	| DRC pEC50 (train) | 4,139 compounds | Primary regression target |
	| Counter-screen pEC50 | 2,647 compounds (subset of train) | Multitask Head 2 |
	| Single-concentration log2fc | 21,003 rows (10,870 unique compounds) | Exploratory multitask (not in final ensemble) |
	| Test (blinded) | 513 compounds | Prediction target |

	### 2.2 External Data

	We incorporated the following public datasets for pretraining and auxiliary multitask learning:

	| Dataset | Source | Size | Usage |
	|---|---|---|---|
	| NCATS qHTS PXR (AID 1346982) | PubChem | 9,671 compounds | Binary activity classification head |
	| NCATS qHTS LogAC50 (AID 1346985) | PubChem | 2,458 compounds | Regression head |
	| Tox21 SR-ARE | MoleculeNet/DeepChem | ~8,000 compounds | Binary classification head (SR-PXR was unavailable in this release) |
	| ChEMBL NR1I2 (PXR) | ChEMBL | 907 compounds with pchembl_value | Multitask Head 5 in T1v5 |
	| BindingDB PXR (UniProt O75469) | BindingDB REST API | 364 novel compounds | Exploratory pretrain data |

	No proprietary data was used. All external data was deduplicated against the OpenADMET train and test sets using canonical isomeric SMILES and InChIKey.
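The deduplication step can be sketched as a key-overlap filter. This is a minimal illustration, not the repository's code: it assumes canonical isomeric SMILES and InChIKeys were already computed upstream (for example with RDKit), and the record fields `smiles` and `inchikey` as well as the toy identifiers are hypothetical.

```python
def dedupe_external(external, reference_keys):
    """Drop external records that collide with the challenge data.

    external: list of dicts with hypothetical fields "smiles" and "inchikey",
              both assumed to be canonicalized upstream.
    reference_keys: set holding the canonical SMILES and InChIKeys of every
                    train and test compound.
    """
    kept = []
    for rec in external:
        # A match on either identifier is enough to discard the record.
        if rec["smiles"] in reference_keys or rec["inchikey"] in reference_keys:
            continue
        kept.append(rec)
    return kept


# Toy identifiers for illustration only (not real challenge compounds).
reference_keys = {"CCO", "KEY-ETHANOL", "c1ccccc1", "KEY-BENZENE"}
external = [
    {"smiles": "CCO", "inchikey": "KEY-ETHANOL"},          # SMILES match: dropped
    {"smiles": "CCN", "inchikey": "KEY-ETHYLAMINE"},       # novel: kept
    {"smiles": "C1=CC=CC=C1", "inchikey": "KEY-BENZENE"},  # InChIKey match: dropped
]
clean = dedupe_external(external, reference_keys)
print([rec["smiles"] for rec in clean])
```

Checking both identifiers guards against cases where two SMILES strings for the same structure were not canonicalized identically.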

	---

	## 3. Feature Engineering

	### 3.1 Molecular Fingerprints

	- FCFP4-count-1024: RDKit Morgan generator (`rdFingerprintGenerator.GetMorganGenerator`) with the feature-based atom invariants from `GetMorganFeatureAtomInvGen()`, radius=2, fpSize=1024, count fingerprint. This outperformed binary ECFP4 and MACCS in our coverage analysis: 90.6% of test compounds have a train neighbor with Tanimoto ≥ 0.4.
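A coverage figure of this kind can be computed as sketched below. The sketch reads "coverage" as the fraction of test compounds whose nearest train neighbor reaches the similarity threshold, and it uses the min/max generalization of Tanimoto for count vectors; both choices are assumptions rather than the report's documented procedure. Fingerprints are represented as plain `{bit: count}` dicts so the example runs without RDKit.

```python
def count_tanimoto(a, b):
    """Tanimoto similarity for count fingerprints stored as {bit: count}."""
    bits = set(a) | set(b)
    shared = sum(min(a.get(k, 0), b.get(k, 0)) for k in bits)
    total = sum(max(a.get(k, 0), b.get(k, 0)) for k in bits)
    return shared / total if total else 0.0


def coverage(test_fps, train_fps, threshold=0.4):
    """Fraction of test fingerprints whose most similar train fingerprint
    reaches the threshold."""
    hits = sum(
        1
        for t in test_fps
        if max(count_tanimoto(t, tr) for tr in train_fps) >= threshold
    )
    return hits / len(test_fps)


# Toy fingerprints (hypothetical bit indices and counts).
train_fps = [{1: 2, 5: 1, 9: 3}, {2: 1, 7: 4}]
test_fps = [{1: 2, 5: 1, 9: 1}, {3: 5}]
print(coverage(test_fps, train_fps))  # first test compound is covered, second is not
```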

	### 3.2 Molecular Descriptors

	- RDKit descriptors: 217 descriptors from `useful_rdkit_utils.get_rdkit_desc_names()` including physicochemical properties, topological indices, and fragment counts.
	- Mordred: 1,613 2D descriptors computed with `mordredcommunity`.
	- CheMeleon embeddings: 2,048-dimension frozen embeddings from the CheMeleon foundation model (Recursion Pharma), extracted via the OpenADMET toolkit.

	### 3.3 Learned Embeddings

	- Chemprop D-MPNN fingerprints: 300-dimension aggregation-layer embeddings extracted from the pretrained multitask model.

	---

	## 4. Models

	### 4.1 Model Family A: Chemprop D-MPNN Multitask (T1)

	A directed message-passing neural network (D-MPNN) implemented in Chemprop v2 with multitask pretraining:

	- Architecture: BondMessagePassing (d_h=300, depth=3) → MeanAggregation → RegressionFFN (4 tasks, 3 hidden layers, dropout=0.1)
	- Heads:
	   1. pEC50 regression (OpenADMET DRC)
	   2. NCATS qHTS binary active/inactive
	   3. NCATS qHTS LogAC50 regression
	   4. Tox21 SR-ARE binary classification
	- Pretraining: 30 epochs on combined dataset (train + external) with equal head weights
	- Finetuning: 50 epochs with Head 1 weighted 4× on OpenADMET train only
	- Loss: MSE (binary tasks treated as regression to 0/1 targets)
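To make the head weighting concrete, here is a framework-free sketch of a weighted multitask MSE in which compounds lacking a label for a head are skipped. It is not Chemprop's implementation: the per-head averaging and the normalization by the sum of weights are assumptions, and the numbers are toy values.

```python
def multitask_mse(preds, targets, head_weights):
    """Weighted MSE over heads, ignoring missing labels (None).

    preds, targets: lists of rows, one value per head.
    head_weights: one non-negative weight per head.
    """
    total, weight_sum = 0.0, 0.0
    for h, w in enumerate(head_weights):
        errs = [
            (p[h] - t[h]) ** 2
            for p, t in zip(preds, targets)
            if t[h] is not None
        ]
        if not errs:  # no labels for this head in the batch
            continue
        total += w * sum(errs) / len(errs)
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


# Two compounds, four heads; binary heads use 0/1 regression targets.
preds = [[5.0, 0.8, -5.2, 0.1], [4.0, 0.3, -4.9, 0.6]]
targets = [[5.5, 1.0, None, 0.0], [4.0, None, -5.0, 1.0]]

pretrain_loss = multitask_mse(preds, targets, [1, 1, 1, 1])  # equal head weights
finetune_loss = multitask_mse(preds, targets, [4, 1, 1, 1])  # Head 1 weighted 4x
print(pretrain_loss, finetune_loss)
```

With the 4× weight, errors on the pEC50 head dominate the gradient signal while the auxiliary heads still act as a regularizer.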

	### 4.2 Model Family B: LightGBM Kitchen-Sink (v4)

	- Features: 4,902-dimension vector = RDKit 217 + FCFP4-count-1024 + Mordred 1,613 + CheMeleon 2,048
	- Model: LightGBM with MSE loss, 5-fold CV, early stopping
	- Hyperparameters: the number of boosting rounds for the full-train model is set by a median heuristic over the per-fold best (early-stopped) iterations
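One reading of the median heuristic is sketched below: since the full-train model has no held-out fold for early stopping, its number of boosting rounds is taken from the cross-validation runs. The function name and the per-fold numbers are hypothetical.

```python
from statistics import median


def full_train_rounds(fold_best_iterations):
    """Boosting rounds for the final fit: the median of the early-stopped
    best iterations observed across the CV folds."""
    return int(round(median(fold_best_iterations)))


# Hypothetical best iterations from a 5-fold run.
fold_best = [412, 388, 530, 451, 397]
n_rounds = full_train_rounds(fold_best)
print(n_rounds)
```

The median is less sensitive than the mean to a single fold that stopped unusually late or early.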

	### 4.3 Model Family C: AutoGluon on CheMeleon (T2)

	- Features: CheMeleon 2,048-dimension frozen embeddings
	- Model: AutoGluon TabularPredictor with `best_quality` preset
	- Runtime: ~3.5 hours for 5-fold OOF generation

	### 4.4 Model Family D: Chemprop + ChEMBL Multitask (T1v5)

	Extends T1 with an additional head for ChEMBL NR1I2 PXR pchembl_value (907 compounds). The result is a 5-head Chemprop model with otherwise the same architecture as T1.

	---

	## 5. Ensemble Strategy

	Our final submission (v43) is a cascade-weighted blend built iteratively:

	```
	v7 = 0.50 × T1 + 0.38 × v4 + 0.12 × v3 (isotonic calibrated)
	v22 = 0.975 × v7 + 0.025 × KERMT
	v26 = 0.90 × v22 + 0.10 × CheMeleon-FT
	S6 = 0.90 × v26 + 0.10 × TabPFN
	v28 = S6 + T1v2 (CYP3A4 multitask) at w=0.12
	v29 = v28 + T16 (cliff-weighted Chemprop) at w=0.10
	v31 = 0.83 × v29 + 0.17 × T2 (AutoGluon-CheMeleon)
	v43 = 0.78 × v31 + 0.22 × T1v5 (ChEMBL multitask)
	```

	All weights were chosen by grid search on honest OOF RAE (lower is better). A step was kept only if it reduced honest OOF RAE by at least 0.003 relative to the previous step.
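A single cascade step can be sketched as a gated grid search. This is an illustration under assumptions: each step is treated as a convex combination `(1 - w) * base + w * new` (the v28 and v29 lines state only the new component's weight), the per-fold calibration is omitted, and the toy predictions are hypothetical.

```python
def rae(y_true, y_pred):
    """Relative absolute error: sum|y - pred| / sum|y - mean(y)|."""
    mean = sum(y_true) / len(y_true)
    num = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    den = sum(abs(t - mean) for t in y_true)
    return num / den


def blend(base, new, w):
    """Convex combination of two prediction vectors."""
    return [(1 - w) * b + w * n for b, n in zip(base, new)]


def best_blend_weight(y_true, base, new, weights, min_delta=0.003):
    """Return (weight, RAE) of the best blend, or (0.0, base RAE) when the
    best candidate does not beat the base by at least min_delta."""
    base_rae = rae(y_true, base)
    best_w, best_rae = 0.0, base_rae
    for w in weights:
        score = rae(y_true, blend(base, new, w))
        if score < best_rae:
            best_w, best_rae = w, score
    if base_rae - best_rae < min_delta:
        return 0.0, base_rae  # gate failed: keep the previous blend
    return best_w, best_rae


# Toy OOF predictions whose errors cancel when averaged.
y_true = [4.0, 5.0, 6.0, 7.0]
base = [4.4, 5.4, 5.6, 6.6]
new = [3.6, 4.6, 6.4, 7.4]
grid = [i / 20 for i in range(11)]  # 0.00, 0.05, ..., 0.50

w, score = best_blend_weight(y_true, base, new, grid)
print(w, score)
```

The `min_delta` gate is what keeps the cascade from accumulating components that only fit noise in the OOF predictions.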

	Calibration: IsotonicRegression (out_of_bounds='clip') applied to blend outputs. For honest CV evaluation, per-fold IsotonicRegression is fitted on each fold's OOF separately to avoid leakage.

	---

	## 6. Validation Strategy

	### 6.1 Cross-Validation

	- Scheme: Butina clustering at Tanimoto cutoff 0.4 on ECFP4 2048-bit fingerprints, followed by GroupKFold (5 folds)
	- Rationale: Groups chemically similar compounds together to simulate the challenge's analog-set test construction
	- Primary metric: RAE (Relative Absolute Error) = Σ|y_true - y_pred| / Σ|y_true - mean(y_true)|
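The grouping constraint and the metric can be sketched together in plain Python. The fold assignment below is a greedy stand-in (largest clusters first, each into the currently smallest fold) and is not scikit-learn's `GroupKFold` itself; the cluster labels are assumed to come from a Butina clustering computed elsewhere, and the toy labels are hypothetical.

```python
from collections import Counter


def rae(y_true, y_pred):
    """Relative absolute error; 1.0 matches always predicting the mean."""
    mean = sum(y_true) / len(y_true)
    num = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    den = sum(abs(t - mean) for t in y_true)
    return num / den


def assign_group_folds(groups, n_folds=5):
    """Return one fold index per compound so that no cluster spans folds."""
    sizes = Counter(groups)
    fold_sizes = [0] * n_folds
    fold_of_group = {}
    # Largest clusters first; ties broken by label for determinism.
    for g, size in sorted(sizes.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        f = fold_sizes.index(min(fold_sizes))  # currently smallest fold
        fold_of_group[g] = f
        fold_sizes[f] += size
    return [fold_of_group[g] for g in groups]


# Toy cluster labels for ten compounds.
clusters = [0, 0, 0, 1, 1, 2, 2, 3, 4, 5]
folds = assign_group_folds(clusters, n_folds=3)
print(folds)

# A constant mean prediction scores exactly 1.0.
print(rae([1.0, 2.0, 3.0, 4.0], [2.5, 2.5, 2.5, 2.5]))
```

Because whole clusters are held out together, a validation compound never has a near-duplicate in its training folds, which is the property the analog-set test construction calls for.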

	### 6.2 Honest Calibration Protocol

	We distinguish two calibration protocols:
	- In-sample isotonic: Single IsotonicRegression fit on full-train OOF (used for test submission; slightly optimistic)
	- Honest per-fold isotonic: Separate IsotonicRegression fit per CV fold on that fold's OOF only (used for candidate gating; conservative)

	All candidates must pass the honest per-fold isotonic gate before queueing for submission: honest OOF RAE must improve on v43's 0.4798 by at least 0.003, i.e. reach 0.4768 or lower.

	### 6.3 Statistical Rigor

	- Cluster-bootstrap 95% CI: Bootstrapped at the Butina-cluster level (not compound level) to account for chemical similarity structure
	- Bootstrap iterations: 1,000 with replacement
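A cluster-level bootstrap can be sketched as follows: whole clusters are resampled with replacement, all their compounds are pooled, and the metric is recomputed on each resample. The percentile interval, the seed, and the toy data are assumptions for illustration; the report does not state which interval construction was used.

```python
import random


def rae(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    num = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    den = sum(abs(t - mean) for t in y_true)
    return num / den


def cluster_bootstrap_ci(y_true, y_pred, clusters, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for RAE, resampling clusters rather than compounds."""
    rng = random.Random(seed)
    members = {}
    for i, c in enumerate(clusters):
        members.setdefault(c, []).append(i)
    cluster_ids = list(members)

    stats = []
    for _ in range(n_boot):
        # Draw as many clusters as exist, with replacement.
        sample = rng.choices(cluster_ids, k=len(cluster_ids))
        idx = [i for c in sample for i in members[c]]
        stats.append(rae([y_true[i] for i in idx], [y_pred[i] for i in idx]))

    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int(round((1 - alpha / 2) * n_boot)) - 1)]
    return lo, hi


# Toy data: four clusters of two compounds each (hypothetical values).
y_true = [4.0, 5.0, 6.0, 7.0, 4.5, 6.5, 5.5, 7.5]
y_pred = [4.3, 5.2, 5.6, 6.5, 4.9, 6.0, 5.5, 7.0]
clusters = ["a", "a", "b", "b", "c", "c", "d", "d"]

lo, hi = cluster_bootstrap_ci(y_true, y_pred, clusters)
print(lo, hi)
```

Resampling compounds independently would treat close analogs as separate evidence and produce intervals that are too narrow; resampling clusters keeps correlated compounds together.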

	---

	## 7. Submission History & Performance

	| Tag | Date | LB RAE | LB Rank | Key Change |
	|---|---|---|---|---|
	| v2-baseline-xgb | 2026-05-06 | 0.7412 | 145/199 | XGBoost MAE on FCFP4 + RDKit |
	| v3-baseline-lgbm | 2026-05-06 | 0.7249 | 131/199 | LightGBM MSE, Butina CV |
	| v4-kitchen-sink | 2026-05-07 | 0.6889 | 112/201 | +Mordred +CheMeleon (4902d) |
	| v7-ensemble | 2026-05-07 | 0.6039 | 42/202 | +Chemprop D-MPNN multitask |
	| v31-blend | 2026-05-07 | 0.5966 | 42/207 | +AutoGluon-CheMeleon + KERMT + CheMeleon-FT |
	| v43-final | 2026-05-08 | 0.586 | ~40/211 | +T1v5 ChEMBL multitask, clean hierarchy |
	| v43-defensive | 2026-05-11 | 0.586 | ~40/211 | Defensive re-submit; no improved candidate found |

	CV-to-LB shift trend: 0.170 → 0.158 → 0.134 → 0.107. The shift narrows as model quality improves.

	Final rank as of 2026-05-11: 40-42 / 211 (top 19%). Gap to top-25: ~0.013 RAE. Gap to #1 (Yan): ~0.090 RAE.

	---

	## 8. Exploratory Work (Not in Final Ensemble)

	We tested but did not include the following approaches due to honest-CV gate failure:

	| Approach | Honest OOF RAE | Reason for Exclusion |
	|---|---|---|
	| Single-conc multitask (T1v7) | 0.6076 | Diluted primary task; log2fc scale mismatch |
	| BindingDB + ChEMBL broad pretrain (T1v9) | 0.5538 | Distribution mismatch; no improvement over T1 |
	| MaskMol / MAE ViT-Base | 0.7122 | Too weak solo; zero blend utility |
	| TabPFN on CheMeleon | 0.5551 | Spearman vs v43 = 0.931 (too correlated) |
	| GIN + ACtriplet (T11) | 0.5839 | Spearman vs v43 = 0.890 (too correlated) |
	| Uni-MolV2 310M fine-tune | 0.630 | Poor convergence; high correlation |
	| Boltz-2 structural confidence | 0.845 | Confidence scores ≠ binding affinity |
	| Differentiable ensemble (14 OOFs) | 0.4966 | Worse than v43 alone; correlated errors |
	| ADMET-AI features + LGBM | 0.6206 | Too weak solo; minimal blend utility |
	| Tail-weighted LGBM (α=5) | 0.5619 | Marginal improvement on weak baseline |
	| Precision-weighted LGBM | 0.5534 | Best non-v43 LGBM; still far from ceiling |
	| KERMT higher-weight blend | 0.4939 | Spearman vs v43 = 0.910; no improvement |
	| Artifact removal (pEC50 < 2) | 0.5745 | Removing "artifacts" hurt generalization |
	| Test-time SMILES augmentation | 0.0376 MAE | No benefit over canonical SMILES |
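Several exclusions above rest on the Spearman correlation between a candidate's OOF predictions and those of v43. A dependency-free sketch of that check follows; it uses average ranks for ties, and the example vectors are hypothetical rather than taken from the repository.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical OOF predictions from the reference blend and a candidate.
reference = [4.1, 4.8, 5.3, 5.9, 6.2]
candidate = [4.0, 5.0, 4.9, 6.1, 6.0]
print(spearman(reference, candidate))
```

A candidate that ranks compounds almost identically to the existing blend has little new information to contribute, even if its solo RAE looks competitive.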

	---

	## 9. Key Learnings

	1. Chemprop D-MPNN is the dominant diversity contributor. All attempts to replace or augment it with other learned representations, whether transformer-based or graph-based (ChemBERTa, GIN, Uni-Mol), converged to Spearman > 0.88 vs the D-MPNN representation, suggesting that they learn largely equivalent 2D-graph information on this dataset.

	2. External data quality > quantity. ChEMBL broad (2,122 compounds) and BindingDB (364 compounds) did not improve pretrain quality because their pActivity distributions (median ~5.0–6.6) differed from OpenADMET (median ~4.3). Distribution mismatch outweighed volume.

	3. Honest per-fold calibration is essential. In-sample isotonic calibration overestimates improvement by ~0.01 RAE. Our v50 catastrophic regression (CV 0.454 → LB 0.658) was caught only by the honest-CV gate.

	4. Tail compression is the structural bottleneck. Our predictions max at ~6.3 vs train max 7.55. Only 11 train compounds (0.3%) have pEC50 ≥ 6.5. No loss-function tweak or ensemble method can extrapolate from 11 examples. Closing the gap to top-25 (Δ = 0.013 LB RAE) would require either proprietary data, 3D structural features, or a genuinely new architecture — each requiring days of setup beyond our time budget.

	5. The ceiling is real. After 35+ experiments (27 in headless loops + 8 in interactive sessions), every model class converged to Spearman > 0.87 vs v43. The 2D-graph + descriptor space is fully explored. v43 is the best achievable with public data and current infrastructure.

	---

	## 10. Code and Reproducibility

	All code, environment specifications, and trained model checkpoints are published at https://huggingface.co/RyeCatcher/openadmet-pxr-challenge-2026 under Apache 2.0. The repository includes:

	- `code/baseline/` — LGBM/XGBoost baselines (v2, v3, v4)
	- `code/featurization/` — Mordred, CheMeleon, FCFP4 wrappers
	- `code/multitask/T1_chemprop_external_pretrain.py` — Chemprop multitask training
	- `code/multitask/T1v5_chemprop_chembl_nr1i2.py` — 5-head Chemprop with ChEMBL NR1I2
	- `code/multitask/T2_autogluon_chemeleon.py` — AutoGluon-tabular on CheMeleon
	- `code/ensemble/v43_final_blend.py` — Final blend reproducing the submission
	- `code/submit/submit_v43.py` — Gradio API submission script
	- `models/` — Trained Chemprop and AutoGluon checkpoints (v43 lineage)
	- `data/oof_predictions/` — Per-track out-of-fold and test predictions
	- `submission/v43_final.parquet` — The 513-row submission
	- `code/requirements.txt` — Pinned dependency list

	Dependencies: Python 3.12, RDKit 2026.03.1, Chemprop 2.2.3, PyTorch 2.11.0+cu130, LightGBM 4.6.0, XGBoost 2.1.4, scikit-learn 1.8.0, pandas 3.0.2

	---

	## 11. Hardware

	Training was performed on an NVIDIA DGX Spark (GB10 Blackwell, 128GB unified memory). Typical wall-clock times:
	- Chemprop pretrain (30 epochs): ~3 min
	- Chemprop 5-fold finetune: ~8 min
	- LightGBM kitchen-sink: ~15 sec
	- AutoGluon-CheMeleon: ~3.5 hours

	---

	Report finalized 2026-05-11. Best submission: v43 (RAE 0.586 LB, rank ~40/211). Phase 1 deadline: 2026-05-25.