---
license: apache-2.0
language:
- en
tags:
- chemistry
- drug-discovery
- admet
- multitask-learning
- openadmet
---

This is our baseline Caco-2 permeability/LogD/PPB model. It is a **multitask CheMeleon model** trained to predict the following endpoints:

- Caco-2 Permeability Papp A->B
- Caco-2 Permeability Papp B->A
- LogD
- MPPB
- HPPB

> Check out comparative performance on the ExpansionRx dataset here: https://openadmet.ghost.io/zero-shot-expansiorx-admet-predictions/

> **Update Notice**: We have released an updated version of the model (**v2**) featuring enhanced curation of the ChEMBL training set. Key improvements include the exclusion of censored data (non-explicit modifiers like `<` or `>`) to ensure high-fidelity regression and optimized outlier filtering. The legacy version remains accessible via the commit history.

## Pre-requisites

We *highly* recommend you have the Anvil framework from `openadmet-models` installed in an environment (called `openadmet-models`) for ease of use and full utilization of OpenADMET's models.

For full documentation, visit our website [here](https://docs.openadmet.org/en/latest/).

If you'd like to see some more examples on how to use Anvil, see our demos [here](https://demos.openadmet.org/en/latest/).

### Installation of `openadmet-models`

#### With conda

You can install openadmet-models via our GitHub package. If you want the latest development version, clone the repository and install in editable mode:

```
git clone git@github.com:OpenADMET/openadmet-models.git
```

Set up an environment using the provided files in devtools/conda-envs.

```
cd openadmet-models/
conda env create -f devtools/conda-envs/openadmet-models.yaml
conda activate openadmet-models
pip install -e .
```

If you want to use GPU acceleration, ensure you have the appropriate CUDA toolkit installed and use the openadmet-models-gpu.yaml file instead:

```
conda env create -f devtools/conda-envs/openadmet-models-gpu.yaml
conda activate openadmet-models
pip install -e .
```

#### With Docker

Alternatively, you can use Docker to spin up a containerized, pre-installed environment to run `openadmet-models`. Just be sure you are mounting the correct folder (`./permeability-logd-ppb-chemeleon-baseline`) where you've downloaded the model.

If you're using a GPU, run:

```
docker run -it --user=root --rm \
  -v ./permeability-logd-ppb-chemeleon-baseline:/home/mambauser/model:rw \
  --runtime=nvidia \
  --gpus all \
  ghcr.io/openadmet/openadmet-models:main
```

Otherwise, for CPU only:

```
docker run -it --user=root --rm \
  -v ./permeability-logd-ppb-chemeleon-baseline:/home/mambauser/model:rw \
  ghcr.io/openadmet/openadmet-models:main
```

**IMPORTANT NOTE** You will also need `git lfs` installed.

## Downloading the model

1. After installing Anvil, clone the model repo:
```
git clone https://huggingface.co/openadmet/permeability-logd-ppb-chemeleon-baseline/
```
2. Change to the repo directory. Ensure you have `git lfs` installed for the repo and get the large model files:
```
git lfs install
git lfs pull
```
3. You are now ready to use the model!

## Using the model

**IMPORTANT NOTE:** This model predicts \\(\log_{10}(P_{app})\\) values, with \\(P_{app}\\) in \\(\text{cm/s}\\). To get \\(P_{app}\\) values in \\(10^{-6} \text{cm/s}\\), simply back-transform:

$$
P_{app} = 10^{\hat{y}} \times 10^{6}
$$

where \\(\hat{y}\\) is our model prediction.

For the protein binding endpoints, the model predicts \\(\log_{10}(\% \text{unbound})\\), so \\(\% \text{unbound} = 10^{\hat{y}}\\). To convert between % unbound and % bound, subtract from 100: \\(\% \text{bound} = 100 - \% \text{unbound}\\).

We will use this model for inference, to predict endpoint values for a set of compounds the model has not seen.
For demonstration purposes, we will be using a small-molecule set from our recent [OpenADMET-ExpansionRx challenge](https://huggingface.co/spaces/openadmet/OpenADMET-ExpansionRx-Challenge), provided in the file `expansion_data_inference.csv`.

You can do this either **inside the Docker container**, as per the instructions above, or, if you have installed openadmet-models on your own computer, in the appropriate environment.

The generic command to run our inference pipeline is:

```bash
openadmet predict \
    --input-path <path to input CSV> \
    --input-col <name of the SMILES column> \
    --model-dir <path to the model directory> \
    --output-csv <path to output CSV> \
    --accelerator <accelerator, e.g. cpu>
```

You can run this directly in your command line, OR you can use the bash script we've provided, `run_model_inference.sh`.

For our working example, this command becomes:

```bash
openadmet predict \
    --input-path expansion_data_inference.csv \
    --input-col SMILES \
    --model-dir anvil_training/ \
    --output-csv predictions.csv \
    --accelerator cpu
```

You can easily substitute your own set of compounds: simply modify the `--input-path` and `--input-col` arguments for your specific dataset.

In our example, this outputs a file called `predictions.csv`, which includes endpoint-specific prediction columns (as `OADMET_PRED_chemprop_{}`) for:

- `caco2_atob_LogPapp`
- `caco2_btoa_LogPapp`
- `logD`
- `mppb_LogUnbound`
- `hppb_LogUnbound`

In this case, the `OADMET_STD_chemprop_{}` columns are empty because uncertainty can only be estimated when running inference on an ensemble of models. See how to set this option [here](https://demos.openadmet.org/en/latest/demos/04_Ensemble_Model_Training/04_Ensemble_Model_Training_Active_Learning.html).

**IMPORTANT NOTE** If you'd like other examples of how to use our Anvil framework, check out our demos [here](https://demos.openadmet.org/en/latest/).
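
The back-transformations described in "Using the model" can be applied to the prediction file with a few lines of Python. The following is a minimal sketch, not part of `openadmet-models`: it assumes the prediction columns are named by filling `OADMET_PRED_chemprop_{}` with the endpoint names listed above, and the helper names and output column names (`papp_1e6_cm_s`, `percent_bound`, `back_transform_row`, `back_transform_csv`, `*_Papp_1e-6_cm_s`, `*_pct_bound`) are hypothetical. It uses only the standard library.

```python
import csv

# Assumption: prediction columns are "OADMET_PRED_chemprop_" + endpoint name.
PRED_PREFIX = "OADMET_PRED_chemprop_"


def papp_1e6_cm_s(log10_papp: float) -> float:
    """Convert predicted log10(Papp in cm/s) to Papp in units of 10^-6 cm/s."""
    return 10.0 ** log10_papp * 1e6


def percent_bound(log10_pct_unbound: float) -> float:
    """Convert predicted log10(% unbound) to % bound (100 - % unbound)."""
    return 100.0 - 10.0 ** log10_pct_unbound


def back_transform_row(row: dict) -> dict:
    """Return a copy of one prediction row with back-transformed columns added.

    Columns that are missing or empty in the row are skipped.
    """
    out = dict(row)
    for name in ("caco2_atob", "caco2_btoa"):
        value = row.get(f"{PRED_PREFIX}{name}_LogPapp")
        if value not in (None, ""):
            out[f"{name}_Papp_1e-6_cm_s"] = papp_1e6_cm_s(float(value))
    for name in ("mppb", "hppb"):
        value = row.get(f"{PRED_PREFIX}{name}_LogUnbound")
        if value not in (None, ""):
            out[f"{name}_pct_bound"] = percent_bound(float(value))
    return out


def back_transform_csv(in_path: str, out_path: str) -> None:
    """Read a predictions CSV and write a copy with back-transformed columns."""
    with open(in_path, newline="") as handle:
        rows = [back_transform_row(row) for row in csv.DictReader(handle)]
    if not rows:
        return
    # Collect column names in first-seen order across all rows.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
```

For the working example, calling `back_transform_csv("predictions.csv", "predictions_backtransformed.csv")` would write a second file alongside the original; the output filename here is only an illustration. LogD predictions need no back-transformation and are passed through unchanged.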