Data Science for Computational Drug Discovery using Python (Part 2 with PyCaret)
Key Takeaways
This video demonstrates the use of PyCaret for machine learning model building in computational drug discovery, utilizing scikit-learn, XGBoost, and CatBoost algorithms, and computing molecular descriptors with RDKit. The video covers model building, comparison, tuning, interpretation, and prediction using PyCaret.
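As a rough sketch of the workflow this video covers (setup, model comparison, model creation, tuning, and hold-out prediction), the steps might be chained as shown below. The function name `run_pycaret_workflow`, the target column spelling `"logS"`, and passing 7903 as `session_id` are assumptions for illustration. Argument names and accepted values differ between PyCaret releases, so treat this as an outline to check against your installed version rather than the notebook's exact code.

```python
def run_pycaret_workflow(dataset, target="logS", seed=7903):
    """Sketch of the steps in the video: setup, compare, create, tune, predict.

    `dataset` is assumed to be a pandas DataFrame holding the molecular
    descriptors plus the target column.
    """
    # Imported lazily so this function can be defined without PyCaret installed.
    from pycaret.regression import (
        setup, compare_models, create_model, tune_model, predict_model,
    )

    # 80/20 train/hold-out split. Using session_id for the seed is an assumption
    # about how the "random state 7903" mentioned in the video was set.
    # The video also passes silent=True to setup; that argument is tied to the
    # PyCaret release used there, so it is left out of this sketch.
    setup(data=dataset, target=target, train_size=0.8, session_id=seed)

    best = compare_models()            # cross-validated comparison of regressors
    et = create_model("et")            # Extra Trees regressor
    tuned_et = tune_model(et, n_iter=50, optimize="MAE")
    holdout = predict_model(tuned_et)  # predictions on the held-out 20%
    return best, tuned_et, holdout
```

Calling `run_pycaret_workflow(dataset)` with the descriptor table loaded through `pandas.read_csv` would return the best model from the comparison, the tuned Extra Trees model, and the hold-out predictions.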
Full Transcript
A couple of weeks ago I showed you how to use the Python library called PyCaret, a low-code machine learning library in Python, and the example data set I showed you was the Iris data set. Perhaps you're wondering: what if you have your own data set and would like to use it for machine learning model building in PyCaret? The advantage of PyCaret is that it allows you to quickly prototype machine learning models, meaning that you can quickly generate models based on several algorithms. For example, whether you're doing classification or regression, you can use pretty much all of the learning algorithms available from scikit-learn and from other packages such as XGBoost or CatBoost.
In my daytime job as an associate professor of bioinformatics I also do a lot of research, and as part of that research we have to explore various machine learning algorithms. Traditionally, if we were to custom-code machine learning pipelines in Python using scikit-learn, that would take us maybe a couple of hours, but with PyCaret this can be compressed into only a couple of minutes.
In today's video I'm going to show you how to quickly generate machine learning models in PyCaret. The data set we're going to be using today is based on molecular solubility and was originally published in one of the journals of the American Chemical Society. Today we're going to reproduce that work, and instead of using only linear regression we're also going to use several machine learning algorithms provided by the automated pipeline of PyCaret. The links to both of the videos I have mentioned will be provided above and below in the description. Without further ado, we're starting right now.
Okay, so in this tutorial I'm going to split the Jupyter notebook file into two, and the first one
will be called Part 2.1 and the second one will be called Part 2.2. In Part 2.1 we're going to generate the molecular descriptor file, which was also covered in the previous video, but today we're going to do that very rapidly. If you would like an in-depth explanation of what each line of code is doing, refer to that previous video; please find the link above and below.
Okay, let's get started. We're going to install conda and the libraries, which should take a couple of minutes. As shown in the previous video, we're using the Delaney solubility data set. The original paper is provided in this link, and the original link to the data set is provided here as well, in 2.1. I have already downloaded the data set onto the Data Professor GitHub, so the links will be in here and we can read it in directly using pandas. Thanks to Boris for suggesting the use of the URL directly in the read_csv function of pandas.
Okay, let's have a look; it's currently installing. We're installing conda with Python 3.7, and we're also installing RDKit. RDKit, a cheminformatics Python library, will allow us to compute molecular descriptors. If you are curious about RDKit, this library is only available in Python, so if you're using R then you'll want to give this package a try. Perhaps that is one of the reasons for using Python for cheminformatics, and vice versa: there are packages available only for R. To each their own; some libraries are in Python and some are exclusively in R, and that is why I'm using both languages.
Okay, so let's compute the descriptors. We have to define this first, read in the data set right here, and calculate the molecular descriptors. A lengthier explanation is provided in the previous video, as mentioned. Here we're just computing the descriptors, splitting the X and Y matrices, looking at the
distribution, and then combining them back together. Then we're going to write the result out to a CSV file, and this CSV file will be provided on the Data Professor GitHub. I have shown you these steps just in case you would like to try this on your own chemical library.
Okay, so we have already created the file, and I'm going to show you the link to this file, which we are going to use for Part 2.2. It's right here: Delaney solubility with descriptors. The first one is delaney.csv; let me show you. This contains the raw data: the names of the compounds, the measured solubility, the solubility predicted in the paper using linear regression, and the SMILES notation. The SMILES notation represents the chemical information in one dimension, and the descriptor-calculating software in the RDKit library uses the SMILES notation as its input. It then generates a set of molecular descriptors, shown here: LogP, molecular weight, number of rotatable bonds, and aromatic proportion, while logS holds the experimental values. As noted in the prior video, the aromatic proportion descriptor was generated using a custom function. So this is the data set we're going to be using for Part 2.2.
All right, let's start by installing PyCaret; in a couple of seconds it will finish installing. Now let's load in the PyCaret library for regression, and then we're going to import pandas as pd. This is the link to the solubility data set I've mentioned, along with the molecular descriptors that were calculated. Let's have a brief look. It's the same data set that I previously mentioned, with five columns and 1144 rows.
All right, let's start with the model building. I have already run the cell above, so why don't I delete this, because I have already included it here. Then we're going to set up the model. Let's run it. Here we
specify the name of the data set, which is called dataset here because we read it in as a data frame named dataset, and then we specify the target, which is logS. logS is right here, the column called logS; this is the Y variable that we will be predicting, and the rest will be used as the X variables. Notice that this is a simple pandas data frame: we just read it in directly from the CSV file and use it immediately in the setup function of PyCaret. Here we're specifying the training set size to be 80 percent, and we're setting silent equal to True so that we don't see any of the messages.
Let's proceed. The subsequent blocks of code will use the training set, which is the 80%, and then at the very end of this notebook I'm going to use the model trained on the 80% to test it on the 20%. If you're new to machine learning and data science, I have created a simple visual guide on how to build a machine learning model. That article was published on Medium, in Towards Data Science, and I'm going to provide the link in the video description as well; it is a gentle introduction to the field of machine learning and data science.
Okay, let's continue and run this cell: compare models. As I mentioned previously, we're going to build models using several machine learning algorithms provided by scikit-learn, CatBoost, and XGBoost, and this is performed in only one line of code, made possible by PyCaret. Imagine if you were to code all of these 21 models manually: it would take you not a couple of minutes but at least a few hours. So this is very conveniently done for you.
All right, let's have a look. Here we see that the best model is the Extra Trees Regressor, which gives us an R² of 0.879. Let's compare that to the previous one that we have
built. Let's go to code, go to python, and then it is the cheminformatics predicting solubility notebook from the previous video. All right, let's have a look. So the R² here, this is the R² on the test set right here; the R² on the training set is 0.77, and the one obtained today is 0.879, a significant boost to the model performance. We're going to continue further with the ET regressor. Coming in second place is random forest. As noted in one of the prior podcasts, random forest is one of my favorite learning algorithms, and without using PyCaret I wouldn't have known that the Extra Trees Regressor would perform better than random forest. So this gives us a fresh perspective for trying out new algorithms as well.
Here we're going to continue to use the Extra Trees algorithm, which is the best-performing one and is abbreviated right here as et. I'm going to define et equals create_model and then 'et'. Let's run it. Here we have a performance table showing all of the performance metrics for the 10-fold cross-validation; the mean value from the 10 folds is shown here along with the standard deviation. As you can see, the R² is 0.8793, the same as above.
Let's tune the model. By tuning the model we're going to optimize the parameters, and let's see whether the performance increases. This should take a couple of moments. Here we have set the number of iterations to 50, and we're going to use the mean absolute error as the fitness function. We're seeing that the R² is increasing in some of the folds, and from the 10-fold cross-validation the performance increased slightly, from 0.8793 to 0.8854. So this is the detail from the trained model, and for reproducibility you might want to set the random state to 7903 in order for it to give you
the same results here.
Okay, now comes the fun part: let's have a look at the various plots for the models. Let's look at the residuals. We're going to use the plot_model function, and the input arguments here will be the name of the model, et, followed by the residuals plot type. Imagine creating this manually using seaborn or matplotlib; that might take you a couple of hours. Let's have a look at the prediction error plot, which is the scatter plot of the actual values against the predicted values; the Cook's distance plot, for outlier detection; and recursive feature selection, and in the background we're going to run the other ones as well.
Right here, this is the recursive feature selection. It shows that out of the four molecular descriptors, using only two features could provide an R² in excess of 0.85, and that using the remaining two descriptors slightly improves the performance of the prediction. That's something interesting to see. Let's have a look at the learning curve: the blue curve you see here is the training score and the green one is the cross-validated score. Then there is the validation curve, comparing the training score against the cross-validation score; the manifold learning plot using the four features; and the feature importance. We can see here that LogP is the most highly ranked feature, followed by molecular weight, aromatic proportion, and rotatable bonds. So the two features from the plot above, right here where the number of features is two, are probably LogP and molecular weight.
These are the hyperparameters of the model: number of estimators 100, random state 7903. And these are the hyperparameters of the tuned model: max depth 40.
The number of estimators has been changed to 280, and the random state is the same at 7903. And here is the view showing all the plots: you could click on each of these panels and it will show the plot, but some of them were not working, or maybe they're taking some time to run. So let's continue.
Model interpretation. The great thing I like about this package is the nice interpretation provided by the SHAP library. Here we're seeing the contribution of the features to the model. As shown above, the feature importance plot generally tells us only which features were the most important. For example, we could see that MolLogP provided the highest variable importance, followed by MolWt, aromatic proportion, and number of rotatable bonds. What we're not seeing is how they are contributing to the model. For example, if we have two classes, class A and class B, active compounds and inactive compounds, we could see that LogP is contributing the most, but is it contributing the most to the active compounds or to the inactive compounds? With the SHAP package we're going to see the contribution of each feature by looking at the SHAP value here, whether it is tending toward the negative or toward the positive.
Let's have a look further. This is the correlation plot. And let's have a look here at the reason plot at the observation level. This is called the reason plot by PyCaret, and it is better known as the force plot, as it is termed by the SHAP library. The plot essentially describes the push and pull effect that each individual feature used to build the model has on the base value of the prediction. As for the base value of the prediction, let's think of it as kind of like
the y-intercept. The y-intercept could be thought of as the base value: for example, if your equation is y = 5x + 5, the base value will be 5, and the coefficient in the 5x term will be the feature importance. For a simple linear regression there's no problem in interpreting the direction of each feature's effect on the model's prediction, whether positive or negative: we can look at the coefficient values, at whether they are positive or negative and whether they have high or low magnitude. High magnitude means a larger coefficient value, and low magnitude means a smaller one. The force plot beautifully shows you that in this plot.
The base value here is -6.72, and we can see that all of the descriptors here are making the value lower. Different models will use different features in different ways. Here we see that the four features are pushing the base value lower, so they are having a negative effect on the predicted output value. For this particular model, it is showing that all four descriptors have a negative effect on the output value, pushing it lower than the base value of -6.72. For another data set using other algorithms, it might be the case that some descriptors push it higher and some push it lower.
Okay, so that's all for the testing on the 80% subset. Now we're going to use the model trained on the 80% to make a prediction on the hold-out, or left-out, 20% subset. Let's do that using the predict_model function. We see that the performance on the test set, the 20%, is 0.8671. Let's have a look at some of the predicted output here. The Label column is the predicted value, and logS holds the experimental values. So the predictions are pretty good,
right? The actual value is -5.47, predicted to be -5.08; and -2.18, predicted to be -1.9772.
Okay, so if you're finding value in this video, please give it a thumbs up, subscribe if you haven't yet done so, and hit the notification bell in order to be notified of the next video. As always, the best way to learn data science is to do data science, and please enjoy the journey. Thank you for watching; please like, subscribe, and share, and I'll see you in the next one. In the meantime, please check out these videos.
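The force plot discussion in the transcript, where a base value is pushed higher or lower by each feature, can be made concrete with a linear model. The sketch below is only an illustration: the intercept, coefficients, and descriptor values are hypothetical numbers, not results from the Delaney data. For a linear model whose features are treated as independent, each contribution has the closed form coefficient × (feature value − mean feature value), and the base value is the average prediction over a background set. Tree models such as Extra Trees are explained by a different SHAP algorithm, but the contributions still add up from the base value to the prediction.

```python
import math
import statistics

# Hypothetical linear model (illustrative numbers, not fitted to the Delaney data):
# prediction = INTERCEPT + sum(coefficient * descriptor value)
INTERCEPT = 0.25
COEFS = {"MolLogP": -0.75, "MolWt": -0.005}

# Hypothetical background molecules used to define the base value.
BACKGROUND = [
    {"MolLogP": 1.0, "MolWt": 100.0},
    {"MolLogP": 3.0, "MolWt": 200.0},
    {"MolLogP": 2.0, "MolWt": 300.0},
]


def predict(row):
    """Evaluate the linear model for one molecule."""
    return INTERCEPT + sum(COEFS[name] * row[name] for name in COEFS)


# Base value: the average prediction over the background molecules.
base_value = statistics.fmean(predict(row) for row in BACKGROUND)

# Mean of each descriptor over the background molecules.
feature_means = {
    name: statistics.fmean(row[name] for row in BACKGROUND) for name in COEFS
}


def contributions(row):
    """Per-feature push away from the base value for this linear model."""
    return {name: COEFS[name] * (row[name] - feature_means[name]) for name in COEFS}


molecule = {"MolLogP": 4.0, "MolWt": 260.0}  # hypothetical query molecule
pushes = contributions(molecule)
prediction = predict(molecule)

# Additivity: the base value plus all pushes reproduces the prediction.
assert math.isclose(base_value + sum(pushes.values()), prediction)

for name, push in pushes.items():
    direction = "lower" if push < 0 else "higher"
    print(f"{name}: pushes the prediction {direction} by {abs(push):.3f}")
print(f"base value {base_value:.3f} -> prediction {prediction:.3f}")
```

In this toy case both descriptors push the prediction below the base value, which mirrors the situation described for the force plot. The y-intercept analogy from the transcript holds exactly only when every feature has a mean of zero, since the base value is the average prediction and not the intercept itself.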
Original Description
In Part 1 (https://youtu.be/VXFFHHoE1wk), I showed you step by step, in an end-to-end Bioinformatics / Cheminformatics tutorial, how to use Data Science in a Computational Drug Discovery project, as we reproduced the research work of Delaney by predicting the solubility of molecules in Python using the scikit-learn, rdkit and pandas libraries.
This video is Part 2, where I will show you how to use the same dataset (the molecular solubility dataset) with the PyCaret Python library to generate several machine learning models in a few simple steps.
🌟 Buy me a coffee: https://www.buymeacoffee.com/dataprofessor
⭕ Code:
✅ Part 1: https://github.com/dataprofessor/code/blob/master/python/cheminformatics_predicting_solubility.ipynb
✅ Part 2.1: https://github.com/dataprofessor/code/blob/master/python/cheminformatics_predicting_solubility_2_1_PyCaret.ipynb
✅ Part 2.2: https://github.com/dataprofessor/code/blob/master/python/cheminformatics_predicting_solubility_2_2_PyCaret.ipynb
📚Delaney's ORIGINAL ARTICLE entitled
"ESOL: Estimating Aqueous Solubility Directly from Molecular Structure"
https://pubs.acs.org/doi/10.1021/ci034243x
📚Read my EDITORIAL ARTICLE entitled
"Maximizing computational tools for successful drug discovery"
https://www.tandfonline.com/doi/full/10.1517/17460441.2015.1016497
⭕ Playlist:
Check out our other videos in the following playlists.
✅ Data Science 101: https://bit.ly/dataprofessor-ds101
✅ Data Science YouTuber Podcast: https://bit.ly/datascience-youtuber-podcast
✅ Data Science Virtual Internship: https://bit.ly/dataprofessor-internship
✅ Bioinformatics: http://bit.ly/dataprofessor-bioinformatics
✅ Data Science Toolbox: https://bit.ly/dataprofessor-datasciencetoolbox
✅ Streamlit (Web App in Python): https://bit.ly/dataprofessor-streamlit
✅ Shiny (Web App in R): https://bit.ly/dataprofessor-shiny
✅ Google Colab Tips and Tricks: https://bit.ly/dataprofessor-google-colab
✅ Pandas Tips and Tricks: https://bit.ly/dataprofessor-pandas
✅ Python Data Scie