Machine Learning in R: Repurpose Machine Learning Code for New Data

Data Professor · Beginner ·📰 AI News & Updates ·6y ago

Key Takeaways

The video demonstrates how to repurpose machine learning code in R for new data, using the iris and dhfr datasets, and covering data preprocessing, analysis, and visualization, as well as model training and evaluation using cross-validation and feature importance.

Full Transcript

Welcome back to the Data Professor YouTube channel. My name is Chanin Nantasenamat and I'm an associate professor of bioinformatics. On this channel we provide tutorials and concepts about data science, so if you're into this kind of content, please consider subscribing.

In this video we're going to talk about how you can modify or adapt the code that we have been talking about in the previous two episodes to your own data set. So let's have a look at some of the examples that are already provided in the datasets library in R. Fire up your RStudio or RStudio Cloud and open up the iris-data-understanding file. For this file, let's create a copy: just copy everything, create a new file, paste the content, and save it as dhfr-data-understanding.R. DHFR is a drug discovery data set, so let's have a look at that. Okay, so let's close the iris data set.

What you want to do is load in the datasets library of R, and instead of calling data on iris, let's have a look at dhfr. But before doing that, let me show you: if you type in data and then an open parenthesis, you will notice that there are a lot of options to choose from, so there are a lot of data sets that you can play around with. Let's have a look at what they have here: the German credit data, the Sacramento data, and so on. So yeah, there are a lot of data sets that you can play around with: barley, environmental, ethanol, melanoma. If you're starting out, I would recommend that you play around with these data sets so that you can get a glimpse of the various types of data science tasks that you could do, because the data sets are quite diverse. They are very heterogeneous in that some might be numerical and some might be qualitative, so you have a lot to play with.

Let's say that you want to create a data mining model that will predict whether a compound will be a good drug or a bad drug. So let's load in the dhfr data set. What you want to do is type in data(dhfr) and then hit Ctrl+Enter. We're going to skip this one, so we'll comment it out. The second method could be like this: you could type dhfr <- datasets::dhfr. And if your data set is available as a CSV and you want to download it from your GitHub, you can use the following lines and modify the name accordingly. Because we're not going to use it, we'll comment it out using the hash sign.

Okay, so let's clear the environment from the previous session and load the data set again. Now you see that the dhfr data frame appears in the environment, and you will see that there are a total of 325 observations and 229 variables. Let's have a look inside: Ctrl+Enter on the line saying View(dhfr). Let me save that first, and then you will see the data frame here.

This dhfr data set comprises 325 compounds. Each compound represents a drug molecule, and each drug molecule is described by 228 molecular descriptors, which describe the physical and chemical properties of the drug molecule. Y is the biological activity of the drug molecule, whether it is active or inactive. Active means that the drug molecule is potent against the DHFR protein, while inactive means that the drug molecule is not efficient in binding or in exerting the desirable binding effect towards the DHFR protein. So what is the DHFR protein? It is the dihydrofolate reductase enzyme, and this enzyme is very important for anti-malarial activity. So Y is the property that we are going to predict, whereas there are 228 descriptors describing the drug molecule.

Let's do some summary statistics. We replace iris with dhfr. Let's have a look at the head: we're going to see the first few rows, the first few data objects, and we see all of the column variables here. There are 229, meaning that there are 228 molecular descriptors, which are the independent variables, while one of the variables is the biological activity of the drug molecule, which is either inactive or active. How do I know that? If you type in dhfr, dollar sign, and then Y (the dollar sign allows you to select the desired variable), you see that there are both active and inactive values for the 325 objects. The tail works in a similar fashion, but you see the last few data objects of the data frame.

summary(dhfr) will allow you to see the summary table, so you see the minimum, the first quartile, the median and the mean values of the data set. You can also have a look at specific columns: when you type in summary(dhfr$Y), you see that there are 203 active molecules and 122 inactive molecules. So you notice that there is an imbalance in the data set. In a future video I will talk about how you can handle an imbalanced data set, because this will influence the prediction performance and also the reliability or confidence that we have in the prediction results when one class is greater than the other.

Let's check if there are any missing values. There are none, of course, because these data sets have already been curated; they are example data sets that are commonly used when you are starting out in data science. So let's load in the skimr library, and then you just replace iris with dhfr and you get to see the summary stats in more detail. For each variable you have the various quartiles and the histogram distribution, so you see at a glance the mean, the standard deviation, whether there are any missing values, and the distribution. That will come in handy.

What if you want to see the data according to the group of active and inactive drug molecules? Then for Species we're going to use Y, the biological activity, so we change iris to dhfr and the group_by is going to be Y, because Y is the biological activity that we want to predict. Let's Ctrl+Enter that. We will see summary statistics of the 228 variables for the actives and for the inactives separately. If you examine each of these variables, they come in pairs: you will notice that each name appears in duplicate, because one is for active and one is for inactive. Then you see the mean and the standard deviation. For the active molecules the value of moe2D_zagreb is positive, 0.195, while for the inactive molecules it is negative, minus 0.181. So you can even do a t-test to evaluate the significance of the differences between these two groups.

Let's have a quick visualization of the data set. I don't think plot will work because there are 228 variables, but let's give it a try anyway. Right, it's too big: 228 variables are too many to fit into this plot, or probably too computationally demanding. So plot doesn't work here; let's just comment it out. You're probably going to have to visualize the data set manually. So what about we select some descriptors here: moe2D_zagreb, and then dhfr, dollar sign, and another nearby descriptor. Ctrl+Enter. Here's the scatter plot of these two molecular descriptors, and there seems to be quite a high correlation between them. You can modify this and add colors to it, making it red. Let's make a copy with dhfr$Y and have a look at what happens here. This allows us to see the values in the scatter plot for the active versus the inactive compounds, so it's colored by Y. And if you want to label it, the x-axis is zagreb and the y-axis is weinerPol. Okay, so the labels change now.

Let's have a look at the histogram: we go for dhfr$moe2D_zagreb and Ctrl+Enter. Then change this, copy and paste. Let's see if this works: dhfr, dhfr, and then Species becomes Y. We're going to have a look at the first four variables. Okay, that didn't work, so let's see what's wrong here. I think I know: we should change 1:4 to 2:5, because the first column is the Y variable. Now it works. So now we see the feature plot for the active versus the inactive drug molecules for the first four molecular descriptors. Because the first position is occupied by the Y variable, which is the biological activity, we show the second to the fifth column, which represent the first four variables. We could try more if you like: let's go for another four, or have eight. I'm not sure how many this will be able to handle. Let's go with another four, and another four. It's missing two more, so let's make it 2 to 21. There you go: we have a five-by-four plot. Very cool. You can click on the Zoom button to see a bigger preview.

Here we can quickly see that for some variables there are no significant differences between the active and the inactive drug molecules, like for this FCharge. Both have pretty much no data, so the variance of the active and inactive molecules is not different: there is no variation in the data distribution because the values are almost all zero. For the inactives there might be some molecules carrying some information, whereas the rest carry no information at all and probably have a value of zero. So this variable we would probably have to cut out. For the other ones, we can see differences between the active and the inactive molecules. This concludes the first script on data understanding, so let's continue to the next one.

Let's now proceed to the iris-classification file. We copy all of the code and move it into a new file. Let's clear the memory again, load in the datasets library, load in the caret package and load in the dhfr data set. For this you might need to use RStudio Cloud, because I checked previously that my RStudio did not have the dhfr data; it might just be a version thing. So I'll tell you what: I'll create a CSV file of dhfr and upload it to the Data Professor GitHub, and I will share the links down below, so please check the description. Then you check for any missing data.

Now we set the seed. For the data splitting, we change iris to dhfr and Species to Y. Let's type in dhfr here and dhfr here, and then run it: Ctrl+Enter, Ctrl+Enter. So 261 objects and 64 objects add up to 325. We're going to skip this part, and let's go ahead and save this as dhfr-classification.R. For the model, which is assigned from train, Species has to be Y and everything else stays the same. For the CV model, Species becomes Y and everything else stays the same. So let's run this with Ctrl+Enter, and then let's build the cross-validation model; for this I had to change Species to Y.

Now let's have a look at the prediction results. The confusion matrix for the training model is shown here. The accuracy was 0.9923, and in the confusion matrix we can see that most of the molecules, 162, were accurately predicted to be active, whereas one of them was mispredicted to be inactive. Out of the 98 inactives, one was mispredicted while 97 were correctly predicted to be inactive. The sensitivity is 0.9939 and the specificity is 0.98. So that's for the training set.

For the testing set, the accuracy is 0.9219, and the confusion matrix reveals that out of the 41 active molecules, 3 were mispredicted to be inactive whereas 38 were correctly predicted; for the 23 inactive molecules, 2 were mispredicted to be active and 21 were correctly predicted to be inactive. The sensitivity is 0.95 and the specificity is 0.875.

Let's go to the cross-validation. The accuracy of the cross-validation is 0.9923. Out of 163 active drug molecules, one was mispredicted whereas 162 were correctly predicted to be active, and out of the 98 inactive molecules, one was mispredicted to be active and 97 were correctly predicted to be inactive. The sensitivity is 0.9939 while the specificity is 0.9898.

Now let's have a look at the feature importance, using the variable importance function. Let's run this with Ctrl+Enter and then have a look at the importance plot. There are so many descriptors, 228 of them, that it's very difficult to see anything even if we click on the Zoom button. So what you probably want to do is visualize, let's say, the top 25. We add top = 25 to the plot function, and here you go: you see the top 25 descriptors ranked according to their importance.

Okay, so that's all for today. Thank you for watching. Please like, subscribe and share, and I'll see you in the next one. But in the meantime, please check out these videos.
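The test-set figures quoted in the transcript can be checked by hand. The sketch below is a reconstruction under one stated assumption: the four spoken counts are arranged as a table with predictions in rows and reference (true) classes in columns. That layout is not confirmed by the video, but it is the one under which the quoted accuracy (0.9219), sensitivity (0.95) and specificity (0.875) all come out. The variable names are illustrative.

```r
# Test-set counts as read out in the transcript, arranged with
# predictions in rows and reference (true) classes in columns.
# ASSUMPTION: this layout is inferred, because it is the one that
# reproduces the sensitivity and specificity quoted in the video.
cm <- matrix(c(38, 2,    # column 1: truly active
               3, 21),   # column 2: truly inactive
             nrow = 2,
             dimnames = list(Prediction = c("active", "inactive"),
                             Reference  = c("active", "inactive")))

# Accuracy: correct predictions (the diagonal) over all test molecules.
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: correctly predicted actives over all truly active molecules.
sensitivity <- cm["active", "active"] / sum(cm[, "active"])

# Specificity: correctly predicted inactives over all truly inactive molecules.
specificity <- cm["inactive", "inactive"] / sum(cm[, "inactive"])

print(cm)
print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 4))
```

Read this way, 41 molecules were predicted active (38 of them correctly) and 23 were predicted inactive (21 of them correctly), while the test set itself holds 40 actives and 24 inactives.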

Original Description

After watching the recent R tutorial videos on this channel, you might be wondering how you can apply or adapt the R code to your own data. In this video, I will show you how you can repurpose the R code from previous videos and apply it to model a new dataset.

🌟 Buy me a coffee: https://www.buymeacoffee.com/dataprofessor

⭕ Data and Code:
https://raw.githubusercontent.com/dataprofessor/data/master/dhfr.csv
https://github.com/dataprofessor/code/tree/master/dhfr

⭕ Timeline
0:36 Launch RStudio or RStudio.cloud
0:43 Open iris-data-understanding.R file
0:48 Create a copy of iris-data-understanding.R
1:01 Save as dhfr-data-understanding.R
1:09 What is DHFR?
2:37 Load in DHFR data, type: library(datasets) and then data(dhfr)
5:00 Perform summary statistics
7:28 Use skimr package to explore the data
10:06 Make a scatter plot
11:55 Make a histogram
12:23 Make feature plots
15:26 Let's build the DHFR classification model
15:49 Load in the libraries
16:38 Set the seed for reproducibility
17:27 Build the training and CV models
18:05 Let's look at prediction results
19:51 Let's make Feature importance plots

⭕ Playlist: Check out our other videos in the following playlists.
✅ Data Science 101: https://bit.ly/dataprofessor-ds101
✅ Data Science YouTuber Podcast: https://bit.ly/datascience-youtuber-podcast
✅ Data Science Virtual Internship: https://bit.ly/dataprofessor-internship
✅ Bioinformatics: http://bit.ly/dataprofessor-bioinformatics
✅ Data Science Toolbox: https://bit.ly/dataprofessor-datasciencetoolbox
✅ Streamlit (Web App in Python): https://bit.ly/dataprofessor-streamlit
✅ Shiny (Web App in R): https://bit.ly/dataprofessor-shiny
✅ Google Colab Tips and Tricks: https://bit.ly/dataprofessor-google-colab
✅ Pandas Tips and Tricks: https://bit.ly/dataprofessor-pandas
✅ Python Data Science Project: https://bit.ly/dataprofessor-python-ds
✅ R Data Science Project: https://bit.ly/dataprofessor-r-ds

⭕ Subscribe: If you're new here, it would mean the world to me if you would con

Playlist

Uploads from Data Professor · Data Professor · 22 of 60

1 How a Biologist became a Data Scientist
2 WEKA Tutorial #1.1 - How to Build a Data Mining Model from Scratch
3 WEKA Tutorial #1.2 - How to Build a Data Mining Model from Scratch
4 WEKA Tutorial #1.3 - How to Build a Data Mining Model from Scratch
5 Computational Drug Discovery: Machine Learning for Making Sense of Big Data in Drug Discovery
6 Quotes #1 on Big Data and Data Science
7 Quotes #2 on Big Data and Data Science
8 Quotes #3 on Big Data and Data Science
9 Quotes #4 on Big Data and Data Science
10 Quotes #5 on Big Data and Data Science
11 Data Science 101: Starting a Data Science / Data Mining Project
12 Data Science 101: CRISP-DM - Data Mining / Data Science in 6 Steps
13 R Programming 101: How to Define Variables
14 R Programming 101: Read and Write CSV files
15 Data Science 101: Basic Command-Line for Data Science
16 Strategies for Learning Data Science in 2020 (Data Science 101)
17 Building your Data Science Portfolio with GitHub (Data Science 101)
18 R Programming 101: Setting up R programming environment (R, RStudio and RStudio.cloud)
19 Exploratory Data Analysis in R: Towards Data Understanding
20 Exploratory Data Analysis in R: Quick Dive into Data Visualization
21 Machine Learning in R: Building a Classification Model
22 Machine Learning in R: Repurpose Machine Learning Code for New Data
23 Data Science 101: Deploying your Machine Learning Model
24 Machine Learning in R: Deploy Machine Learning Model using RDS
25 Data Pre-processing in R: Handling Missing Data
26 Machine Learning in R: Speed up Model Building with Parallel Computing
27 Data Science 101: Overview of Machine Learning Model Building Process
28 Web Apps in R: Building your First Web Application in R | Shiny Tutorial Ep 1
29 Web Apps in R: Build Interactive Histogram Web Application in R | Shiny Tutorial Ep 2
30 Web Apps in R: Building Data-Driven Web Application in R | Shiny Tutorial Ep 3
31 Web Apps in R: Building the Machine Learning Web Application in R | Shiny Tutorial Ep 4
32 Web Apps in R: Build BMI Calculator web application in R for health monitoring | Shiny Tutorial Ep 5
33 Machine Learning in R: Building a Linear Regression Model
34 What programming language to learn for Data Science? R versus Python
35 How to Become a Data Scientist (Learning Path and Skill Sets Needed)
36 Using Python in R
37 Interpretable Machine Learning Models
38 Making Scatter Plots in R [Data Visualisation in R series]
39 Machine Learning in Python: Building a Classification Model
40 Compare Machine Learning Classifiers in Python
41 Hyperparameter Tuning of Machine Learning Model in Python
42 Practical Introduction to Google Colab for Data Science
43 File Handling in Google Colab for Data Science
44 Pandas for Data Science: Create and Combine DataFrames / Rename Columns
45 Machine Learning in Python: Building a Linear Regression Model
46 Machine Learning in Python: Principal Component Analysis (PCA) for Handling High-Dimensional Data
47 How to Plot an ROC Curve in Python | Machine Learning in Python
48 Installing conda on Google Colab for Data Science
49 Use native R on Google Colab for Data Science
50 How to Save and Download files from Google Colab
51 Easy Web Scraping in Python using Pandas for Data Science
52 Data Science for Computational Drug Discovery using Python (Part 1)
53 Pandas Profiling for Data Science (Quick and Easy Exploratory Data Analysis)
54 Exploratory Data Analysis in Python using pandas
55 Quick tour of PyCaret (a low-code machine learning library in Python)
56 How to Upload Files to Google Colab
57 How to Install and Use Pandas Profiling on Google Colab
58 How to Adjust the Style of Pandas DataFrame
59 How to use Bamboolib for Data Wrangling in Data Science
60 How to use Pandas Profiling on Kaggle

This video teaches how to adapt machine learning code in R to new data, covering data preprocessing, analysis, and visualization, as well as model training and evaluation. By following the steps, viewers can learn how to apply machine learning concepts to their own data.
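The repurposing idea can be made explicit by turning the data set and the class column into function arguments. The following is a minimal base-R sketch, not the code from the video: the helper names understand_data and stratified_split are hypothetical, the seed value is arbitrary, and the base-R split is a stand-in for the caret-based split used in the video. It is demonstrated on iris, which ships with R; the dhfr call is shown only as a comment, using the CSV URL from the video description.

```r
# Generic "data understanding" helper: the data frame and the name of the
# class column are parameters, so nothing is hard-coded to iris.
# (understand_data is a hypothetical name, not from the video.)
understand_data <- function(df, target) {
  stopifnot(is.data.frame(df), target %in% names(df))
  list(
    n_rows       = nrow(df),
    n_cols       = ncol(df),
    class_counts = table(df[[target]]),
    n_missing    = sum(is.na(df))
  )
}

# Stratified split in base R (a stand-in for the caret-based split in the
# video): sample a fraction p of the row indices within each class.
# The default seed is arbitrary.
stratified_split <- function(df, target, p = 0.8, seed = 100) {
  set.seed(seed)
  idx_by_class <- split(seq_len(nrow(df)), df[[target]])
  train_idx <- unlist(lapply(idx_by_class, function(i) {
    i[sample.int(length(i), ceiling(p * length(i)))]
  }), use.names = FALSE)
  list(train = df[train_idx, , drop = FALSE],
       test  = df[-train_idx, , drop = FALSE])
}

# Demonstrated on iris.
iris_info  <- understand_data(iris, target = "Species")
iris_parts <- stratified_split(iris, target = "Species")
print(iris_info$class_counts)
print(c(train = nrow(iris_parts$train), test = nrow(iris_parts$test)))

# Repurposing for dhfr only changes the arguments (not run here; the URL
# is the one given in the video description):
# dhfr <- read.csv("https://raw.githubusercontent.com/dataprofessor/data/master/dhfr.csv",
#                  stringsAsFactors = TRUE)
# dhfr_info  <- understand_data(dhfr, target = "Y")
# dhfr_parts <- stratified_split(dhfr, target = "Y")
```

With p = 0.8 and class sizes of 203 and 122, rounding up within each class gives 163 + 98 = 261 training rows, which agrees with the 261/64 split reported in the transcript.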

Key Takeaways
  1. Start from the iris scripts of the previous episodes (iris comes from R's datasets library)
  2. Load the dhfr dataset with data(dhfr)
  3. Create a copy of iris-data-understanding.R and save it as dhfr-data-understanding.R
  4. View the dhfr data frame: 325 compounds and 229 variables (228 descriptors plus the class Y)
  5. Look at the first rows of the dhfr dataset with head
  6. Replace iris with dhfr, and Species with Y, throughout the code
  7. Perform summary statistics with summary and check for missing values
  8. Load the skimr library for more detailed summary statistics
  9. Group data by biological activity (active/inactive)
  10. Use a t-test to evaluate significant differences between the groups
💡 The video highlights the importance of adapting machine learning code to new data, and demonstrates how to do so using R and various libraries and tools.
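The group comparison in takeaways 9 and 10 is only mentioned in the video, not carried out. A minimal sketch of what it could look like in base R follows; the helper name compare_groups is hypothetical, and the dhfr column name moe2D_zagreb is taken from the transcript and should be checked against names(dhfr) before use. The demonstration runs on two iris species, since iris ships with R.

```r
# Welch two-sample t-test of one numeric feature between two classes.
# (compare_groups is a hypothetical helper, not from the video.)
compare_groups <- function(df, feature, target) {
  groups <- droplevels(as.factor(df[[target]]))
  x <- df[[feature]]
  stopifnot(nlevels(groups) == 2, is.numeric(x))
  first  <- x[groups == levels(groups)[1]]
  second <- x[groups == levels(groups)[2]]
  result <- t.test(first, second)  # unequal variances by default
  list(means   = tapply(x, groups, mean),
       p_value = result$p.value)
}

# Demonstration on two iris species (iris has three classes, so one is dropped).
two_species <- subset(iris, Species != "setosa")
out <- compare_groups(two_species, feature = "Sepal.Length", target = "Species")
print(out$means)
print(out$p_value)

# For dhfr (not run here) the call would look like this, assuming the
# descriptor is named as heard in the transcript:
# compare_groups(dhfr, feature = "moe2D_zagreb", target = "Y")
```

A small p-value suggests the two class means differ; with 228 descriptors, running this test on every column would call for a multiple-testing correction.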

Chapters (17)

0:36 Launch RStudio or RStudio.cloud
0:43 Open iris-data-understanding.R file
0:48 Create a copy of iris-data-understanding.R
1:01 Save as dhfr-data-understanding.R
1:09 What is DHFR?
2:37 Load in DHFR data, type: library(datasets) and then data(dhfr)
5:00 Perform summary statistics
7:28 Use skimr package to explore the data
10:06 Make a scatter plot
11:55 Make a histogram
12:23 Make feature plots
15:26 Let's build the DHFR classification model
15:49 Load in the libraries
16:38 Set the seed for reproducibility
17:27 Build the training and CV models
18:05 Let's look at prediction results
19:51 Let's make Feature importance plots
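
In the feature-plot step, the transcript describes the first attempt failing because columns 1 to 4 included the class column Y, and it notes that a descriptor with almost all zero values would probably be cut. Both points can be handled by name rather than by position. The sketch below is an illustration under stated assumptions: the helper names feature_names and constant_features are hypothetical, and zero_feature is an artificial column added to a copy of iris only to trigger the check.

```r
# Select feature columns by name instead of by position, so the class
# column is excluded wherever it sits in the data frame.
# (feature_names and constant_features are hypothetical helpers.)
feature_names <- function(df, target) {
  setdiff(names(df), target)
}

# Flag features that take a single value in every row; such columns carry
# no information for separating the classes.
constant_features <- function(df, target) {
  feats <- feature_names(df, target)
  is_constant <- vapply(df[feats],
                        function(x) length(unique(x)) == 1,
                        logical(1))
  feats[is_constant]
}

# Demonstration on iris plus one artificial all-zero column
# (zero_feature is hypothetical, added only for illustration).
demo <- iris
demo$zero_feature <- 0

print(feature_names(iris, "Species"))
print(constant_features(demo, "Species"))

# For dhfr (not run here), the first 20 descriptors could be picked with:
# head(feature_names(dhfr, "Y"), 20)
```

Note that this check only catches exactly constant columns; a descriptor that is almost all zeros, as described in the transcript, would need a looser near-zero-variance criterion.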