Exploratory Data Analysis in R: Towards Data Understanding
Key Takeaways
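The video demonstrates the data-understanding step of exploratory data analysis in R using the iris dataset: loading the data (from the built-in datasets package or from GitHub via RCurl), inspecting the data frame, computing summary statistics, checking for missing values, and summarizing by species with skimr. Tools used include R, RStudio, the RCurl and skimr packages, and GitHub.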
Full Transcript
Welcome back to the Data Professor YouTube channel. If you're new here, my name is Chanin Nantasenamat and I'm an associate professor of bioinformatics. On this YouTube channel I teach data science concepts and tutorials, so if you're into this kind of content, please consider subscribing.
Today we're going to start our first R data science project. You can get started by building your very own portfolio, which you can also share on your GitHub. This will be the portfolio for your data science projects and your data science journey. If you are embarking upon becoming a data scientist, this will be a very important part of your journey. It's kind of like documenting your journey by doing the projects that you learn from this channel, and as you do so you can save your progress to your GitHub. Over time, if you modify code, create new functions, apply your data science project to a new set of data, or tweak the parameters, you will have a growing portfolio on your GitHub.
Let's get started. Fire up Google Chrome or your internet browser and direct yourself to RStudio.cloud, or if you're comfortable using RStudio on your local computer, please do so. The thing that I like about RStudio.cloud is that I'm able to work anywhere on any platform: Windows, Linux, a Macintosh, even an iPhone or a tablet. You can use RStudio.cloud on any of these devices, and it will show a roughly working version even on an iPhone. The good thing is that you can access your code and your data on many devices.
So let's load in this iris project and get started with some data understanding. If you haven't gone through the six steps of doing a data science project, please go to one of the first videos in the Data Science 101 series; links will be down below. In this video I'm going to cover how you can perform the very first step of a data science project, which is to gain an understanding of your dataset. In this example we're going to use an established dataset called the iris dataset, which has already been used in one of our WEKA tutorials, where you were shown how to analyze that dataset; links to that are also down below. It will also be nice to compare and contrast the model generated using R with the one generated using WEKA.
You want to load in the dataset, and there are many ways of doing that. Let's go with method number one. The iris dataset is already available in one of the base packages of R, called the datasets package. To load it, you type the command library(datasets) and then data(iris). Notice that there is an object called iris created here, and if you type View(iris) you will see it as the data frame shown here.
We have already loaded the dataset using the first method, but I will show you how to load it using the second method. Let me clean that up: I'll click on the broom icon, which allows you to clear objects from your environment. I want to load in the library called RCurl. If you haven't yet got that library installed, you can type install.packages("RCurl") and hit Enter, and after you have successfully installed the package you load it using the library(RCurl) command. If you're uncertain whether you have this package, you can go to the Packages tab and find RCurl. Do I have it? Yes, RCurl is already installed on my computer.
So I'll load in the library RCurl and then retrieve the dataset directly from my GitHub. I will show you what that looks like: on github.com/dataprofessor, click on Repositories and then on data. Over time I intend to compile datasets, particularly popular datasets from all over the Internet, and I'll also provide links to the original locations from which each dataset was obtained. For your convenience it will all be in this data folder, so please come back and have a look in the future for more datasets. The iris.csv file looks like this. In my code I will retrieve this file in its raw form, so I can click on the link, copy that URL, and paste it here; it's the same URL that you saw a moment ago. I use the function getURL, and in parentheses and quotation marks I put the URL, which I can just right-click, copy, and paste right in here. It will read the CSV and assign the resulting content of the iris.csv file to the iris object. Let me run this command: I hit Ctrl+Enter and it runs, or alternatively you can click on the line that you would like to run and click on the Run button. Then you see that the object iris has appeared. You can click on it, or you can type View(iris), and you will see the data frame of the dataset right here.
Let me briefly show you how to use the data frame. If you type iris followed by a dollar sign, you will see the available variables in your data frame, and if you click on one of them and hit Enter, you will see the contents of that variable as a vector. On the keyboard you can hit the up arrow to bring back the previous command and just modify it: iris, dollar sign, and then a new variable. Let's say I choose Species, and then the species values are shown here. I have 150 flowers, so 50 setosa, 50 versicolor, and 50 virginica are showing right here. So this is the data frame. I can do neat stuff with this as well: I can assign this species column to a Species variable, and it will have the same values in it.
Now you have successfully imported the iris dataset into the environment and we're ready to begin. Let's go to the next stage, which is to display the summary statistics of your data. Your dataset is within this object called iris, so you might want to just type iris and see what happens: you're going to see the dataset. If I type View(iris), I see a spreadsheet-like view. There are four columns, sepal length, sepal width, petal length, and petal width, plus the species column. These four variables are the independent variables, which allow the prediction model to learn the characteristics of the different types of flower, of which there are three: setosa, virginica, and versicolor. On the basis of these four characteristics, the prediction model will be able to predict the type of flower.
Here we're going to do some basic summary statistics, but before that let's use the commands head and tail to see the first four or the last four lines of the dataset. I'll hit Ctrl+Enter, and here I see the first four rows and the last four rows. I can modify this number to five if I want to see five rows: there's one to five, and with tail, the last five rows.
The next stage is to look at the summary statistics of the iris dataset, so let's type summary(iris). For each of the variables, let's say starting from sepal length, I see the minimum value, which is 4.3, the maximum value, which is 7.9, the first quartile, the third quartile, the mean, and the median. For each of the four variables I see the same statistics, and for species, which is the class label, I see that there are 50 flowers for each class. With the summary command I can also select a specific variable, as I have shown you, by using the dollar sign. Here we show only the sepal length, so you're going to see the summary stats of only the sepal length.
A very handy command that lets you see whether your dataset has any missing data is this one. You use the sum function, and inside its parentheses you type the is.na command, which finds whether there are any values containing NA; NA means a missing value in your dataset. This returns zero, which means that there are no missing values in the data. In an actual dataset there might be some missing values, and if you have missing values in your dataset you will have to handle them in a way that allows you to do sound analysis. I will show you that in future videos.
There is a package called skimr, which expands on the summary function by providing a larger set of statistics. If you don't have it, you can install it by typing install.packages("skimr"), s-k-i-m-r, and more details are provided at this link. I'm going to share this code on GitHub in the code folder, and I'll put it in the comments down below, so check out the description of this video for the link to this R code file and the accompanying iris.csv file as well.
Let's load in the library skimr and see what it does. I type the command skim(iris), and I see the summary statistics. The name of the dataset is iris; there's a total of 150 rows and a total of five columns. It detects that there is one factor, which is the species with its three levels, and there are four numeric variables. For the species there are three class labels with fifty flowers each. For each of the variables, sepal length, sepal width, petal length, and petal width, this column gives the number of missing values, which is zero, so there's nothing missing. Then there's the mean value, the standard deviation, the various quartiles of your data, and a rough histogram, so you get to see a rough distribution of your data. You can see that there are two populations for both petal length and petal width, because the bars are separated.
Let's say that we want to use the skim function grouped by species, because there are three flower types, and for each flower type we want to know the mean value and the median value. I can do that by using this command, so I'll highlight it and hit Ctrl+Enter. Here you go: for the first variable, sepal length, I see that setosa has a mean value of 5, versicolor has a mean value of 5.94, and virginica has a mean value of 6.59. So for sepal length, virginica has the highest mean, whereas setosa and versicolor have lower means than virginica. For sepal width, I can see that setosa has a higher mean than both versicolor and virginica, which have roughly similar mean values. For petal length, I can see that setosa has a significantly lower mean than both versicolor and virginica, and setosa also has a significantly lower mean for petal width. From this I can get a rough idea of how the dataset is distributed.
Thank you for watching. Please like, subscribe, and share, and I'll see you in the next one. In the meantime, please check out these videos.
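The base R steps described in the transcript can be sketched roughly as follows. This is a minimal reconstruction from the spoken description, not the author's actual script. The helper load_iris_from_url is a hypothetical name, its url argument is deliberately left for the reader to supply, and the aggregate() call is a base R stand-in for the grouped skim() shown in the video.

```r
# Method 1 from the transcript: iris ships with the built-in datasets package.
library(datasets)
data(iris)

# Method 2 (hypothetical helper, not called here): read a raw CSV from a URL
# with RCurl::getURL(). The url argument is an assumption left to the reader.
load_iris_from_url <- function(url) {
  read.csv(text = RCurl::getURL(url))
}

# Peek at the first and last four rows, as done with head() and tail().
first_rows <- head(iris, 4)
last_rows  <- tail(iris, 4)
print(first_rows)
print(last_rows)

# Access one column with the dollar sign and count flowers per species.
species_counts <- table(iris$Species)
print(species_counts)

# Summary statistics for the whole data frame and for a single variable.
print(summary(iris))
print(summary(iris$Sepal.Length))

# Count missing values across the data frame; 0 means nothing is missing.
n_missing <- sum(is.na(iris))
print(n_missing)

# Per-species means of the four numeric variables (base R stand-in for
# the grouped skim() call in the video).
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)
```

Running the sketch in a fresh R session needs no extra packages, since RCurl is only touched if load_iris_from_url is actually called.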
Original Description
In this video, I provide a quick overview of how you can gain data understanding by performing exploratory data analysis.
🌟 Buy me a coffee: https://www.buymeacoffee.com/dataprofessor
⭕ Timeline
0:19 Kicking off the first R Data Science Project "Exploratory Data Analysis"
0:52 Growing your Data science portfolio
1:18 Fire up RStudio.cloud or RStudio
2:32 Perform "Exploratory Data Analysis" in order to gain "Data Understanding"
2:57 Load the Iris dataset (Several ways shown)
3:37 Load Iris Method 2: Download using getURL()
6:33 Playing around with iris dataframe
7:52 Summary statistics
9:27 Look at summary statistics using summary()
10:29 Summation of missing data
11:39 Code shared at https://github.com/dataprofessor/code/tree/master/iris
11:52 skimr library provides more in-depth summary statistics of the dataset
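For the skimr step at the end of the timeline, a hedged sketch might look like this. It assumes the skimr and dplyr packages are installed (the guard skips the body otherwise), and the grouping via dplyr::group_by is one common way to get per-species summaries, which may not match the exact command used in the video.

```r
# iris comes from the built-in datasets package, attached by default.
by_species <- NULL

# Assumption: skimr and dplyr are installed; skip quietly if they are not.
if (requireNamespace("skimr", quietly = TRUE) &&
    requireNamespace("dplyr", quietly = TRUE)) {
  library(skimr)
  library(dplyr)

  # Overall summary: missing counts, mean, sd, quartiles, inline histograms.
  print(skim(iris))

  # The same summary computed separately for each species.
  by_species <- iris %>%
    group_by(Species) %>%
    skim()
  print(by_species)
}
```

If the packages are missing, install.packages("skimr") and install.packages("dplyr") would be needed first.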
⭕ Playlist:
Check out our other videos in the following playlists.
✅ Data Science 101: https://bit.ly/dataprofessor-ds101
✅ Data Science YouTuber Podcast: https://bit.ly/datascience-youtuber-podcast
✅ Data Science Virtual Internship: https://bit.ly/dataprofessor-internship
✅ Bioinformatics: http://bit.ly/dataprofessor-bioinformatics
✅ Data Science Toolbox: https://bit.ly/dataprofessor-datasciencetoolbox
✅ Streamlit (Web App in Python): https://bit.ly/dataprofessor-streamlit
✅ Shiny (Web App in R): https://bit.ly/dataprofessor-shiny
✅ Google Colab Tips and Tricks: https://bit.ly/dataprofessor-google-colab
✅ Pandas Tips and Tricks: https://bit.ly/dataprofessor-pandas
✅ Python Data Science Project: https://bit.ly/dataprofessor-python-ds
✅ R Data Science Project: https://bit.ly/dataprofessor-r-ds
⭕ Subscribe:
If you're new here, it would mean the world to me if you would consider subscribing to this channel.
✅ Subscribe: https://www.youtube.com/dataprofessor?sub_confirmation=1
⭕ Recommended Tools:
Kite is a FREE AI-powered coding assistant that will help you code faster and smarter. The Kite plugin integrates with all the top editors and IDEs to give you