What Do A Data Scientist Do?
Key Takeaways
Describes the role and responsibilities of a data scientist in a machine learning project
Full Transcript
Hello all, my name is Krish Naik and welcome to my YouTube channel. Today we'll be discussing what a data scientist does. We are going to go through all the activities a data scientist performs in a machine learning use case, or a deep learning use case. I have noted down the steps, and together they cover the whole pipeline of a data science project life cycle.
The first step is data collection, and it is very important. Every data scientist working on a machine learning use case has to participate in data collection. Stakeholders will also be involved, and they will have domain knowledge about the project. With their help we extract the data, and they will point us to a lot of sources: it may be a third-party API, it may be web scraping, and so on. Some third-party APIs are paid and some are free. With web scraping, if the website permits it, we can extract the data directly. Whatever is collected at this stage is raw data; it will not be clean.
In the second step, the team moves on to data preparation. Data preparation means that we clean the data: we remove the noise and put the data into a format that machine learning algorithms can use easily.
When I say cleaning the data, remember that we start from raw data. It may be in JSON format, it may be in XML format. We convert it into CSV or some other convenient format, say an Excel sheet, so that the data is represented in tabular form. Data preparation involves a lot of pandas and NumPy arrays, and all the functionality pandas gives you for working with DataFrames and Series.
Always remember, guys, that just before this, in the data collection stage, the big data engineering team also participates. After the data is collected, they may store it in Hadoop or another big data store, in a NoSQL database, or in a relational database, depending on the requirements of the project.
After data preparation we do exploratory data analysis (EDA). When we move to exploratory data analysis, we first have to perform the feature engineering activities. Feature engineering is involved in both stages: in data preparation you handle null values and clean the data, and in exploratory data analysis you additionally apply statistical analysis to the data.
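
The JSON-to-table conversion described above can be sketched as follows. This is a minimal illustration, not the workflow used in the video: the records, the field names, and the choice to fill missing values with the median or an "unknown" label are all assumptions made for the example.

```python
import io
import json

import pandas as pd

# Hypothetical raw records, as they might come back from an API or a scraper.
raw_json = """
[
  {"id": 1, "customer": {"name": "Asha", "city": "Pune"}, "amount": 120.5},
  {"id": 2, "customer": {"name": "Ravi", "city": null}, "amount": null},
  {"id": 3, "customer": {"name": "Meera", "city": "Delhi"}, "amount": 87.0}
]
"""

records = json.loads(raw_json)

# Flatten the nested JSON into a table: one row per record, one column per field.
# Nested fields become dotted column names such as "customer.city".
df = pd.json_normalize(records)

# Handle null values (assumed policy): median for the number, a label for the text.
df["amount"] = df["amount"].fillna(df["amount"].median())
df["customer.city"] = df["customer.city"].fillna("unknown")

# Write the cleaned table as CSV, into an in-memory buffer here instead of a file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file path to `to_csv` instead of the buffer would write the same table to disk.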
Statistical analysis helps you understand the data, and that is very important when you are solving any machine learning use case. You may have collected millions of records, and you need to see how that data is distributed, for example through a probability density function or through histograms. Libraries like seaborn and Matplotlib are used in exploratory data analysis for this.
Some very important things come up during exploratory data analysis, like handling missing values and handling bad data. Suppose some entries in the dataset are represented differently from the rest. Let me give you an example: the missing entries are NaN values in most records, but one record has a question mark instead. These are the kinds of things we clean up during data preparation.
In exploratory data analysis we again apply statistical analysis to understand how the data behaves, and a lot of visualizations will be created. You may use a pair plot in seaborn, a box plot, a violin plot, and other kinds of plots to understand how the data is distributed. After that comes evaluating and interpreting the EDA results.
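
As a small sketch of the points above, the snippet below treats a question mark as a missing value and then summarizes how the column is distributed. The column name and values are made up for illustration, and the histogram is computed as plain counts with NumPy rather than drawn with seaborn or Matplotlib.

```python
import numpy as np
import pandas as pd

# Hypothetical column in which missing values show up both as None and as "?".
raw = pd.DataFrame({"age": ["23", "35", "?", None, "41", "29", "35", "62"]})

# Convert to numbers; anything that cannot be parsed (such as "?") becomes NaN.
age = pd.to_numeric(raw["age"], errors="coerce")

missing_count = int(age.isna().sum())
summary = age.describe()  # count, mean, std, min, quartiles, max

# A histogram as plain counts: how many values fall into each of 4 equal-width bins.
counts, bin_edges = np.histogram(age.dropna(), bins=4)

print("missing values:", missing_count)
print(summary)
print("histogram counts:", counts)
print("bin edges:", bin_edges)
```

Note that `errors="coerce"` hides every unparseable value, not only "?", so on real data it is worth counting how many entries were coerced before deciding how to fill them.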
As you know, in every use case there is a stakeholder involved, and you have to specify what you are trying to do. Suppose you want to replace some NaN values with something else. That is only possible after exploratory data analysis, because by then you understand how the data is distributed. You have to present those results: why you are replacing the values, on what basis, and what the distribution of the data looked like. You share this with the stakeholders so that they understand what you are doing in this step. So you are evaluating and interpreting the EDA results and sharing them with the stakeholders, so that you can move ahead to the next step, which is model building and model testing.
And remember, guys, these three steps will take more than 35% of the time in a project life cycle. If 6 months are available for the project, more than 2 months will go into these three steps, because you have to clean the data, do exploratory data analysis, and apply different statistical analyses to perform feature engineering.
After feature engineering you also have to do feature selection, because not every feature will be required to solve the machine learning use case. You apply statistical tools like Pearson correlation, or a model such as an extra trees classifier, to understand which features are most important.
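
The two feature selection tools mentioned above can be sketched with pandas and scikit-learn. The dataset here is synthetic (generated with `make_classification`), the feature names are placeholders, and keeping the top 3 features is an arbitrary choice for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data: 8 features, of which only the first 3 carry signal.
X_arr, y_arr = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = [f"f{i}" for i in range(X_arr.shape[1])]
X = pd.DataFrame(X_arr, columns=feature_names)
y = pd.Series(y_arr, name="target")

# 1) Absolute Pearson correlation of every feature with the target.
pearson = X.corrwith(y).abs().sort_values(ascending=False)

# 2) Feature importances from an extra trees classifier.
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)
importance = pd.Series(model.feature_importances_, index=feature_names)
importance = importance.sort_values(ascending=False)

# Keep the top-k features by importance (k is a judgment call).
top_k = 3
selected = importance.head(top_k).index.tolist()

print(pearson.round(3))
print(importance.round(3))
print("selected features:", selected)
```

Pearson correlation only captures linear relationships with the target, which is one reason to look at a tree-based importance alongside it.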
We cannot take all the features. Suppose my data collection gives me a thousand columns. Should I use all thousand columns to solve this machine learning use case? No. Feature selection tells us which independent features are directly correlated with our dependent feature. There are a lot of other feature selection techniques as well.
After that, my data is clean and ready, perhaps in a table format, with all my independent features and my dependent feature. The next thing is model building. In model building I also perform hyperparameter optimization. Suppose I have decided to apply XGBoost. There are various ways to decide which model is better; you can perform cross-validation on multiple models. But remember, guys, it also depends on the use case. I have made a video on which model to select for which kind of use case; have a look at it, it is in my machine learning playlist.
So in model building we select one algorithm and perform hyperparameter optimization on it. We perform cross-validation, such as k-fold cross-validation or stratified k-fold cross-validation, and we find the accuracy. Remember that accuracy alone is not enough, guys. We also have to look at the confusion matrix, the ROC score, and so on, and decide whether those numbers are good or not.
I missed one more point: during the feature engineering process, in those three stages, we also have to handle imbalanced datasets.
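
The model building steps above can be sketched with scikit-learn. To keep the example free of extra dependencies it uses `GradientBoostingClassifier` as a stand-in for XGBoost; the synthetic data, the parameter grid, and the 75/25 split are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic, mildly imbalanced data (roughly 80% / 20%) standing in for a real dataset.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4,
    weights=[0.8, 0.2], random_state=0,
)

# Hold out a test set; stratify keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hyperparameter optimization with stratified k-fold cross-validation.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid, cv=cv, scoring="roc_auc",
)
search.fit(X_train, y_train)

# Model testing on the held-out data: report more than accuracy.
best_model = search.best_estimator_
y_pred = best_model.predict(X_test)
y_score = best_model.predict_proba(X_test)[:, 1]

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)

print("best params:", search.best_params_)
print("accuracy:", round(accuracy, 3))
print("confusion matrix:\n", cm)
print("ROC AUC:", round(auc, 3))
```

On imbalanced data a model can reach high accuracy by mostly predicting the majority class, which is why the confusion matrix and the ROC AUC are printed next to it.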
If the dataset is imbalanced, it may affect the algorithm I am using, because the algorithm can become biased toward one type of output.
After that we do model testing. In model testing, we take the test data or validation data that we held out from the real dataset, run the model on it, and see how the model performs.
Once we see that the accuracy is good and the test results are good, we finally go to model deployment. For model deployment you have various tools. One is Flask, and you should know the Flask framework because it is essential, guys. Even if you are using Docker, Kubernetes, or an EC2 instance, it is better to know Flask, because it helps you create REST APIs, and a REST API can be consumed from any front-end application. I have already made videos on model deployment using Flask. I have not shown it on AWS, but if you know how to deploy a model on your local computer with Flask, it is the same thing: you just move the files to your AWS EC2 instance or another instance.
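
A Flask REST API around a model might look like the sketch below. It is not the code from the deployment videos mentioned above: the `/predict` route, the request schema, and the tiny logistic regression trained at startup are all hypothetical, and a real service would load a saved model instead. The endpoint is exercised with Flask's built-in test client, so no server has to be started.

```python
from flask import Flask, jsonify, request
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a tiny stand-in model at startup; a real service would load a saved model.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

app = Flask(__name__)


@app.route("/predict", methods=["POST"])
def predict():
    # Expected body (hypothetical schema): {"features": [f1, f2, f3, f4]}
    payload = request.get_json(silent=True) or {}
    features = payload.get("features")
    if not isinstance(features, list) or len(features) != 4:
        return jsonify({"error": "expected 'features' with 4 numbers"}), 400
    prediction = int(model.predict([features])[0])
    return jsonify({"prediction": prediction})


# Exercise the endpoint in-process with Flask's test client.
client = app.test_client()
response = client.post("/predict", json={"features": [0.1, -1.2, 0.5, 2.0]})
bad_response = client.post("/predict", json={"features": [1.0]})

print(response.status_code, response.get_json())
print(bad_response.status_code, bad_response.get_json())
```

To serve real traffic locally, the same `app` object can be started with `app.run()` or the `flask run` command; a front-end application would then send the same JSON body over HTTP.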
There are newer techniques coming up for model deployment, like Kubernetes and many more, and you have to explore those a lot.
Finally, after model deployment, we try to optimize the model. How? We take a time window like 1 month or 15 days and continuously check whether the accuracy is good on the real-world data we get after the model is deployed to production.
In short, the data scientist participates in all of these things, and in every step there is a whole lot of learning, guys. The most difficult part will be the feature engineering across those three stages around exploratory data analysis, because you need to apply a lot of statistical tools. You will learn a lot there, because you need to know how to play with the data. That is the most important thing.
These are the basic steps. If, after model optimization, your model is not giving good results, you may have to start the cycle again, and it continues until you get a model that performs well. It may also happen that the data changes in the future. Suppose I have deployed version one of my model, but after some time the data changes: the features are the same, but the values are different. So after, say, 2 to 3 months, we create the next version of the model using the newly collected real-life data along with the existing data. We train the model again and deploy the next version. We always keep the different versions so that we can roll back if one version is not working well.
So in short, these are the basic steps of what a data scientist does. You are involved in a lot of discussion with the stakeholders in the early stages, which is very important, guys, and it also helps you increase your domain knowledge, because you should know what kind of data you need to solve a use case. Model building and model testing are also very important, yes, but I think the first four steps are the most important. Many people are comfortable with model building and model testing but fall short in these steps, so you have to be very good at them. If you want to get good at this, learn Python and learn pandas. Taking Python as the example programming language, always have a very good understanding of pandas: know what DataFrames are and how you can work with them in different ways, whether it is Series, DataFrames, and so on.
So this was all about this video. I hope you liked it. I'll see you all in the next video. Please do subscribe to the channel and share it with all your friends. Thank you.
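
The idea of checking a deployed model against fresh data and rolling back to an earlier version can be sketched in plain Python. Everything here is hypothetical: `ModelRegistry`, `monitor`, the 0.8 accuracy threshold, and the toy "models" (simple functions) are illustrative names and values, not part of any library or of the workflow shown in the video.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ModelRegistry:
    """In-memory registry; a real project would persist versions to storage."""
    versions: Dict[int, Any] = field(default_factory=dict)  # version -> model
    history: List[int] = field(default_factory=list)        # order of deployments

    def deploy(self, model: Any) -> int:
        version = len(self.versions) + 1
        self.versions[version] = model
        self.history.append(version)
        return version

    @property
    def active(self) -> Optional[int]:
        return self.history[-1] if self.history else None

    def rollback(self) -> Optional[int]:
        # Make the previously deployed version active again, if there is one.
        if len(self.history) > 1:
            self.history.pop()
        return self.active


def accuracy(model, samples):
    correct = sum(1 for x, label in samples if model(x) == label)
    return correct / len(samples)


def monitor(registry, samples, threshold):
    """Score the active model on fresh labelled data; roll back if below threshold."""
    score = accuracy(registry.versions[registry.active], samples)
    if score < threshold:
        registry.rollback()
    return score


# Toy models: version 1 uses a sensible cut-off, version 2 a poor one.
registry = ModelRegistry()
registry.deploy(lambda x: x > 5)
registry.deploy(lambda x: x > 50)

# Fresh labelled data collected after deployment (made up for the example).
fresh_data = [(1, False), (4, False), (7, True), (9, True)]
score = monitor(registry, fresh_data, threshold=0.8)

print("score of version 2:", score)
print("active version after check:", registry.active)
```

In practice the check would run on a schedule (for example every 15 days or every month, as described above), and each new version would be trained on the newly collected data together with the existing data before being deployed.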
Original Description
Hello All,
In this video we will look at what a data scientist does.
You can buy my book on Finance with Machine Learning and Deep Learning from the below url
amazon url: https://www.amazon.in/Hands-Python-Finance-implementing-strategies/dp/1789346371/ref=sr_1_1?keywords=krish+naik&qid=1560943725&s=gateway&sr=8-1
Connect with me here:
Twitter: https://twitter.com/Krishnaik06
Facebook: https://www.facebook.com/krishnaik06
instagram: https://www.instagram.com/krishnaik06
Subscribe to my unboxing channel:
https://www.youtube.com/channel/UCjWY5hREA6FFYrthD0rZNIw
Below are the various playlists created on ML, Data Science and Deep Learning. Please subscribe and support the channel. Happy Learning!
Deep Learning Playlist: https://www.youtube.com/watch?v=DKSZHN7jftI&list=PLZoTAELRMXVPGU70ZGsckrMdr0FteeRUi
Data Science Projects playlist: https://www.youtube.com/watch?v=5Txi0nHIe0o&list=PLZoTAELRMXVNUcr7osiU7CCm8hcaqSzGw
NLP playlist: https://www.youtube.com/watch?v=6ZVf1jnEKGI&list=PLZoTAELRMXVMdJ5sqbCK2LiM0HhQVWNzm
Statistics Playlist: https://www.youtube.com/watch?v=GGZfVeZs_v4&list=PLZoTAELRMXVMhVyr3Ri9IQ-t5QPBtxzJO
Feature Engineering playlist: https://www.youtube.com/watch?v=NgoLMsaZ4HU&list=PLZoTAELRMXVPwYGE2PXD3x0bfKnR0cJjN
Computer Vision playlist: https://www.youtube.com/watch?v=mT34_yu5pbg&list=PLZoTAELRMXVOIBRx0andphYJ7iakSg3Lk
Data Science Interview Question playlist: https://www.youtube.com/watch?v=820Qr4BH0YM&list=PLZoTAELRMXVPkl7oRvzyNnyj1HS4wt2K-