Visualizing Data Trends and Correlations with Matplotlib & Seaborn
Skills:
Data Literacy: 90%
Key Takeaways
Visualizes data trends and correlations using Matplotlib and Seaborn in Python
Full Transcript
Okay, I'm really sorry. So today we are going to start with data analysis, and we'll discuss data visualization as well. But before going ahead, we are going to have a basic discussion on whatever we did in the last class. On top of that, some of you are asking where to get the exact notebook link and where you can get access to the recordings. The recording is available on the GeeksforGeeks main YouTube channel, because this session is also live on YouTube right now. So go to the GeeksforGeeks YouTube channel, go to the live streams, and you will find yesterday's live stream there. As for yesterday's Google Colab notebook, I'm going to share the link here as well. I shared it in the last class, but I'll make sure I share it on YouTube too. So before starting with this session, let me share my screen with you. I hope the screen is visible to you. Okay, that is great. In the last class we discussed the flow of a data analysis project and how projects start. A project starts with an agenda, a problem statement. Then we find the key performance indicators. The key performance indicators are nothing but the questions that can be answered, and based on the answers to those questions we can make decisions. So we find the key performance indicators that can fulfill the agenda, that can fulfill the problem statement. In the KPI part, we usually have questions that have to be answered.
Then we are going to collect the data: what data do we have, or what data do I need, to answer these questions based on the KPIs. Once I have collected the data, I'll explore it. In the data exploration part, as we discussed, we explore the data: removing the unwanted columns, null values, and duplicate values, and exploring each and every column individually and checking their data types. These are the basic steps. At the extreme end we can also check things like the correlation and the mean, though that is more on the data cleaning and post-processing side, but in general these are the steps we have in data exploration. Then comes data cleaning. From the data exploration part I got to know that these are the things that have to be done on this column and these are the things that have to be done on that column, so in the data cleaning part we actually clean the data. Once my data is clean, I'm going to analyze it, and once my data is analyzed, I can build the visualizations, because that is the best way to represent the data. So this is basically a six- or seven-step process, and every step depends on the step before it. To visualize the data, I first need to have analyzed the data. To analyze the data, the data should be clean, otherwise I cannot analyze it; we discussed that in the last class as well. To clean the data, we first need to know what exactly needs to be cleaned, so that is data exploration. To explore the data, I first need to have the data, so that is collection of the data. And what data has to be collected? The internet is full of data, so what data is important for you?
For that purpose, you need to know that you want the data that can answer the questions, that can answer these key performance indicators. And how will I find these key performance indicators? Based on the problem statement. So this is the flow of a data analysis project. After the data visualization part, depending on the problem statement, you can go for dashboard building, for deployment of the project, or for building a presentation. But this is what we had in the last class: the flow of the data analysis project. If anyone has any doubt, any question, any query, do let me know in the chat. I am going to take some time in between to answer your queries so that everything gets revised. "Sir, please recap the previous session." Yes, I am going to recap the previous session as well. "Could you please share the notes?" Yes, I shared the notes in the last class and I'm going to share them with you today as well. I'll make sure I put them in the description on YouTube so that anyone watching this video directly on YouTube can also have a look. I hope all of you were a part of yesterday's session. For those who were not and don't know how a data analysis project works, this might be a bit difficult. So I strongly recommend you watch the last video and then come back to this one, but if you were already part of the last session, that's totally fine. "Please explain KPI." A KPI is nothing but a key performance indicator: the parameters through which the problem can be solved. "Exploration and cleaning?" In the exploration part we basically check what has to be cleaned, what has to be done on the data.
So in the exploration part we are just exploring and checking what has to be done, and in the data cleaning part we actually do it. Now, these are some of the questions that I am putting under key performance indicators, the questions that have to be answered. Usually we have one specific kind of KPI, but here I'm taking different kinds of queries so that you get a taste of each. You can go for a skill-based analysis, where all the questions depend on skills. You can go for another agenda, an overall analysis of the whole data set: total jobs, total companies, average salary, minimum salary, maximum salary. You can also go for any specific column of your choice, for example location-based or experience-based analysis. So I'm going to take a couple of questions of each kind so that you get a taste of all the different things that have to be answered. Now, these questions are something you will have before you have the data set, because they depend on the agenda or problem statement. Based on the agenda we find the key performance indicators, which are basically the questions that have to be answered, and once we have these questions, I know which data I will need. I need the total number of jobs. Along with that I need the companies associated with them. I need the job post, that is, which particular role they are hiring for. I obviously need the company name. I need the experience they are hiring for. I also need the salary column. And when I'm talking about the top five companies, how will you find the top five companies?
I can find the top five companies based on different factors. I can find them based on their ratings, based on their reviews, or based on a lot of other things as well. So I need the rating and review columns too. So in the second step, finding the KPIs, we write the questions, and based on these questions we find what data is important. Then I'll collect the data, or basically ask the company to give me the data. Once I have the data, I might have some unwanted data as well. If the agenda is to answer these questions, I might not need a column like when exactly the job was posted, so I can remove that column. So in the data exploration part we explore the data and remove everything unnecessary. After that I clean the data and analyze the data, and in the analysis these questions get answered. After that we visualize the data as well, and there are a lot of ways to visualize it. Now, someone says, "I think he's using a demo data set, a fake one." First of all, this is not a dummy data set. This is a real data set. Let me go here and show you. Inside the data set I have dropped this particular link, but if I show you the data frame and you look closely, all these jobs come with a Naukri.com job posting link. You can check any column of your choice. For example, in the job link column, these are the job links we have. Take any of them: this is the full link of the job, so you can copy it and try pasting it. Now, this is an old data set.
So some of the jobs that were posted might have expired. But this data was collected from Naukri.com only. So yes, this is not dummy data; I just wanted to show you that. Now, this is something we did already: I'm cloning the repository where the data set actually resides. This is the link, the job analysis case, and I was using this data set repo. In this repository I have a folder named job posting, and inside it I have jobs.csv. Once the repository is cloned, I go to the data sets and job posting folders, copy the path, and paste the path here so that my data set is loaded. I'm also using a library that we did not use yesterday, which is warnings. warnings is a library that lets you disable warnings. Whenever you are doing any cleaning, there is a chance that in some cases you will get warnings, so this warnings filter set to ignore will stop them from being displayed. After that, these are a couple of things that we are done with. We have checked the null values, as mentioned here. We explored the data, removed the unwanted columns, and checked the duplicates, and that is it. After that we explored each and every column and got to know what exactly has to be dealt with. In the data cleaning part we have written some code: this is the code I wrote to clean the experience, location, rating, and reviews columns. And it's not only about replacing things, because if I talk about just the reviews column, after executing this the reviews column is not cleaned yet. If I go for df of reviews, this is what the reviews column is storing right now.
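The loading, warning-suppression, and first exploration steps described above can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code: the repository path is not reproduced here, so a small in-memory CSV stands in for jobs.csv, and the column names (job_id, company, salary) are assumptions made for illustration.

```python
import io
import warnings

import pandas as pd

# Suppress warnings so they are not displayed during cleaning.
warnings.filterwarnings("ignore")

# Hypothetical stand-in for the cloned jobs.csv; the column names are assumed.
csv_text = """job_id,company,salary
101,Alpha Ltd,"2,00,000 - 3,00,000 PA."
102,Beta Corp,Not disclosed
102,Beta Corp,Not disclosed
103,Gamma Inc,
"""

# With the real file you would pass its path to pd.read_csv instead.
df = pd.read_csv(io.StringIO(csv_text))

null_counts = df.isnull().sum()            # null values per column
df = df.drop_duplicates(subset="job_id")   # keep one row per job ID
```

With the real data you would inspect null_counts first and only then decide which columns or rows to drop.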
So your task is to clean the reviews column, and we are going to do that as well. So far we have just dealt with the null values part of it. Dealing with null values, dealing with duplicates, changing the data type of job ID, changing the salary column: this part has to be done, and along with that let me also write down two more points, including cleaning the rating column, or rather the review column. So these are a couple of things that we are already done with. Today I have pending tasks for the location and review columns. The review column is pretty simple; the only complex one is the salary. And then we are going to answer all these questions. There are a couple of answers that can be written as they are, because they are just numbers, for example how many jobs or how many companies there are. And there are a couple of things for which I'm going to plot graphs and make some visualizations. To make the visualizations I'm going to use libraries like Matplotlib, pandas, Seaborn, and so on. In the meantime, if anyone is facing any difficulty, do let me know. "Colab link, please." It would be better if I share the Colab link at the end of the session, because I don't want you to get confused in between. I want all of your attention exactly here. So now let me run all the cells again so that everything is executed up to this point. Data exploration is done, and in the data cleaning part the first four parts are done: I've got the minimum and maximum experience. Now I'm going to clean the salary column. To clean the salary column, first of all let's look at the salary column so that we can check how exactly salary is stored.
So here you can see we have the minimum salary, we have the maximum salary, and we also have cases where the salary is not disclosed. So what exactly do I need to do? Salary is an important column, because there is a question that depends entirely on the average salary for a specific number of years of experience. If I want to find the average salary of a data analyst with 2 years of experience, I need this salary in a numeric format, an integer format. Right now it's in text format, so I cannot perform arithmetic operations on it. Now let's see how I can clean this column. For that, let's check what different kinds of unique values we have. So these are all the unique values we have here. You can see we have "Not disclosed". We have two values, the left one and the right one. We have two values along with a note like "including 20% variable", so we need to think about how to deal with that part as well. We have cases where there is just one salary rather than two. We have a single salary like 50,000 PA, which is about 4,000 per month. And along with that we also have salaries like 9.5 Cr. How can I convert all of these into a minimum and maximum salary, so that I can perform the same kind of operations as I did on minimum experience and maximum experience? That is what I want to check. So you can see there are a lot of cases here. And I've told you earlier as well, data analysis is 80% data cleaning. You need to work a lot on cleaning the data, because if the data is clean, data analysis is pretty easy, but if the data is not clean, it is going to take a lot of effort to clean it.
And in tomorrow's session we are going to use some LLM tools, and not only LLM tools but some AI tools in general, that will help you analyze the data. Then we will check exactly how efficient these tools are at this point in time: are they efficient enough to replace the job of a data analyst? We are going to answer that tomorrow. Right now, if I write down some of the test cases, there are a couple with just "Not disclosed" written. There are a couple where I have just the minimum salary. There are a couple where I have the minimum and maximum salary. There are a couple where I have the minimum and maximum with a variable part. And not only that, we also have cases like "Any Graduate". What will you do in that case? That is something out of the ordinary. So I'm going to do a very simple thing. Most of the values are either a single number or two numbers separated by a dash. "LLMs are large language models." Yes, they are. Thanks, Mi. Thanks, N. So now we are going to see how we can clean it. First of all, all of them have this "PA." part in common, so let's replace this part first: .str.replace of "PA." with nothing. And I'm not going to replace just "PA."; I'm going to replace " PA." with the leading space, because it comes with a space every time. So let's replace it this way. This is what I have now: that part is removed.
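The suffix removal described above might look like the sketch below. The sample strings are invented to resemble the cases mentioned in the session, and the exact suffix text (" PA." with a leading space) is assumed from the spoken description.

```python
import pandas as pd

# Hypothetical salary strings resembling the cases described in the session.
salary = pd.Series([
    "2,00,000 - 3,00,000 PA.",
    "50,000 PA.",
    "Not disclosed",
])

# regex=False treats " PA." as plain text, so the dot is a literal dot
# rather than a regular-expression wildcard.
without_pa = salary.str.replace(" PA.", "", regex=False)
```

Passing regex=False explicitly keeps the behavior the same across pandas versions, since the default for this parameter has changed over time.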
Now you can also check by calling unique to see what exactly we have, and you can see the whole PA part is gone. That's nice. Next, I'm going to remove the variable part as well, because for now it is not that useful: we only care about the minimum and the maximum salary, for finding the average or anything in that range, and how would you evaluate the variable part as a maximum anyway? All of these variable notes start with an opening bracket; if a value has one, it starts with a bracket, otherwise it isn't there. So let's use .str.split: I'm going to split all of them on this opening bracket. If I execute it right now, it doesn't show you anything different. But if I then take .str[0] and check the unique values, you can see that everything that came after the bracket is removed. So what exactly is being done? Let me give an example. Suppose someone has the salary written with something like "(20% variable)" at the end, and this whole thing is a string. I split it on the opening bracket, so I get two values, with the variable part at the end, and then I take the zeroth index. So the variable part is gone from all of them, and that's it. Now you can see most of them are numbers. There are a couple with B.Tech and BCA written, but we will deal with them separately. That's the best thing we can do at this point. We also have people who have written something like this, so what do we do in this case? There are people who have written CA, MBA, PGDM, and all these kinds of things in the salary.
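As a small illustration of the bracket split described above, here is a sketch on invented values. The wording of the variable note is an assumption; only the opening bracket matters for the split.

```python
import pandas as pd

# Hypothetical values: one with a bracketed variable note, one without.
salary = pd.Series([
    "4,00,000 - 6,00,000 (Including 20% variable)",
    "3,00,000 - 5,00,000",
])

# Split each string on the opening bracket and keep the first piece,
# which drops the variable note wherever it exists.
base_salary = salary.str.split("(").str[0]
```

Note that the first value keeps a trailing space after the split, which is why a later strip step is still needed.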
So, cleaning this kind of data looks difficult. But it's not difficult, it's just a bit complex, and once you understand it, it's pretty simple. And you can see that all the cleaning I'm writing here is being done in this one line. Once this part is done, the next step is to take this dash and split on it, so .str.split on the dash. Once I have split it, I'll have two numbers: this is the first one, this is the second one. If I want to extract just the first one, I take index zero. Not like this, which would take row zero of the column, but .str[0], which takes the zeroth element for every row. In exactly the same way, if I take index one, I get the second element, or I can use minus one, which is the last index. Going for the last index does not change much when there is only one number, because it takes the same element that the zeroth index takes. I hope this is clear to you. It might look a bit complicated, but I'll explain it step by step in the detailed part. So this is what we get if I take the zeroth index, and this gives you df of minimum salary. If I execute df of minimum salary, this is what we have. Let's check its unique values now. As you can see, we have a lot of numbers now. It's looking a lot cleaner, and the number of unique values is a lot smaller. It still has some issues. I still have values like "9.5 Cr and above", which need to be dealt with separately.
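The dash split and the indexing choices described above can be sketched like this, again on invented sample values. The variable names are hypothetical.

```python
import pandas as pd

# Hypothetical values after the suffix and variable note have been removed.
salary_range = pd.Series(["2,00,000 - 3,00,000", "50,000", "Not disclosed"])

parts = salary_range.str.split("-")

min_salary = parts.str[0]    # first piece of every row
max_salary = parts.str[-1]   # last piece of every row
second_piece = parts.str[1]  # missing (NaN) when a row has only one piece
```

Using -1 rather than 1 for the maximum means single-value rows fall back to that same single value instead of becoming missing.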
The same goes for "Any Graduate", for "Graduation Not Required", for "Any Postgraduate", and categories like these, but we can deal with them separately. Now we have a lot fewer categories. And you can see there are a lot of extra spaces here, so we can also use .str.strip so that the extra spaces are removed. And you can see the extra spaces are removed. You can also write .str.replace to replace the commas with nothing. So now this is what we have: the salaries are a lot cleaner. Next, I can pick out just the categories that I don't want here. For example, if someone has written "Any Graduate", I can take that category and replace it. For now I'm writing all of these in a list. For example, there are also entries giving "less than 50,000", and what to do with those is something we need to think about. So "less than 50,000", "Graduation Not Required", and so on: I can make a list of all of them, and once I have the list, I can replace all of them with "Not disclosed". That's the best thing we can do. So I will have one extra category called "Not disclosed": either the salary is actually there, or the salary is not disclosed. And for the M.Tech, MCA, and similar entries, it is better to categorize them under "Not disclosed" than to treat them as a disclosed salary. So this is what we have. Now, going back to this, here I have written the whole code to clean just the salary column in one place, so that it is less complicated for you, because there are a lot of categories here.
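A rough sketch of the strip, comma removal, and list-based replacement described above follows. The category strings in the list are assumptions based on the examples mentioned; the real column would need its own list built from the unique values.

```python
import pandas as pd

# Hypothetical values left over after splitting on the dash.
min_salary = pd.Series([
    "2,00,000 ",
    " 50,000",
    "Any Graduate",
    "Graduation Not Required",
    "Not disclosed",
])

# Remove surrounding spaces, then drop the thousands separators.
min_salary = min_salary.str.strip().str.replace(",", "", regex=False)

# Assumed list of non-salary categories to fold into one bucket.
non_salary_values = ["Any Graduate", "Graduation Not Required"]
min_salary = min_salary.replace(non_salary_values, "Not disclosed")
```

Series.replace with a list matches whole values only, so numeric strings that merely contain similar text are left untouched.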
So for cleaning the salary column, this is the approach I'm following. First I do the same thing: replacing the PA part, removing the commas, then splitting on the dash and taking the zeroth index. This returns the clean salary. Then I handle the "9.5 Cr" entries and the "less than 50,000" entries; anything apart from those gets processed normally. Then I take the clean salary and, in exactly the same way, I also find the maximum salary. You can either follow the step-by-step approach or use this, which is a cleaner way of writing exactly the same code. If I execute it and show you the whole data frame, this is how it looks now. It has "Not disclosed" for the undisclosed salaries, plus the minimum salary and maximum salary. If I go for df of minimum salary, this is the minimum salary column, and with .unique this is what we have, which is much the same as before. When I analyze the data, I'm going to take only the rows which are not "Not disclosed" and analyze those further. You can use exactly the approach I have tried, or you can use a condition, such as salary not equal to "9.5 Cr and above". You can write a very simple condition that the minimum salary is not equal to "Any Postgraduate" or "Any Graduate", or you can use a regular expression, which handles this quite easily. So this is how we are cleaning this column. "Is it possible to replace all the string values with 'Not disclosed' instead of putting them into a list? Is there any way?" Yes, there is a way. You can use a regular expression that checks whether the value is a number or not. If it is a number, that is fine.
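To illustrate the idea of analyzing only the rows whose salary is disclosed, here is a minimal sketch. The column names min_salary and max_salary and the sample values are assumptions.

```python
import pandas as pd

# Hypothetical cleaned columns; salaries are still stored as text.
df = pd.DataFrame({
    "min_salary": ["200000", "Not disclosed", "400000"],
    "max_salary": ["300000", "Not disclosed", "600000"],
})

# Keep only rows with a disclosed salary, then convert the text to integers.
disclosed = df[df["min_salary"] != "Not disclosed"].copy()
disclosed["min_salary"] = disclosed["min_salary"].astype(int)
disclosed["max_salary"] = disclosed["max_salary"].astype(int)

avg_min_salary = disclosed["min_salary"].mean()
```

The .copy() call avoids modifying a filtered view of the original frame, which is one common source of the warnings mentioned earlier in the session.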
If it is not a number, convert it to "Not disclosed". So you need a for loop that goes over all the values and converts each one into exactly the category you want. For each value you check: if the check returns true, that it is a number, the value stays the same; otherwise it should become "Not disclosed". Or you can use regular expressions, which are a more advanced concept in Python and in data cleaning. We don't have that much time to give to regular expressions, but I'll give you a glimpse when we analyze the data. So for now I have cleaned the salary column as well. The last, or second-last, thing that has to be cleaned is the location column, and here I'm going to make things a lot easier for you. The location column does not need a lot of cleaning. Whatever cleaning it needs we can do in the analysis part, because in one line I can write code that gives you all the locations. For example, I can write value_counts here. Bangalore is the location with the most jobs, and Hyderabad is up there too. But there might be a couple of entries with different spellings of Bangalore.
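Returning to the question about replacing every non-numeric string at once: one possible way to do the number check is sketched below. It uses a vectorized regular-expression match in place of the for loop described in the session, so it is an alternative technique, not the speaker's code, and the sample values are invented.

```python
import pandas as pd

# Hypothetical partly cleaned values.
values = pd.Series(["200000", "Any Graduate", "9.5 Cr and above", "50000"])

# True only when the whole string consists of digits.
is_number = values.str.fullmatch(r"\d+")

# Keep the numeric strings and replace everything else in one step.
cleaned = values.where(is_number, "Not disclosed")
```

This avoids maintaining a hand-written list of categories, at the cost of also discarding values such as "9.5 Cr and above" that might deserve separate handling.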
So in that case, which one should be taken? That's something we need to take care of. There are a couple of others too: for example, Gurugram and Gurgaon, which should be treated as the same place. In exactly the same way, you might have Delhi here and New Delhi there, so these categories should also be taken care of. I'm not cleaning this column for now; I'm going to clean it at run time. When I analyze it, I'm going to analyze it in such a way that separate cleaning will not be needed. So I'm leaving this part out for now. But I am going to clean the reviews column, because that is important. Why is it important? Because if you look closely, I have a question about the top five companies. If you want to find the top five companies, you need to know on what basis: the rating, the reviews, the number of job postings, or perhaps the salary. For that purpose the reviews column should be in integer format, and right now it is not. So we can check the unique values to get a reference. All of them come with the word "Reviews". So the best thing we can do right now is take this reviews column and use .str.replace, or directly split: .str.split, take the zeroth index, which gives this number part, and then convert it to the integer data type. This is how it looks; the column looks promising. That is how easy it is to clean the reviews column. Now comes the data analysis part. We are going to write the code to find the answers to the questions we have, one by one. "Extract using regex?" Yes, extracting using regex is the best thing you can do.
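The reviews cleanup described above could be sketched as follows. The exact text format of the column is not shown in the transcript, so the sample strings (a number followed by the word "Reviews") are an assumption.

```python
import pandas as pd

# Hypothetical raw values: a count followed by the word "Reviews".
reviews = pd.Series(["1234 Reviews", "56 Reviews", "1 Reviews"])

# Split on whitespace, keep the leading number, and convert it to an integer.
review_count = reviews.str.split().str[0].astype(int)
```

If the real column contains missing values or brackets around the text, those would need to be handled before the integer conversion.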
At the end, if we get some time, I'll also show you exactly how to use regular expressions. So now let's write the code to answer those questions. Let me first add some headings here so that I can collapse the sections: project, loading the data, and so on. The first question is to find the total number of jobs. How will you find it? There are multiple ways. Approach one: just check the length of the data frame. Approach two: take the df of job ID column and find the number of unique values in it. Which approach should I use, the length of the data frame or job ID unique? Sorry, it should be nunique, so that I get the answer as an exact number. Now, is there any chance that both numbers are exactly the same? Let's check. The length of the data frame is 72967. And df of job ID nunique is also 72967. They are the same. Why? Because in the data cleaning part, when we dealt with duplicates, we removed all the duplicates based on job ID, so every job ID is unique. So to answer question one you can use either approach. The nunique approach is the better one in general, but the length also works here, because I have already removed the duplicates. So I can just print it for now.
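The comparison between the two counting approaches can be seen on a tiny invented frame; the column name job_id is an assumption.

```python
import pandas as pd

# Hypothetical frame with one duplicated job ID.
df = pd.DataFrame({
    "job_id": [101, 102, 103, 103],
    "company": ["Alpha Ltd", "Beta Corp", "Gamma Inc", "Gamma Inc"],
})

rows_before = len(df)                  # counts every row, duplicates included
unique_jobs = df["job_id"].nunique()   # counts distinct job IDs only

# After removing duplicates on job_id, both approaches agree.
deduped = df.drop_duplicates(subset="job_id")
rows_after = len(deduped)
```

This is why nunique is the safer choice in general: it gives the right answer whether or not the duplicates have already been removed.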
Total number of jobs, and here I'm printing it this way: total unique jobs. So the total unique jobs are this many. That's the answer to the first question. This answer is something you can get very easily if your data is clean. If your data is not clean, you know how much work you need to do. Next: total number of companies. How will you find the total number of companies? I want you to answer in the chat. "What is nunique?" nunique gives you the number of unique elements. Thank you, my m, for giving the answer. So the second question is the total number of companies associated with these jobs. Let's answer it right here and build an overall analysis. The total number of unique jobs is the length of the data frame. Total unique companies: to find the number of unique companies, the code is pretty simple. We go for df of company dot nunique. So the total number of unique companies is this many. You can make the formatting a little better; this is how you can write it, and you can also do something like this to make it a little more fun. So let's go for 35. Great, this is how it looks: total jobs and total companies associated with them. This is the number that I have. Once I have that, you can jump directly to the next question. That's how easy the analysis is now. The next question is the top five companies. What will you do in this case? The top five companies can be found based on different parameters. You can find them based on rating. You can find them based on reviews.
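A sketch of the overall summary described above is given below. The transcript does not show the actual print formatting, so treating 35 as a label width is a guess, and the column names and data are invented.

```python
import pandas as pd

# Hypothetical deduplicated frame.
df = pd.DataFrame({
    "job_id": [101, 102, 103, 104],
    "company": ["Alpha Ltd", "Beta Corp", "Gamma Inc", "Alpha Ltd"],
})

total_jobs = len(df)
total_companies = df["company"].nunique()

# Left-align each label in a 35-character field so the numbers line up.
jobs_line = f"{'Total unique jobs':<35}{total_jobs}"
companies_line = f"{'Total unique companies':<35}{total_companies}"

print(jobs_line)
print(companies_line)
```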
You can find them based on the number of jobs they have posted. You can find them based on the maximum salary they are giving. So there are a lot of factors based on which we can find the top five companies. Right? So rather than answering this question as it is, it would be great if you go back and find a better KPI. Rather than just top five companies, go for a detailed version, like top five companies based on rating, based on reviews, or based on salary, and within salary you can go for minimum salary, maximum salary or average salary. Or go for number of job postings, number of different locations they are hiring for, or the skill set. So this answer can be found based on all these different kinds of factors. Okay. Now what should I do? I can go back and try to answer this question based on all the different parameters, and then I will see which one gives the best answer. So now, in the code part, the task is to find the top five companies. I can take this df, which is the whole data frame that I have. If I want to find the companies with the maximum number of jobs listed, the idea is pretty simple.
I can go for df dot company dot value_counts, and that is it. I've got the answer: the companies, and these are the numbers. If your task is to find just the top five of them, just write dot head. These are the top five companies, not as per salary, but as per number of jobs. Now we can find them based on rating as well. So let's do one thing: I'm going to find the top five companies based on their rating. If I just show you the data frame here, you can see this is Accenture with rating 4.1, again Accenture with rating 4.1, Accenture 4.1, and these are the reviews that I have. So what I can do is sort this data directly. I can go for df dot sort_values; we have a function called sort_values, and here I can pass whichever column I want. For example, I want to sort based on rating, with ascending equal to False. Now my data is sorted based on rating. So these are the companies like bridge river alia campus sunda, all of them with rating five. And you can also go for head, so these are the top five companies. Here, luckily, the top rows are all different companies. Now if you need the same thing based on reviews, you can sort on reviews. In the reviews part all of them are TCS. Why are all of them TCS? Because these are five job postings from the same company, and every TCS row carries the same review count. So what do I need to do? If you have done a little bit of work in Excel or SQL, you might be aware of the term pivot table, and you might be aware of group by. In SQL we have GROUP BY. So I can use a group-by kind of function, which is going to make groups per company and take the maximum of the reviews. Okay. So I can go for df dot groupby; I'm going to make the groups.
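The counting and sorting steps above can be sketched like this. The data and the column names (`company`, `rating`, `reviews`) are invented for illustration; "Startup A" is a hypothetical company.

```python
import pandas as pd

# One row per job posting; company-level rating and reviews repeat on every row.
df = pd.DataFrame({
    "company": ["TCS", "TCS", "TCS", "Accenture", "Accenture", "Startup A"],
    "rating":  [3.9, 3.9, 3.9, 4.1, 4.1, 5.0],
    "reviews": [50000, 50000, 50000, 30000, 30000, 3],
})

# Top companies by number of postings.
top_by_jobs = df["company"].value_counts().head(5)

# Top rows by rating (highest first).
top_by_rating = df.sort_values("rating", ascending=False).head(5)

# Sorting rows by reviews repeats the same company, one row per posting.
top_rows_by_reviews = df.sort_values("reviews", ascending=False).head(3)

print(top_by_jobs)
print(top_by_rating[["company", "rating"]])
print(top_rows_by_reviews[["company", "reviews"]])
```

The last result shows the problem mentioned in the session: sorting rows, rather than companies, fills the top of the list with postings from a single company.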
The groups are going to be made based on company, and once I have the company I'm going to take the reviews. And what am I going to take of the reviews: the maximum, the minimum, or the average for the company? I know that for one company the review count is going to be the same in every row. So whether I take the maximum, minimum or average, the result is going to be the same. Executing it, this is what we have. Now the data is not sorted yet, so I can also write dot sort_values, and at the bottom we have TCS. Let's write ascending equal to False. So these are the top five companies based on their reviews. Now let's cross-verify what happens if I do it based on max. The data is exactly the same. What if I do it based on the average? The data is exactly the same. As I told you, that is because one company has just one single value. So I'm using a groupby function, which takes all the companies and makes an individual group for each company. For example, all the jobs posted by TCS will be in a single table, all the jobs by alliance will again be in a single table, and the same for all the companies. Then I take the reviews column of each individual table and find the minimum value, and the minimum, maximum and average should all be the same in the reviews column, because this is how the data is stored. Okay. So this is the answer based on reviews. The earlier one was also based on reviews, most reviewed, but there I had the issue that all of them were TCS. So this is the better way of writing it. So if you want to find the top five or top 10 companies based on rating, these are the companies I have; based on reviews, these are the companies; and based on value counts, or number of job postings, this is what I have.
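A minimal sketch of the groupby step described above, again on invented data with assumed column names. It also checks the claim that min, max and mean agree when every row of a company carries the same review count.

```python
import pandas as pd

df = pd.DataFrame({
    "company": ["TCS", "TCS", "TCS", "Accenture", "Accenture", "Startup A"],
    "reviews": [50000, 50000, 50000, 30000, 30000, 3],
})

# One value per company, then sort so the most-reviewed company comes first.
reviews_per_company = (
    df.groupby("company")["reviews"]
      .max()
      .sort_values(ascending=False)
)
top5 = reviews_per_company.head(5)
print(top5)

# Cross-check: min, max and mean give the same number per company here.
grouped = df.groupby("company")["reviews"]
same_min_max = bool((grouped.min() == grouped.max()).all())
same_mean_max = bool((grouped.mean() == grouped.max()).all())
print(same_min_max, same_mean_max)
```

If the review count could differ between rows of the same company, the choice of aggregate would matter and should be made deliberately.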
Now tell me which one you think is the most promising. If your task is to find the top five companies, which approach should you use: the first one, based on number of jobs, or the one based on rating, or the one based on reviews? Is it possible to use more than one column? Yes, we can use more than one column. Okay, so a lot of you are saying rating. But look closely. It's great that we have people on both sides, rating versus reviews. If I talk about rating, there are companies with a five rating at the top. That's great. But imagine you are surfing Amazon and you are planning to buy an RC drone, and you have a few options, all at the same price. One of them has a 3.5 rating but 5,000 reviews. Another one has a 4 rating with 100 reviews. Another one has a 4.5 rating with 10 reviews. And another one has a 4.9 rating but just three reviews. So now tell me which one you are going to buy. Usually rating is a good signal, but reviews come with credibility. The more reviews, the more validation you have; a lot of people have bought it. You might find a sweet spot as well: rather than going for this particular drone, I can go for this particular one with 400. But you are not going to buy a 4.1. And imagine there is a five-star drone with just one review. Are you going to buy it? No. It's not about the rating. It's about reviews.
Reviews matter a lot more than rating. Because of that, if you look closely at the result based on rating, I have companies with a five rating but just three reviews. Do you think this is worth it? No. Three reviews are nothing, right? A company's own employees can go and write three reviews. But if I go by number of job postings or, even better, by number of reviews, this gives the best result, because here you can see the credibility: the numbers are way higher. So this is why we are going to use this particular approach. To answer this question I went through different ways, but then I got to know that the best thing we can do is go by reviews. So if the task is to find the top five companies, going by reviews is the best thing you can do. Okay. That was the overall analysis. Now let's go back to question number two, where we need to find the top five companies, and this is 3.2. We need to perform the groupby, so this is the answer to this question. Now just to give you a very basic hint: this particular data is pretty simple. These are just numbers; I don't need to plot them anywhere. But this data can be plotted. And if the task is to get the top five or top 10 of them, instead of just writing head, you can pass 10 inside head. It will give you the top 10.
If you want the top three, this is what you will have. Okay, that's a very basic thing. But imagine I want to find the top five and also plot a graph. Any guesses which graph will work best in this particular case? "Sir, but reviews might be negative or positive." Reviews might be negative or positive; that's an important point. But there are a lot of reviews. You can also take a ratio, or you can apply a formula on both rating and reviews by giving some weightage, so that both factors are taken care of. Okay. So a lot of you are saying a bar graph will work. Yes, a bar graph will give you the best kind of comparison. Now how do you plot a bar graph? You just need to write dot plot and pass kind equal to bar. That is it. It has plotted a bar graph for you. Is this what data visualization is? No, this is just level zero, or level 0.00001, of data visualization. There are a lot of other things you can do with data visualization. But just by writing plot on your data frame, you can plot the data. Now you might say, what if I want a pie chart? Here's a pie chart in front of you. If you want to do it for the first 10 by number of jobs, congratulations, here it is in front of you, the same thing. If you want to go again for bar, this is how the bar graph will look. So you can plot all these graphs in exactly the same way. That's how easy it is to plot graphs. But these graphs are not that great. They are not very appealing to my eyes. They are not showing enough data. You could have annotations. It would be great if we could have the graphs in exactly the same colors as my company's logo. It would be great if I could interact with these graphs.
If we can have a grid, if we can have an annotation on top of each bar, that's something we need to add, and for this part we have libraries like Matplotlib, Seaborn and Plotly. Is it working without importing Matplotlib? Yes, it is working, because the pandas plot function calls Matplotlib internally, so importing pandas is enough as long as Matplotlib is installed. But for full-fledged work you need to import Matplotlib yourself. This is just to show you; as we go ahead we are going to answer all these questions as well. So if the question is to find the top five of them, this is the answer. Eventually, at the end, we are going to build the graphs, the visualization part as well. Okay. Now the first three questions are done. The next question is finding the companies hiring for a data analyst role. How will you answer this question? You need to find how many companies we have hiring for a data analyst role. "What is the duration of the lecture?" It is around 2 hours. One hour is already done, so we'll go about one hour more. By probably 10, we are going to wrap it up. Okay. So, companies hiring for a data analyst role. The idea is pretty simple. We need all the companies which are hiring for data analyst, so the job role should be data analyst. So let's go for job role, double equals, and in pandas things are very simple: inside the job role I want to search for data analyst. Click, okay. And here you can see all the values shown are False. Why are they False? Because only where data analyst is written will the value be True. So you can pass this whole array of True and False to my actual data frame.
It will return only the rows where the value is True. And here in front of you are all the jobs for data analyst. Yes, it is case sensitive. Now if you want to know how many jobs there are for a data analyst, just check the length of it. So here in front of you are one okay 59 jobs for data analyst. Okay. But here we have an issue. Can anyone tell me what the issue is? There is a big issue with this data set, something we dealt with in the last class as well. If you go back and check the job role column, anyone can write anything. There is someone who has written branch banking, calling for women candidates. Now this kind of search is an exact match. So only if someone has written exactly data analyst will you get the row. What if someone has written data analysis? You might be thinking, how can someone be so dumb as to write data analysis as a job role, because data analysis is something that a data analyst does, like web development versus web developer. So let's give it a try: do we have any job role of data analysis? Okay, congratulations, we have COC general insurance hiring for data analysis. Communication skills, Excel, data analysis: these are the things the person should know, that's fine. But we also have people writing data analysis as the role. There might be people who have written it like this. There might be someone who has written data analysis with incorrect spelling, or in some other way. So that is something that needs to be standardized, and ideally it is standardized by the person who is posting the jobs.
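The exact-match filter above can be sketched as follows, on invented data with an assumed `job_role` column. The second filter uses `str.contains`, which was not used in the session; it is shown only as one possible way to catch spelling and case variants.

```python
import pandas as pd

df = pd.DataFrame({
    "company": ["A Ltd", "B Ltd", "C Ltd", "D Ltd", "E Ltd"],  # hypothetical names
    "job_role": ["Data Analyst", "data analyst", "Data Analysis",
                 "Web Developer", "Senior Data Analyst"],
})

# Exact, case-sensitive comparison gives a True/False value per row.
mask = df["job_role"] == "Data Analyst"
exact = df[mask]  # keeps only the rows where the mask is True
print(len(exact), "exact matches")

# Looser alternative (an addition, not the session's code):
# case-insensitive substring search, treating missing values as no match.
loose = df[df["job_role"].str.contains("data analy", case=False, na=False)]
print(len(loose), "loose matches")
```

The gap between the two counts gives a rough idea of how many postings an exact match misses because the role was typed differently.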
Right? So right now there is no way other than going for a lot of if-else conditions, or going for an LLM that can standardize all these job roles, where I pass the job roles one by one and it standardizes each one. For example, if data analyst is the job role, it should be written as data analyst. So that is something that can be done. But for now, how many jobs are for data analyst? The simple answer is this one. Okay. Someone has asked how we can take both columns, rating and reviews, into consideration. So imagine I'm writing a very simple formula. I'm taking the rating and I'm taking the reviews. We call it a weighted sum. For example, I don't want to give that much importance to rating, so I'm giving a weight of 0.01 to my rating. But I'm giving a lot of importance to my reviews, so 0.9 is the weight I'm giving to reviews. Eventually I add both of them, and that's it. It takes rating and multiplies it by this weight, takes reviews and multiplies them by this weight, and you have a new resultant column. If you want to change it to an average or weighted average, you can do that as well. Once that part is done, you can create a new column, df of rating review filter. And if I show you the data set, this is how it looks: it has the new column, rating review filter. Now you can perform groupby based on rating review filter, and these are the companies that I have, with both components, rating as well as reviews. You can change the weights here. If you want to give more importance to rating, just give more weight to rating, and here you can decrease the importance, for example to 50%. So this is called a weighted average or weighted sum.
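A rough sketch of the weighted-sum idea described above. The weights, the data and the column name `rating_review_filter` are illustrative assumptions; the exact weights typed in the session are not clear from the recording.

```python
import pandas as pd

df = pd.DataFrame({
    "company": ["TCS", "TCS", "Accenture", "Startup A"],  # "Startup A" is hypothetical
    "rating":  [3.9, 3.9, 4.1, 5.0],
    "reviews": [50000, 50000, 30000, 3],
})

W_RATING = 0.1   # illustrative weight for rating
W_REVIEWS = 0.9  # illustrative weight for reviews

# New column combining both factors as a weighted sum.
df["rating_review_filter"] = W_RATING * df["rating"] + W_REVIEWS * df["reviews"]

ranked = (
    df.groupby("company")["rating_review_filter"]
      .max()
      .sort_values(ascending=False)
)
print(ranked)
```

Because review counts run into the thousands while ratings stay between 0 and 5, the raw review count dominates this score. Rescaling both columns to a common range before weighting is one option if rating should have a visible effect; that step is a suggestion, not something done in the session.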
Okay, there might be a lot of other use cases where you can apply this. Now coming back to this point, how many companies are hiring for a data analyst? This is the answer. What is the next question? The next question is skills needed for almost all jobs: what are the important skill sets? Rather than going for almost all the jobs, we can go for specific ones as well, but the idea is pretty simple. Let me copy and paste it: skills needed for almost all the jobs. From where exactly can you get the skills? From the responsibilities column. So basically we need to find which responsibilities are most common across jobs, which responsibility has the maximum frequency. Now how will you find this? If you look closely, these are the responsibilities of the first job, these are the responsibilities of the second job, these are the responsibilities of the third job, and so on. And you can see how it's written: all of them are comma separated. So let's do one thing. Let's take the column, dot str, and split it on the comma. So this is what we have: the responsibilities of every job are now split into a list. And now I just need to count how many times customer service appears in the whole data set; that tells me how important customer service is. I need to count how many times sales appears, how many times relationship management appears. So now I'm going to do something very special. I'm going to use the dot explode function. What just happened? If you look closely, I had customer service, then sales, then relationship management and so on, then product management and so on, and in total we have 72,967 rows. So in every row we have a list with multiple elements.
By writing dot explode I am putting all of them into one giant list. So in one giant list I have customer service, then sales, then relationship management and so on, then product management and marketing, and we have a total of about 4 lakh 77 thousand rows. All the individual rows, where I had individual lists, are now stacked into one giant list, the elements of the first list first, then the next ones below. Okay, the idea is pretty simple. Let me show you just dot head of it. Here you can see the first three responsibilities are from the first job, because the index is zero. Then for the next one, 1 2 3 4 5 6 7 8, there are eight different responsibilities for the next job, and then these are the responsibilities for the third job. Now all I need to do is take this column and perform value_counts on it. That's it. This gives you the answer: how many times sales appears in this data, and likewise agency, communication, training. So these are the skills needed for all the jobs. It would be better if the question were to find the top five, top 10 or top 20 skills. If the question is to find the top 10 skills, here are the top 10 skills that will help you get hired. Now you might be thinking: here it's written Sales with a capital letter; someone might have written sales in all caps or all lower case. If you want to deal with that as well, it is pretty simple: str dot lower. I'm lowercasing all the characters so that all this mismatch is handled then and there. So here you can see this is the count for sales.
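The lowercase, split, explode and count pipeline described above can be sketched like this. The data and the `responsibilities` column name are assumptions; the `str.strip()` call is an extra step, not mentioned in the session, that removes stray spaces around each skill.

```python
import pandas as pd

df = pd.DataFrame({
    "responsibilities": [
        "Customer Service,Sales,Relationship Management",
        "Product Management,Sales,Marketing",
        "sales,Communication",
    ]
})

skills = (
    df["responsibilities"]
      .str.lower()     # "Sales" and "sales" become the same value
      .str.split(",")  # each row becomes a list of skills
      .explode()       # one row per skill; the original index is repeated
      .str.strip()     # added here: trim whitespace around each skill
)
print(skills.head())

counts = skills.value_counts()  # frequency of each skill, highest first
print(counts.head(10))
```

The repeated index values in the exploded series (0, 0, 0, 1, 1, 1, ...) show which job each skill came from.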
Earlier, when I was not lowercasing, it was 5 6 23, but after writing dot lower you can see there are almost 1,400 more jobs with sales, written in lower case or in some other way. So you can lowercase first and then count. So these are the skills you need to get a job in general. Now the data might be biased as well. If on Naukri.com there are a lot of jobs for a specific job role, these responsibilities will be biased towards it. For example, at GeeksforGeeks, if most of the jobs are tech roles, then rather than sales the most popular responsibility might be Python or DSA or things like that. So these responsibilities can be biased, but that totally depends on the data. Okay? So, skills needed to get almost all the jobs: these are the important skill sets. Now we also have a question about doing exactly the same thing for a specific company: skills needed to get hired at HDFC Bank. We can answer this pretty easily because now we know the trick. All we need to do is exactly the same thing, but rather than taking the whole data set, take only the rows where company is equal to HDFC Bank. I think the name is written differently in the data. It's written this way. So these are all the jobs of HDFC Bank. Now I need to go to their responsibilities and do the rest in exactly the same way.
So if you want to get hired at HDFC Bank, these are the things you need to know, these are the responsibilities you need to be aware of, and that's it. That's how simple it is, because I've already written the code to find the responsibilities. Can I do exactly the same thing for a role: what responsibilities do I need to have to become a data analyst? Why not? So let's go back, and this time the idea is again exactly the same, but rather than company equal to HDFC Bank, here it would be data analyst, and the column should be job role. So for a data analyst, data analysis is the responsibility you need to have, then SQL, Excel, data analyst, Power BI, Tableau, Python, analytics, data management; these are the things you need to know. And you can do exactly the same thing for any specific job role of your choice. For example, these are the things you need to know to become a data scientist. This is what a data-backed decision is. In exactly the same way you can go for anything of your choice. Okay. "What is it counting?" It is counting the different responsibilities we have. If I go for df of responsibilities, it is first finding how many unique values we have, customer service, sales, relationship management, and then how many times each value appears. Have we sorted them? No, we have not, because value_counts already returns the values in sorted order, so it's totally fine if you don't sort. And now let's go for the last question: average salary for a specific number of years of experience. That can be easily handled. Okay.
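Since the same pipeline is reused for a company and then for a job role, it can be wrapped in a small helper. The function name `top_skills`, the column names and the data below are all hypothetical; this is a sketch of the idea, not the notebook's code.

```python
import pandas as pd

df = pd.DataFrame({
    "company": ["HDFC Bank", "HDFC Bank", "TCS"],
    "job_role": ["Data Analyst", "Relationship Manager", "Data Analyst"],
    "responsibilities": ["SQL,Excel", "Sales,Customer Service", "SQL,Python"],
})

def top_skills(frame, column, value, n=10):
    """Most frequent responsibilities among rows where frame[column] == value."""
    subset = frame[frame[column] == value]  # exact, case-sensitive filter
    return (
        subset["responsibilities"]
          .str.lower()
          .str.split(",")
          .explode()
          .str.strip()
          .value_counts()  # already sorted from most to least frequent
          .head(n)
    )

hdfc_skills = top_skills(df, "company", "HDFC Bank")
analyst_skills = top_skills(df, "job_role", "Data Analyst")
print(hdfc_skills)
print(analyst_skills)
```

Only the filter changes between the two questions; everything after it is identical.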
Let's do one thing. Let's do the visualization part now, and we'll write the code for the last question at the end. Now let me tell you how data visualization can be done. For data visualization you don't need to create a separate new section. Whatever data you have, you can visualize it then and there. For example, for this data, the best kind of graph is dot plot with kind equal to bar. You can make a bar graph. It will work. If the question is to find the top 10 of them, you can go for the top 10. That is pretty simple. In the same way, what is this data? This is just a single number, so it does not require any kind of graph to present it. But for this particular case it is great if you have kind equal to bar, and here it should be dot plot. So this is what we have, again a bar graph. And for skills to get hired at HDFC Bank: dot plot, kind equal to bar. Again a bar graph will work, because this data is friendly to a bar graph. For the skills to become a data analyst, here as well: dot plot, kind equal to bar. But if you want to try something else you can also go for kind equal to pie. You can also go for a donut kind of chart. So this is how we are plotting a graph. Now let's see the importance of some specific libraries. As we discussed, these are the KPIs, the questions, and the same questions are over here as well, because these are the questions that we have answered.
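The quick plots mentioned above can be sketched with the pandas `plot` method. The numbers are made up; the `Agg` backend line is only there so the script also runs outside a notebook and can be dropped in Colab.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented review counts standing in for the groupby result.
top_reviews = pd.Series(
    {"TCS": 50000, "Accenture": 30000, "HDFC Bank": 20000}, name="reviews"
)

fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(10, 4))
top_reviews.plot(kind="bar", ax=ax_bar, title="Reviews (bar)")
top_reviews.plot(kind="pie", ax=ax_pie, title="Reviews (pie)")
fig.tight_layout()
```

Switching between chart types is just a change of the `kind` argument, which is why this works well for a first look before any customization.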
Now in the data visualization part we have tools or libraries that will help you visualize the data. We already have the data, but rather than displaying it as tables I want to display it as graphs, and this is where data visualization is important. And because of that we have libraries like Matplotlib, Seaborn and Plotly that will help you build the graphs. Usually it's not very common for these kinds of questions to be asked in interviews. Okay. "Don't we also do analysis using scikit-learn?" There are some very specific use cases where you can use some functions from scikit-learn, but mostly, for visualization or analysis, these are the libraries that will work. If you want to go to the extreme and find outliers in the data set, or go for correlation kinds of things, then you can use scikit-learn. "The skills needed at HDFC Bank are not the same as the general skills required." Yes, a lot of things are common, but there are a lot of things that are specific to HDFC Bank only, because there I'm analyzing only the data of HDFC Bank. "Can you please explain the explode function again?" Okay. I hope till here everything is fine: I'm just lowercasing the column and splitting it. In the explode part, imagine these are the elements of the first list, these are the elements of the second list, these are the elements of the third list, and so on. So this is the data that we have right now. Rather than keeping the data in this manner, I'm going to do something different. I'm going to take all the data and stack it, one on top of another.
Earlier this was the data I had, in multiple lists. Now I'm going to create one giant list where I add this data, then take this data and add it to the same list. So I have combined the first two rows. In exactly the same way I combine the third row: I take the data of the third row and add it here. Not only that, I take the fourth row, fifth row, sixth row and all the rows and combine them into one giant list. Now I just need to perform value_counts on this list, and it will tell you how many unique values we have and how many times each value appears. The explode function does exactly this. And look closely: the first one is customer service, then we have sales, then relationship management, then product management, then market analysis, and then chain management, agile and so on. If I go for head of 20, you can match them. They are exactly the same. So now the idea is pretty clear. This is how the explode function works. "Is explode in Python?" Yes, but not directly in core Python. It's in pandas; that is the better thing to say. Okay. Now let's see how we can customize these graphs, because the graph that I've just shown you is not a very beautiful graph. Let's be realistic. This graph has some problems. Usually, whenever you are going to make a graph, for example this data set is from Naukri.com, the best thing you can do is go and search for the Naukri.com logo. If you go to images you will get the logo. So you can take the colors from the Naukri.com logo and then try to use the same colors in your code.
That's the best thing you can do. Okay. So look at the graphs we have here. At the left it says reviews. It does not say how much percentage of the area each slice occupies. It is also not showing the numbers. So these things have to be added. Now I'm going to tell you how you can learn more about the libraries that will help you customize this. There are two ways. The first way is to go to the documentation of the library. There you can see which function you need to call to add different colors, to add data labels, to add percentages, to change the font of the percentage, to go for two or three decimal places, to change the font colors. This is one way. For example, I can search for the Matplotlib documentation, and this is where you reach: Matplotlib, stable, documentation. Here you can browse the different kinds of graphs. There is also a Matplotlib cheat sheet that you can take as a reference, which tells you which function to use for which thing. So this is a quick revision, or quick access, to whatever you want to do. You can also go directly to the plot types, which lists the different kinds of plots, and pick a bar graph. So this is the code that will plot data in a bar graph. It imports Matplotlib. It is not using the data that I have; it creates data by itself using NumPy. Then, using ax.bar, it passes the data for the x and y axes, the width of each bar, the color of each bar's edge, and the width of that edge line. And this is where the x-axis limit and y-axis limit are set.
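A minimal sketch in the spirit of the documentation example described above, using `ax.bar` directly and adding the value labels and grid that the quick plots were missing. The data and the colour are placeholders; the hex value is not the actual logo colour.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt

companies = ["TCS", "Accenture", "HDFC Bank"]
review_counts = [50000, 30000, 20000]  # invented numbers
BRAND_BLUE = "#1f5fbf"                 # placeholder colour, assumed

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(companies, review_counts, width=0.6,
              color=BRAND_BLUE, edgecolor="white", linewidth=0.7)

# Write the value on top of each bar.
for bar in bars:
    height = float(bar.get_height())
    ax.annotate(f"{height:,.0f}",
                xy=(bar.get_x() + bar.get_width() / 2, height),
                ha="center", va="bottom")

ax.set(ylim=(0, 60000), ylabel="Reviews", title="Most reviewed companies")
ax.grid(axis="y", alpha=0.3)  # light horizontal grid lines
```

Here `width` controls the bar width, while `edgecolor` and `linewidth` style the outline of each bar.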
And at the end, this is what it displays. So this is the basic thing, the first way of doing it. Now let me tell you the second, smarter way. We have not discussed it so far, but if you look here in Colab you can see Gemini. Gemini is already integrated, and the best part of that integration is that you can write your questions right here. For example, I could write 'give me the code to plot a bar graph', but rather than a bar graph, let's ask for a pie chart with Matplotlib. It returns the code, so let me copy and paste it. Now look at what exactly it has done. Did I tell it which code to start from, or which data to make the graph on? No, I just wrote 'give me the code to plot a pie chart with the Matplotlib library'. That's it. So this Gemini model is intelligent enough to go through all my code and check where I have made a plot, and not just any plot but a pie chart, and it reuses the same code. For example, it takes the exact same group by that I wrote, but instead of calling .plot with kind on it, it stores the result. So this company reviews variable holds that data frame. Then it defines a figure size of 8 by 8, so the graph will be a perfect square. The labels are taken from here; these are the labels. The autopct argument tells it how many decimal places you want. I want one decimal place, so it is set to one. The start angle is 140.
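A pie chart with the arguments described above (figure size, labels, `autopct`, start angle) can be sketched as follows. The review counts are invented for illustration; in the notebook they come from a group by on the real data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up review counts, used only to make the sketch runnable.
company_reviews = pd.Series(
    {"TCS": 50, "Reliance Industries": 20, "Company C": 15, "Company D": 10, "Company E": 5}
)

fig, ax = plt.subplots(figsize=(8, 8))  # square figure
wedges, texts, autotexts = ax.pie(
    company_reviews.values,
    labels=company_reviews.index,  # company names become the labels
    autopct="%1.1f%%",             # percentages with one decimal place
    startangle=140,                # angle at which the first slice starts
)
ax.set_title("Top companies by reviews")
ax.axis("equal")  # keep the pie circular
```

Changing `startangle` only rotates the chart; the slice sizes and order stay the same.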
So basically the first slice starts at 140 degrees. Changing the value just changes the angle of the chart. That's it. Then this is the title, then the axis is set to equal, and then plt.show. So let's see how it looks. Is it better than the earlier one? Yes, a lot better, not only in terms of size; the text is also a lot better. There is definitely an issue that the text is superimposed on the slices. It would be great if the text were colored differently. But this part is fine. Now I can change the angle. If the angle is 90 degrees, this is how it looks. If it is 0 degrees, this is how it looks. So you can change it based on what you want. For example, TCS is number one, so it would be great if TCS were at the top. Not with 90, probably. Yes, it looks great with 120. So if you want TCS on top, this is what you can do. Now if I change the figure size to 4 by 4, the graph is drawn at 4 by 4. I think 6 by 6 would be a good size. So this is what we have so far. Next I want to change the colors and the text, so let's go one by one. As I told you, there are two ways: either you use Gemini or you go back to the documentation. Using the documentation is time consuming; Gemini will do a lot of things for you pretty easily. Instead of Gemini you can also use ChatGPT, but ChatGPT gives you basic code that you need to modify so that it works in your case, whereas here it already works in my case. So now I write 'code to modify the colors of the graph and go for a darker, or basically a blue, shade'. Why a blue shade?
Because I know the logo of Naukri.com is blue. It has given me the code, and most of it is the same. We have the company reviews. There is one extra line that defines the colors, and colors equals colors is the extra argument that is passed. That's fine. So let me replace everything and see how it looks. This is a bit more classy: rather than a lot of different colors, it uses a single color range. It would definitely be better if the annotations on the first two or three slices were in white, so I can modify that. I write 'modify the code to go for white color annotation for largest three values'. You can see it modifying the code. It uses the same company reviews, it defines the colors, and then it sets the text colors. Let me add one more version here and execute it to check how it looks. It works. Everything is exactly the same, except that the first three annotations are white, probably because of this code: it finds the top three and changes their color to white. By default all of them are black. So first define all of them as black, and then change the top three to white. That's it. If you want to modify it further, look at the versions so far: this is level one, that was level two, this is level three, and this is level four. Let's add some more layers. Now I write 'change the font of the whole thing and go for something classy'. You can also name the font you want, and it will check whether that font is available. So: something classy and minimalist.
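One way to get a single blue color range with white annotations on the three largest slices is sketched below. It assumes Matplotlib's built-in "Blues" colormap as a stand-in for the brand color, and the labels and values are made up.

```python
import matplotlib.pyplot as plt
import numpy as np

labels = ["TCS", "Reliance Industries", "Company C", "Company D", "Company E"]
values = [50, 20, 15, 10, 5]  # made-up counts

# Sample the "Blues" colormap from dark (0.9) to light (0.3), one shade per slice.
colors = list(plt.cm.Blues(np.linspace(0.9, 0.3, len(values))))

fig, ax = plt.subplots(figsize=(6, 6))
wedges, texts, autotexts = ax.pie(
    values, labels=labels, colors=colors, autopct="%1.1f%%", startangle=120
)

# Start with black for every annotation, then switch the three largest to white
# so they stay readable on the darker slices.
top_three = set(np.argsort(values)[-3:].tolist())
for i, autotext in enumerate(autotexts):
    autotext.set_color("white" if i in top_three else "black")
```

If the exact logo colors are known, the `colors` list could be replaced with those hex codes instead of colormap samples.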
'Use larger font size in annotation.' We also have the company names. How are the company names written? They are written as labels, so you can write 'company names or labels' in the prompt. Let's see what it does. It writes the code for you in Matplotlib, doing the same thing as before. It finds the colors, this time it uses font size 12, this is the font name it has picked, and this argument adjusts the distance of the annotations from the center. So let me copy it into a new version and see how it looks. Now the annotations sit a little outwards rather than at the center. It is also giving a lot of warnings. You can ask it to disable the warnings and it will do that for you. But the real problem is that the Georgia font is not found, so I cannot use Georgia as the font family here. I need to go with a default font, so let me remove this part, and this part as well, so that the font setting is gone. And this is a somewhat better version. If you want to go further, and if you have already worked with these graphs, you may know that there is an explode option here as well. This is not the explode that we used earlier; it is a different explode. Explode top.
Explode them as per the area they are occupying; rest everything should be the same. You can see that because I'm typing quickly I'm messing up a lot of the spelling, but this is the code it has written. Let me copy and paste it and show you what it does. It's not guaranteed that it will work, but when I execute it, this is what I get: the slices are exploded out from the center. How much they are exploded is totally up to you. If you want to explode just three of them and leave the rest as they are, go back and modify the prompt. This is the text I wrote, so here I add 'explode only first three of them and rest of them should be as it is'. You can see it modifying the code to explode only the first three. It really should say top three rather than first three, but let's see if it can do it. And it does. That's nice. Now all I need to do is disable this line, and then I can ask it to list the different font families, just the ones that are available in Matplotlib. These are some of them. Let's try whether a serif font works here. The serif font is working. Can I use sans serif? Yes, I can. Can I use monospace? Monospace might look better here. I think the issue is that I changed the color to monospace; it should be the font name that is set to monospace. So this is how it looks now. It's totally up to you which font you want to use.
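The explode and font steps above can be combined in one sketch. It assumes the slices are already sorted from largest to smallest, so "first three" and "top three" coincide, and it uses the generic "monospace" family because a named font such as Georgia may not be installed. The data is made up.

```python
import matplotlib.pyplot as plt
from matplotlib import font_manager

labels = ["TCS", "Reliance Industries", "Company C", "Company D", "Company E"]
values = [50, 20, 15, 10, 5]  # made-up counts, sorted largest first

# Names of the fonts Matplotlib can actually find on this machine.
available_fonts = sorted({font.name for font in font_manager.fontManager.ttflist})

# Pull only the first three slices away from the center; leave the rest in place.
explode = [0.08 if i < 3 else 0.0 for i in range(len(values))]

fig, ax = plt.subplots(figsize=(6, 6))
wedges, texts, autotexts = ax.pie(
    values,
    labels=labels,
    explode=explode,
    autopct="%1.1f%%",
    pctdistance=0.75,  # distance of the percentage text from the center
    startangle=120,
    textprops={"fontsize": 12, "fontfamily": "monospace"},
)
```

Checking a font name against `available_fonts` before using it avoids the "font not found" warnings seen in the session.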
The graph looks promising now. I don't think there is any issue with it; the only thing I need to change is the title size. So I can write 'change the title size and make it bigger'. But look, this is the part that displays the title: plt.title, with font size 16. I think if I just change the font size to 25 it will work, and it does. That's great, so I don't even need to go back and ask. It would definitely be good to have some padding here, so: 'add some padding between the title and the graph'. That is important, otherwise the title will be superimposed on the graph. You can also see that whatever code it generates is still based on Georgia. So the best thing you can do is take the final code that you actually executed, paste it in, and write 'modify it to add some padding or space between the title and the graph'. Because I have given it the code, it takes that particular code and modifies it the way I want. It is doing a lot of things on its own, but this is the final graph. If you want to change the font size of the annotations as well, this is the font size; let's go for 20, and this is how it looks. And for the title, top 10 companies, I change it to 30, so it looks a bit bigger. So you can change the sizes as you like. This is the final graph that I have. If you want to add some extra space above the graph, you can also ask it which part of the code adds the padding. This is the part. If I set pad to 30, it adds some extra padding.
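The title size and padding tweaks described above come down to two arguments of `plt.title`. A minimal sketch, with made-up data and the sizes mentioned in the session:

```python
import matplotlib.pyplot as plt

labels = ["TCS", "Reliance Industries", "Company C", "Company D", "Company E"]
values = [50, 20, 15, 10, 5]  # made-up counts

fig, ax = plt.subplots(figsize=(8, 8))
ax.pie(values, labels=labels, autopct="%1.1f%%", textprops={"fontsize": 20})

# fontsize controls the title size; pad is the gap (in points)
# between the title and the top of the axes.
title = plt.title("Top companies by reviews", fontsize=30, pad=40)
```

Raising `pad` moves the title further from the chart, which helps when labels near the top would otherwise collide with it.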
Let's go for 40, and this is the final padding. So this is the first final graph that I have made: top 10 companies by minimum reviews. By minimum reviews? I think I need to change the title. Now look at the versions we have executed. This is what I got in level one, just by writing plot with kind equals pie. It does make sense; it does show the data. But you can see how much better we can make it. And we can still play with the colors and add a lot of other functionality. Right now these graphs are not interactive. So I'm going to copy the whole code and give it to Gemini with 'make it interactive'. The only library I have used so far is Matplotlib; now I ask for an interactive version with Seaborn. So the code I have written is going to be rewritten with Seaborn. It wants to install a new library, mpld3, to make the graph interactive in the notebook, so I would need to install that and do the additional setup. That's fine, let's try it. This is the code it has produced. Okay, I need to modify the prompt and give it again: 'Gemini, this is the final code; modify this code to do the same thing in Seaborn.' This is time consuming for sure, but it is not very technical or analytical. Where you will invest most of your time and most of your brain is the analytical work: analyzing the data and cleaning the data.
Once your data is cleaned and analyzed, you have the data with you, and the rest can be easily automated or done with LLM tools. They can help you a lot, as you can see here. Now it is using the Seaborn library, so I'm just copying and pasting it to see what I get. This is what I have, which is not a lot better than the earlier ones. So let's go for Plotly now. But first, let me give you the structure, the flow of visualization. The first step is to get the analysis done. Once the analysis is done, just show the graph, simply to check which kind of graph suits which type of data, and display it using .plot, the way we plotted it earlier. Once all the graphs are plotted in this basic way, finalize the visualization format. To do that, take any one visual of your choice and select the colors, the theme, the font, and so on. Once you have settled on those, move to more advanced libraries like Seaborn and Plotly to produce the final graphs. So I do not start with Plotly or Seaborn. If you start with Plotly or Seaborn, you will waste a lot of your time. The best approach is to get everything answered first and just plot with .plot, then spend some time perfecting one graph with one specific set of colors, and then mimic the same style in the other graphs. Okay, now I'm writing the prompt.
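The "just plot it first" step in this workflow can be as small as the sketch below, which uses the pandas `.plot` wrapper to compare two chart types side by side before any styling. The counts are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up counts standing in for an already analyzed result.
reviews = pd.Series(
    {"TCS": 50, "Reliance Industries": 20, "Company C": 15, "Company D": 10, "Company E": 5},
    name="reviews",
)

# Quick, unstyled plots to see which chart type suits the data.
fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(10, 4))
reviews.plot(kind="bar", ax=ax_bar, title="Quick bar")
reviews.plot(kind="pie", ax=ax_pie, title="Quick pie")
```

Only after a chart type is chosen does it make sense to spend time on colors, fonts, and the other finishing steps.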
For example: 'do it in Plotly with interactive graphs'. And to give you an idea of which one you should use: level one is Matplotlib, which gives you a decent amount of customization. Level two is Seaborn, which makes graphs look a lot better than plain Matplotlib. And the final level is Plotly, which helps you make the graphs interactive. You can also export the graphs as an HTML file and share it with anyone. There is also a library that helps you build dashboards. I think it's Dash or Dask; I need to check again, so do give it a check. So this is the ideal pipeline. Matplotlib gives you basic visualization, where you can play with colors, transparency, and a lot of other things. Seaborn adds some extra layers of styling. Plotly takes it to the next level and lets you interact with the graphs. Let me copy and paste the code to see what it does. I think I pasted the wrong one. Here it is. The colors are definitely messy, but if you look closely, this graph is interactive. If you don't like this result, you can also go to ChatGPT and ask it to use exactly the same structure. What I'm going to do now is take the final code that I have written and ask Gemini to 'modify this code in Plotly to make it interactive'. I can also say 'make it exactly the same in terms of colors', and you can mention any other requirements too. You can see it is now carrying all the functionality I already have into the Plotly version. Let's copy and paste it to see how it looks. So this is how it looks.
We have TCS, we have Reliance Industries, and all the data is written inside the slices. If you want it written outside, you now know what to do: ask it to modify the code, and it will do it for you. Right now it is using the default blue theme. It would be better to look up the exact color codes, the hex codes, of the company's logo, extract them, and use those. So let's go through all the levels we have. This is the level one graph, made using just .plot. In level two I added some basic functionality such as the title. Then I changed the annotation colors for the top three, and this is how it looks. Then I changed the size of the graph and added some more details, with everything shifted towards the left, something like this, and this one is also fine. Here you can see I exploded it; it's totally up to you whether you go for this one or that one. Then I added some more details, and finally this is the graph that I finalized. And I can also convert it into an interactive version. Why do I love interactive mode? Look closely: I can select any item of my choice and its data is shown. That is what I mean when I say the graphs are interactive. Now suppose I am not going with the interactive version and this static one is what I'm going with. I can copy its code, and this time let's use ChatGPT. So here I have given it the whole code. It is added somewhere else. So here I have given the code.
It is searching the web and analyzing it, but I'm going to stop it, take this code, and give it to ChatGPT. I write 'analyze the structure, color, font, theme, size, and everything about this graph'. Now look closely at what I do next: 'this is the graph I want to make in the exact same format, and add all the detailing to the graph that the code below produces'. Simple, right? I have given it the code that already has the styling, I have told it what I need, and it has to apply the same styling to this other code. Let's see what it does. It is doing a comprehensive analysis: it is a pie chart, this is the structure and layout, this is the color scheme, Blues, and these are the font choices. Then it starts writing the code for the graph. Which code did I paste? The one I pasted is the Seaborn one again. That's totally fine. So let me copy what ChatGPT has written, go back, and try it. The responsibilities column is something we don't have, because it has used an incorrect spelling. So this is the graph it makes. I am not happy with it, because this bar is almost pure white and this one is dark. Usually it should be the opposite. So let's check whether I can change that directly in the code or whether I need to ask ChatGPT. It is good to know how to adjust some of these things yourself; otherwise you will be totally dependent on ChatGPT, and that's the worst position to be in. So in the code you can see the bars, the responsibility counts, the index, the values, and the color palette, which is Blues with the length of responsibilities. Then the x label; these lines print the values, and these rotate the axis labels.
These lines create the bar chart and draw the bars for the responsibilities. These are the values, and the higher the value, the lighter the color. The colors come from this palette, and the number of colors is based on the length of responsibilities. What if I write the negative of that length? Figure, axes, responsibility counts. No, this is a different kind of palette. So maybe a lighter Blues, or I need to change the colors a little. Let's do one thing. In ChatGPT I can attach images as well, so I can take a screenshot of the graph, give it to ChatGPT, and see what it says. It says you cannot analyze files and images for free. Give me a second, let me open the whole thing for you. Now in this ChatGPT things are open, and I'm doing it in temporary mode. I take the screenshot, give it to ChatGPT, and write 'act as a professional data analyst and review this graph'. It gives a detailed review of the graph: it analyzes the responsibility frequencies, and so on. You can also ask it to analyze the graph in terms of UI and UX and which colors should be used, and it will give you that feedback. Once that is done, I give it the code with 'modify this code and update it as per your feedback'. So far I have not even read the feedback; I just want to know whether I can blindly trust it at this point. It is writing the code, so I need to wait. Then I will do a side-by-side comparison and we will wrap up, because you already know how to write the rest of the code. So let me run it just to see what exactly it has made.
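The "higher value, lighter color" problem discussed above can be fixed by reversing the palette. The sketch below is one possible approach, not the code generated in the session: it takes the reversed "Blues_r" palette from Seaborn and draws the bars with Matplotlib, assuming the counts are already sorted from largest to smallest. The responsibility counts are made up.

```python
import matplotlib.pyplot as plt
import seaborn as sns

responsibilities = [
    "customer service", "sales", "relationship management",
    "product management", "market analysis",
]
counts = [120, 95, 60, 40, 25]  # made-up frequencies, sorted largest first

sns.set_theme(style="whitegrid")

# "Blues_r" runs from dark to light, so the largest value gets the darkest bar.
colors = sns.color_palette("Blues_r", n_colors=len(counts))

fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.barh(responsibilities, counts, color=colors)
ax.invert_yaxis()  # largest value at the top

# Annotate every bar with its value.
value_labels = ax.bar_label(bars, padding=3)

ax.set_xlabel("Frequency")
ax.set_title("Top responsibilities by frequency")
```

Swapping "Blues_r" back to "Blues" reproduces the lighter-for-larger ordering that looked wrong in the session.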
Top 10 responsibilities by frequency, and this is what we have. Is it better? Definitely yes, a lot better. Is it obeying the rule I mentioned, that it should use the exact blue color? Not right now. But undoubtedly this is way better than the previous one. So you can see the progression, from this one to this one to this one. That's how interesting it is to use LLMs. And note that here I am not using my analytical skills. Now, to summarize what we have done, let me go back to the drawing. We discussed the flow of a data analysis project again. This time we had the KPIs, so we tried to answer the questions. To answer them, we cleaned the remaining columns, analyzed the data, and started visualizing it. While visualizing, first just get the answers, like this one, without even plotting: we have the data. Then plot it, see which kind of graph works best, and do that for all of them. Once that is done, decide which colors you are going to use and start finalizing. To finalize, start from level one, keep improving the graph, and arrive at a best fit. Once you have the best fit, just change the data, set the colors and the font, make everything uniform, and apply it wherever you want. This is the final graph that I made for the first one, and this is the final graph for this one. You can apply the same styling to the others, so I'm giving this back to ChatGPT: 'do the same for this one'. Let's wait for it.
This is the final code it is writing. It also annotates each and every bar, which is obviously what I want: I want the values to be displayed. So I put the code in a cell, execute it, and this is the second one: top 10 responsibilities at HDFC Bank by frequency. You can change the title, and it can be made a lot better. You can do the same thing for this question: what skill set do you need to become a data analyst? And it is exactly what we did for HDFC Bank. This is the overall data, and you can go for the top 10, the top 20, and so on. So this is the final notebook, and now I'm sharing it. I'll go to YouTube and share the link in the first comment as well. I'm copying it and pasting it here. Actually, the notebook link is already there, so there was no need to add it again. Sorry about that. So this is the final Colab link, and it is open to you. All the updates I have made so far are there. I'm also sharing the finalized diagram again. I think it will generate a new link this time, so I'm copying that: this is the updated drawing link. As I told you, these are the final links, and you can reach me over here. Tomorrow is going to be the last lecture of the series, where we finalize this project. We will add all the remaining graphs, add the documentation, and make it look like a professional end-to-end project. We will upload the project to GitHub, and on GitHub we will add all the remaining parts.
We are going to use LLM tools such as ChatGPT heavily, because otherwise it takes a lot of time to write everything manually. As I've told you, writing everything manually is not a good idea at this point, because a lot of people are already using LLM tools. You need to know the game, that's for sure. But if you know the game and are still not using LLM tools, that is a mistake. 'Share the theory also.' Okay, for the theory, let me add it here. So I've shared this link here as well, and let me share my LinkedIn too. Tomorrow's class will be about finalizing this project end to end. As I've told you, I'm also going to show you the online AI tools that can help you analyze data end to end without writing any code. We'll discuss the pluses and minuses of them and see what exactly can be done. The link is already shared here, but let me share it again. Thank you so much, everyone, for joining. 'Finally able to have dinner after such a long session.' Yes, me too; I'm also going to have dinner now. So thank you so much for joining, and I'll see you tomorrow at the same time. In the meantime, if you have any doubts, copy and paste the code into ChatGPT, ask it to break the code down step by step, and see how well it performs. It will help you with a lot of things. If you still have doubts, the first five minutes of tomorrow's session will be dedicated to them. Thank you so much, everyone. I'll see you in the next class. Take care, everyone. Bye-bye.
Original Description
Register for free to attend more workshops on Full Stack development, Data science, AWS, Devops & DSA: https://gfgcdn.com/tu/UJ6/
📊 Unlock the power of data visualization with Matplotlib and Seaborn in this tutorial! Learn how to explore data trends, visualize correlations, and summarize your data using these two powerful Python libraries.
✅ What You’ll Learn:
How to create a wide range of visualizations with Matplotlib and Seaborn
Exploring trends and patterns in data using line plots, scatter plots, and bar charts
Visualizing correlations between variables with heatmaps and pair plots
Summarizing and interpreting data with various visual techniques
🎓 Level: Intermediate to Advanced
Perfect for: Data scientists, analysts, or anyone looking to enhance their data analysis skills with powerful visualization techniques!
👍 If this tutorial helped you, don’t forget to like, comment, and subscribe for more data analysis and visualization content!
GeeksforGeeks
Live Mock DSA
GeeksforGeeks