Youtube Data Analysis | Ashish Jangra | GeeksforGeeks
Key Takeaways
End-to-end YouTube data analysis process shared by Ashish Jangra
Full Transcript
Hi everyone. Do let me know in the chat if the audio and video seem fine, or if anything is missing. Okay, great, we are live now. Welcome everyone. This is going to be a very, very special stream, because we are about to hit 500,000 subscribers on YouTube. To show you how many subscribers we currently have, I have this with me: here you can see we need 45 more subscribers to reach 500K. There are a lot of announcements and a special message from the man himself, Sandeep sir. After this session ends, from 7 p.m., we are also going to have Sandeep's session on teaching DSA; that is a pre-recorded session, and it will also be shared at the end. So do let me know in the chat if everything seems fine, because we are going to hit this 500,000 subscriber mark live.

Tell me what you want to know this time. This is going to be a different kind of stream. I have some plans, but if you have a better plan to discuss, then obviously I am up for it as well. By default this stream is going to be on the data science part, on data analysis, and since we are hitting 500,000 subscribers on YouTube, we have designed the stream so that we discuss YouTube analysis. We have created a data set for all the videos that we have published so far on the GeeksforGeeks YouTube channel; it's about 2100 videos so far. I'm going to tell you the process of how the data set is created, and after that I'll tell you how to analyze the data set; the data analysis is also going to be done on that. So show me your enthusiasm in the chat and then let's start the session.

Before jumping into anything else, I am sharing my screen with you. Do let me know if the screen is visible to you, and let me tell you the
data set. Do let me know if the screen is visible to you, everyone. Here you can see this is the GeeksforGeeks YouTube data set. You can find the data set link in the description as well; it is already shared. Here you can see what we have in the data set: this data is 1.79 MB, and we have the YouTube link, the thumbnail link, the video duration, and the title of the video. This is the very latest data set, I would say.

Okay, the mic is too low; let me increase it. I'm hoping it is better now. Is it better now? Okay, that's fine, great.

So here you can see we have the data set of 1.79 MB, and in this data set we have the 2091 videos that we have uploaded so far on the GeeksforGeeks main channel. If I go to YouTube slash GeeksforGeeks, or search for, let's suppose, GFG YouTube and go to the first link, you can see this is our channel. We are about to hit the mark, and here you can see a big surprise is waiting for you: we have special gifts, special giveaways, and these giveaways will be announced once we reach 500K. If I share the screen with you again, we are very close; we just need 41 more subscribers and we will reach there. Once we reach there, we are going to have a special surprise for the top 500 students, meaning the first 500 students; we have special offers as well.

Here you can see we have all the videos. This one is going to be the premiere; as I told you, it will be premiered at seven. It is the premium course of Sandeep Jain teaching data structures and algorithms. A two-hour stream will be there, and this is the paid course that we are giving for free, which is two hours, so make sure to watch that stream as well. So here you can see we have the list of videos that we have
uploaded. If you scroll down, we have videos on all the domains, meaning on different kinds of technologies. As I've told you, we have a total of around 2000 videos, and in this particular data set we have the data of all those videos. We have the YouTube link of the video; let's suppose I copy this link, paste it here, and execute it. Here you can see the ad is here; okay, it will take a second. Now it's done. Okay, we'll skip that part as well. Here you can see this is the video that we streamed on 11th of April, how I get hired by GFG portal. If I scroll, how I get hired, this is the video that I was talking about. So this is the first video.

After that, if I go to the next one, here you can see this is the thumbnail link. If I copy this link and paste it here, you can see it opens the thumbnail of the video as well. So this is the video link, and this is the thumbnail link of the video. After that, if I go back, here is the duration of the video, 16 minutes 56 seconds; you can check it on the video as well. After that is the title of the video, then the number of views, then the number of likes. We also have the number of comments, when it was streamed, and here is the description of the video.

This is the level of data we have collected. How we have collected it is something we are going to discuss in some other stream, because that's a totally different part, but that is also going to be a really interesting one. So here you can see I have this data set, and on this data set I am going to perform data analysis. So first of all let me
tell you how we created this data set. To create this data set we used something called a web driver, which is Selenium. Selenium is, I would say, a browser automation tool that helps you scrape data from a website, specifically websites where I cannot directly send a request and get the data. There are a number of websites that load content as you go: if I scroll the screen, new videos come up. If I open, let's suppose, the YouTube link of GeeksforGeeks, by default we will have 30 to 40 videos, and if you scroll down, new videos are added. So directly sending a request and getting the data is not possible.

So what can be another approach? The approach is to use a web driver. How will I use it? I'll open the Selenium web driver, and after opening it I'll scroll down so that on this page I get all the videos that have been uploaded so far. Once all the videos are loaded, I go to the inspect part. For example, if I go to inspect element (I am just telling you the approach), let me inspect it. If I scroll down, you can see that in this div named content we have all the videos. Inside this div I have another tag, this one, which is a default tag in YouTube, and in this tag each and every video is displayed. Let's suppose I want to know about this video: I go to the inspect part of it, and here you can see the link of the video. If I go up and up, here is the data; the id is dismissible. If you don't know how I am working on it, we already have a live stream on it.

Audio is too low? Okay, I'm hoping it is clear, or it is better
now. So here you can see we are going to the inspector and checking the inspect portion, and we can inspect where we have all the videos. Here I have opened it: in this ytd-section-list-renderer we have all the videos. If I scroll down in this section, I have all the videos. So what I will do is first scrape the data of this whole page, and after that scrape only this particular area where I have all the videos.

Once I have the area with all the videos, I will go to each and every video. How will I go to each and every video? For that purpose I discussed the dismissible that we just found, and it is here. You can see we have a div tag with id dismissible, and this is the first video. Then I'll go to the next one, here you can see. Okay, this is the first video, this is the second video, the third video, fourth, fifth, sixth, and the list goes on until we have the list of all the videos.

Once I have this, I will go to each and every module and find the details that I want. For example, in this module you can see that in aria-label I have the title, like DSA Self Paced Course giveaway, and in href we have the link of the video. These are the two things I have. You can also scrape the timing and views directly from here, but once you have the video link, what you can do is scrape all the video links. Once you have all the video links, you can go to each and every video page in auto mode using the Selenium web driver and scrape the data from it. Simple. So this is the process. We are going to do the implementation of it, but not in this video, and now let's analyze it. Okay, what
are the things that we can analyze from it? Let's discuss that now. For that purpose this link is already shared with you. Now I'm going to the home page, and from here I will open a notebook and write the code there. Here you can see I'm opening a new notebook, and this is where I'm going to write the whole code. If you don't know how Kaggle works: Kaggle is a platform for data science, data analysis, or whatever data-related problems we have. On this platform you can have multiple data sets and perform different operations on them. There are a lot of active competitions here as well that you can take part in. Here I can change the title, and I am changing it to GeeksforGeeks YouTube analysis. Okay, so video analysis, that is done.

Once that is done, here you can see that in this link I have dfg.csv, which is the data set. So I have the data set with me, and pandas is something that I have imported earlier. Why did I import it? Because this is the main library that we are going to use. So I am using pd.read_csv to read the CSV file, and here I give the whole path of the CSV. Once the path is given, I save it in a data frame and execute it, and if I write df.head(), you can see the data frame is in front of you.

Okay, is the audio fine now? Do let me know if the audio is clear to you. So if I print df.head(), I get the first few videos that are there in the data set. So what should be the next step? I have the video link, the thumbnail link, the duration, the title, views, number of likes, number of comments, and basically all
the things I have. Okay, so what should be the next step? First of all I'll check whether we have any null values in the data set. So what I'll do is write isnull. Let me zoom the screen a little bit as well; I'm hoping it is fine now. If I write df.head(), this is the data frame, and if I go to a new code cell and write df.isnull(), this gives you a table of True and False. Why True and False? Because at whichever index there is a null value it returns True, and otherwise it returns False. Here you can see I have False because this is not a null value. It is not feasible for me to go manually through each and every cell and check, so I write .sum() on it. Now if I execute it, you can see I have 205 missing thumbnail links; that is possible. After that, 20 titles are missing, 20 views are missing, and 33 videos' comments are missing. Because it is web scraping and we are scraping everything, there are high chances that this happens.

So I know there are a number of null values in the data set, and there are two ways to deal with them: either I can remove the rows where I have null values, or I can fill some data in their place. Here I am just removing them, so let's use dropna. If I write df.dropna(), it removes the rows where I have null values. How many rows did I have earlier? Let's print the data frame: earlier I had 2091 rows, but now how many rows do I have? Only 1854 rows.
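The null check and row removal described above can be sketched roughly as follows. This is a minimal illustration, not the notebook from the stream: the tiny inline data frame stands in for the scraped CSV, and every column name other than `title` and `views` is a hypothetical placeholder.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the scraped CSV (the real file has
# about 2091 rows and 9 columns; these names and values are illustrative).
df = pd.DataFrame({
    "video_link": ["https://example.com/v1", "https://example.com/v2",
                   "https://example.com/v3", "https://example.com/v4"],
    "thumbnail_link": ["https://example.com/t1.jpg", np.nan,
                       "https://example.com/t3.jpg", "https://example.com/t4.jpg"],
    "title": ["Python for Data Science", "C++ STL Basics", np.nan, "Java Streams"],
    "views": ["392", "2,220", "905", np.nan],
})

# isnull() marks each cell True/False; sum() counts the True cells per column.
null_counts = df.isnull().sum()
print(null_counts)

# dropna() returns a new frame without the rows that contain any null value,
# so the result has to be assigned back to keep the change.
rows_before = len(df)
df = df.dropna()
rows_after = len(df)
print(rows_before, "->", rows_after)
```

Filling the gaps instead (for example with `fillna`) is the other option mentioned; dropping is simply the one chosen here.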
So where did the rest of the rows go? We had a number of null values, so I have removed the rows where I had null values, and the number of rows is obviously smaller now. I am saving this data frame with fewer rows into the original data frame, and now if I check the head, the data frame looks exactly the same; no visible changes here, obviously. But the change can be seen when you run isnull again. If I check the sum of it: earlier I had a lot of null values, in thumbnail links, in title, in views, and in a lot of other columns as well, but now I do not have null values, because I have just removed them. Once the null values are removed, my data set is ready to go and I can start working on it.

Now what will I do with the data set? First of all, video links are not that important for me, so let's analyze the title first. For that purpose I am taking the title column. There are a lot of things that can be discussed here, a lot of things that can be found as an output. What will I find? If you want to search for all the videos of Sandeep Jain, or all the videos that we have published so far on Python or data science or any topic of your choice, how can you find all those videos? That is the part we are going to discuss now.

Let me switch to the screen. Here you can see I have the title column of the data frame. In the title, I can write for i in this title, and if I print i, this prints all the titles one by one. That's the first video that we have. Now this is a nine-day-old data set, so obviously we do not have the very latest videos that we have uploaded
last week, but this is the data that we have; here we have all the videos that were uploaded up to that point. Now from these videos I want the specific videos on the topic that I want. So what will I do in that case? I'll take the data frame and go to each and every row and check the title. Let's suppose I want to search for videos on Python: I'll go to each and every video and check whether Python is there in the title or not. If Python is there in the title, I'll print that title. Simple.

So here I am going through each and every title and writing an if condition. Let's keep the loop variable as i, and let's write title equal to, let's suppose, Python, since I want to search for Python. Now I write: if title in i, only then print i. That means if I execute it, it prints all the titles where Python is mentioned. Here you can see Python for Data Science is one video where the title mentions it. So you can directly see the output; all the videos that we have published on Python are in front of you.

Let's count how many videos there are on Python. I'm creating a counter, c equals zero, and every time we find a match I write c plus equals one. At the end, if I print it, there are 111 videos in total whose title has Python. Now one more thing can happen: what if I search python with a small p and execute it? There are zero videos on python in that case. Let's try the same thing on C, or rather, let's see C++. There are a lot of videos on C++; and with a small c, okay, there are zero videos on c plus
plus. So what will happen sometimes: let's suppose you are searching for the videos of ashish, but in the title it's written as Ashish. In that case the two won't match. How can I get rid of this problem? I use .lower(): I lowercase the title from the data, and I lowercase the title that the user has entered as well. So however the student has capitalized Ashish, the comparison is still true. I can apply the same here: I lowercase the row's title and the search title, and compare the two. Here you can see these are all the videos whose title has C++.

Now what should be the next question? You can ask your own questions in the chat as well. And do subscribe to the channel and share this video with your peers; we are about to hit the mark, we need just 35 more subscribers, and once we reach there I'll give you the announcement.

So here you can see we have all the videos whose title has C, C++, or Java, or whatever title of your choice. I'm creating a markdown cell here and writing: finding videos of a specific title. Simple. Now what should be the next thing? I need to find the video links, not just the titles, because otherwise you need to manually go to each and every title, search for it on YouTube, and only then learn from it. What if you could find the video links for a specific title? That would be great, right? So I'll do the same thing: finding the video links for a specific title. How can I do that? The process is almost the same, but there are some tweaks that we need to make. I'll take the same data frame, okay, and in the data frame, so
to overcome this problem we have two ways, and I am going to tell you the first way. This is the data frame that we have, and I am going to convert this data frame to values. Why am I converting it to values? That might be an important question, and I'll give you the answer to it as well. I'm going to give you the naive approach, how you can do that without using any inbuilt function.

Shlok has a question: what's the name of the data set? The name of the data set is GeeksforGeeks YouTube; it's on Kaggle. Taron has a question: how can I search for the video titles C++ and Java together, and says that using a pipe gives that kind of result on the YouTube API. Yes, you can use the YouTube API; as I've told you, there are multiple ways to do the same thing, and I'm telling you my approach. If you want to search for both of the titles, or all three of the titles, you can use an and operation or an or operation in between; right here you can write multiple titles. An and operation is obviously not very useful there, because there are going to be very few videos where you have C++ and Java both; the chances are very low.

So here I have the title and the number of videos displayed. Now I convert to an array: if I write df.values, what I had earlier is now converted to an array, and if I check the shape of the array, it is of shape (1854, 9). Why (1854, 9)? Because earlier I had about 2090 videos, and out of those I removed the rows where I had null values, so I have only about 1850 videos now, and nine because I have nine columns. If I execute it, this is the 2D list I have, okay,
or 2D array. This is something I'm saving in a data variable. In the data, if I go for data[0], this gives you the details of the first video. Just go to the video link and you can check that this is the first video that we have; you can go to the thumbnail of it too. This is the first video in the data set. If I want to go to the next video, here is the next video; let's go to the thumbnail of it, and this is the second video that we have. Let's go to, let's suppose, the 100th video; here's the video link. So here you can see we have the data with us.

Now that the data is clear to you, what should be the next step? I will go through each and every row: for i in data, and I'll print i, and here I'm breaking out of the loop because I don't want to print the details of all the videos, just the details of one video. Here you can see I'm getting a list, and the list has eight or nine elements based on the number of columns we have; I just don't remember.

Now I need to find the video links for a specific title. So where is the title? At the zeroth index I have the video link, at the first index the thumbnail link, at the second index the duration, and at the third index the title of the video. So if I write i[3] and execute it, I get the title of the video; i[3] is nothing but the title of the video. Now I write title equals, let's suppose, Python, the same title as before. Then I write: if title is in the title at that particular index, then I'll print the whole title of the video. Now let's execute it and check. I'm getting the same output here as well, but that time I was taking specifically the title column
directly, but here I am not working on the title column directly; I am taking it index-wise. Once that is done, what should be the next step? Now I want to print the video links. How can I print them? The process will be the same: rather than printing i[3], I'll print i, meaning for whichever video's title has Python, I'll print all the details of that particular video, the YouTube link, the thumbnail link, and so on. Here you can see the data science bootcamp, which we have also done: here is the duration, the title, the number of views, the number of likes, the number of comments, and when it was streamed. All the data is there.

From this list, where can I get the video link? The video link is at the 0th index, so I write i[0], or I can also write title, i[0]. Now if I execute it; okay, it should not be title, it should be i[3], which is the whole title. So here you can see the title of the video and the link of the video, and we have all the videos on Python. If you want to learn, let's suppose, the Python programming tutorial, let's write the whole thing here, Python programming tutorial. Here as well you can lowercase it, and lowercase this title as well, so that the comparison is easier. Here you can see all the videos on Python programming tutorial. Simple. All you need to do is go to the video and click on it; hopefully an ad won't play, but that is not possible. So here the video is in front of you, Python programming tutorial. This is how we can search specifically based on the title.

So what have we done so far? We have the data set, we are loading the data set, we are removing the null values, and we are finding the videos on a specific
title. After that, rather than printing only the title, we are also printing the video links. So what should be the next step, once you have the video links? Now you tell me what you specifically want to know from this data set; do let me know in the chat.

I'm reading the comment from Ashutosh Kumar: can we write conditions like find all the videos with C++ in the title, excluding ones having arrays in the title? Someone is also asking about the announcement. The announcement will be shared once we reach 500K, and these are the number of subscribers we have now; we still need 36 subscribers. So you can share this video with one of your peers, because the announcement, or I would say the scholarship that we are providing, or maybe the test series (there are a lot of things to be discussed), will only be shared once we reach 500K. This is specifically for the first 500 students, not for all the students, so this is going to be a great opportunity for a very large user base. But the announcement will be made once we reach 500K, and we still need 35 subscribers, so share this video with one of your peers. At the end we also have a special message from the man himself, Sandeep sir, about the future plans of GeeksforGeeks, how things are going, and what we have been doing so far. So do share this video, and let's make it 500K as soon as possible so that you get the announcement soon as well.

I just missed the question of Ashutosh: can we write conditions? Yes, we can write the condition-based method as well. Finding all the videos of C++, yes, we can do that as well; first of all we will define the condition. So Ashu is saying, the question is, let's
suppose I want to know about all the titles that have C++, but I want to exclude some titles as well. Let's suppose I want to search for C++ but I don't want the basics or fundamentals titles. So you can add what should be there and what should not be there. It is simple: like we have written here, title in; instead of in you can write not in. So you state what should be in the title and what should not be. Let me share the screen as well.

From here you can see: if title in, so we check whether the original title has this keyword or not, and only if it has this keyword do I print the whole detail. If I want all the videos whose title does not have Python programming tutorial, I use not in, and now you get all those videos. Now suppose you have two keywords: Python should be there in the title, and then there is a second one, let's suppose list. First, if list should also be there, I write that title should be in i[3].lower(), and also that list should be in i[3].lower(). Only if both of them are there in the title does the print execute. Here you can see we have four videos on Python and list: the first video is working with lists in Python, the second one is a Python programming tutorial where you can see list slicing written, next we have converting a list of characters into a string in Python, and then Python programming tutorials again.

If you want to exclude it instead, you write that this should not be in the title. Now if I execute it, you get all the videos where the title does not contain whatever you excluded. So the videos are about Python but not about lists, okay.
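The title search built up over the last few steps (loop over `df.values`, lowercase both sides, require some keywords and exclude others, then print the title with its link) can be sketched as one small helper. This is an illustration under assumptions, not the code typed on stream: the function name `find_videos`, the sample rows, and the column names are hypothetical; only the column order (link at index 0, title at index 3) follows what was described.

```python
import pandas as pd

# Hypothetical sample rows; column order mirrors the description above:
# index 0 = video link, 1 = thumbnail link, 2 = duration, 3 = title.
df = pd.DataFrame({
    "video_link": ["https://example.com/v1", "https://example.com/v2",
                   "https://example.com/v3", "https://example.com/v4"],
    "thumbnail_link": ["t1.jpg", "t2.jpg", "t3.jpg", "t4.jpg"],
    "duration": ["16:56", "08:10", "12:30", "05:45"],
    "title": ["Working with Lists in Python",
              "Python Programming Tutorial | Loops",
              "C++ STL Vectors",
              "python for data science"],
})

LINK_IDX, TITLE_IDX = 0, 3


def find_videos(rows, include, exclude=()):
    """Return (title, link) pairs whose title has every `include` keyword
    and none of the `exclude` keywords, ignoring letter case."""
    matches = []
    for row in rows:
        title = str(row[TITLE_IDX]).lower()
        has_all = all(word.lower() in title for word in include)
        has_excluded = any(word.lower() in title for word in exclude)
        if has_all and not has_excluded:
            matches.append((row[TITLE_IDX], row[LINK_IDX]))
    return matches


data = df.values  # 2D array: one row per video

python_and_list = find_videos(data, include=["python", "list"])
python_not_list = find_videos(data, include=["Python"], exclude=["list"])

for title, link in python_not_list:
    print(title, link)
```

For comparison, pandas also has a vectorized route, for example `df["title"].str.contains("c++", case=False, regex=False)`; passing `regex=False` matters for a search term like C++ because `+` is a regular-expression metacharacter.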
So we have more questions. Share the link of the data set: it is already given in the description of the video, named as data set link. And the next question is: can we go a step further with the number of views, in ascending and descending order? That's also an important part, so let's jump into it.

So now we know how to find the links on a specific topic. Let's suppose I want to know about Selenium; we should have a video on Selenium as well, and here you can see we have one video on Selenium. So this is how we can search on a specific title. Now we are going to work on the views part, or I would say the total views on all the videos: what is the total number of views across all the videos? Let's suppose this is what I want to find. Do add the questions that you want answered in the chat.

To find the number of views of all the videos, I take df['views']; this is the column where I have all the views. I can write .sum(), but .sum() is not working here, most probably because the values are strings by default. If I go for views[0], you can see the value; if I check the length of it, it is 3, and if I check the type of it, the type is string. So by default it is a string, and rather than a string I need to make it an integer. Now things might get a little complicated. Why? Because here we have commas in between. So first of all I need to pre-process this data. I'll take the views, but I need to pre-process them. How will I pre-process them? Let me tell you
that as well. I'll write for i in df of views and print i. If I print i, it prints the views of the first video, the second, the third, and so on up to the last video. What should be the next step? A value like 392 I can convert directly into an integer, but 2,220 will give an error because it has a comma. A string like 2,220 cannot be converted into an integer as it is. So I will check whether there are commas in the views. Let me add one cell here: suppose I have 2,220, in string format. I replace the comma with nothing, and if I execute it, this is what I get, and this I can convert directly into an integer because the comma is gone. So the step is: I write i.replace and replace the comma with nothing. Earlier I had 2,220; now I have 2220. Next, I create a list called views, and for each value in df of views I append the replaced value with views.append. If I execute it and print views, you can see all the views. The output is still a string, so let's make it an integer. Now if I execute it, you can see all the views are in integer format. Simple, right? And one more announcement: we are about to hit 500,000 subscribers. We just need 31 more subscribers.
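The comma clean-up for the views column described above can be sketched in plain Python. The sample values and the helper name `parse_views` are illustrative assumptions; in the stream the same replace-then-convert step is written inline inside a `for` loop over the views column.

```python
def parse_views(raw):
    """Convert a view count such as "2,220" into the integer 2220."""
    # Remove the thousands separators first; int("2,220") would raise ValueError.
    return int(raw.replace(",", ""))


# Illustrative values in the scraped string format.
raw_views = ["392", "2,220", "905"]

# Build the cleaned list, as the stream does with views.append(...).
views = [parse_views(value) for value in raw_views]

# Once the values are integers they can be summed.
total_views = sum(views)

print(views)
print(total_views)
```

With a pandas DataFrame, one way to apply the same helper would be `df["views"] = df["views"].map(parse_views)`, assuming the column holds strings and is named `views`.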
Come on, everyone, just share the videos with your peers. I can show you: we are about to hit it. All we need is just 21 more subscribers and we are good to go. Let me share the screen. Here we have all the views. If I check the type of views, it is a list. Now I will take this list and save it in the original data frame, where I have the title and the other columns. The column should be views, so I take df.views and assign the same views list to it. Once that is done, if I print the data frame, you can see that the views were earlier in the format 392, 2,220, 905, but now they are in exactly the format I want: the views are integers now. We might get some problem with likes as well, because here you can see 1.8K likes, and in the comments I have 133.0, which is not possible, right? But we are clear on the views. So our question was the total views across all the videos. If I take views and write .sum(), this is the total number of views we have on GeeksforGeeks. Let's go to the channel's About page and see whether that is close or not. Here you can see we have about 50.88 million views, but according to this data I have 45,539,373 views. So there is a mismatch of, I would say, 5 million views. This is going to be a pretty interesting question: can anyone tell me in the chat why we are having a difference of
5 million views between what we have calculated and what the channel shows? Why are we off by 5 million views? Let me know in the chat quickly. While you are answering, to recap: these are the total views across all the videos combined. Clear? Now, someone is asking which role is good for freshers, associate software engineer or data analyst. That is quite off topic, but I'll answer it as well. It totally depends on your interest and your skill set. If you are very good at data analysis or data science, you can obviously get a fresher role in that, because you have the practical knowledge. The same goes for software engineering. It is totally up to you and depends on your skill set and what you want to do. Someone is saying the difference is because of the time zone. It is not because of the time zone. It is because of how many videos I have in the original data set. How many videos did I have in the original data set? Let me know in the chat. I had 2011 videos, right? In the original data frame, if I write df.describe(), you can see we had 2,058 videos earlier. But how many do we have now, after executing all the cells so far? We were iterating over df.values, and we missed one part: we are reading the data, describing it, and after that we are removing the null values, df dot is
null. Okay, so here I missed one thing, and that is df = df.dropna(), so I am removing those null values. Once those null values are removed, I am removing the earlier condition here. If I execute it, I have all the titles; I am converting this into a NumPy array, and after that these are the total views we have. Earlier, if I write df.describe(), I had more than 2,000 videos; now I have only 1,854. That means there are almost 200 to 300 videos whose data I don't have, and those videos might have up to 5 million views. That is the reason for the difference between the numbers. Now, df.describe() is in front of you, so how can I sort the values? Let's discuss that. Actually, let's do one thing: let's analyze the likes first. In the likes we have some problem, so let's take comments first, as that is going to be the easier part, and after that we'll discuss likes. So, analyzing comments: we'll go through the comments, find the default data type, and then see if we need to change anything. If I write df.comments, this is what I get: 0.0 comments, 3.0 comments. What we mean is the number of comments is 0, 3, 1, 0, 2, 19, 15, 11, but here we have them in float format. So I will use a for loop, for i in the comments; I'll use the same approach and I'll
print i first. The first thing is to go through the data and see how it is structured. Here you can see the comments in this format, and we have 318 comments here. I want to find out whether there is a comma in any of them; that's the main thing I want to know. So far I cannot see one, so let's convert to integer, execute it, and check the output. If I scroll down, there is no error, which means the comments are now converted into integers. This is what I call data pre-processing: before analyzing the data we pre-process it so that it can be analyzed more easily. Now I write comments as an empty list, and with comments.append I append the data. If I print comments, these are all the comments we have. This list I am going to put in the original data frame. What is the name of the column? Let me add a cell and print the data frame: here you can see I have comments. So to the same column I assign the list: df comments equals the comments that we have. Now if I print the data frame, you can see only likes remain: views are in integer format and comments are also in integer format. Simple, no rocket science. The next step is going to be a really important one: we need to analyze the likes, and that is going to be complicated. So far I am just doing the data pre-processing; once it is done, I'll jump into the data analysis part. So now I'll take likes: df
of likes will give you all the likes, and I'll go for the same approach: for i in df of likes, print i. This is what I get: these are all the likes. Most of them we can convert directly into integers, but there are some videos where I have a K, like 1.2K. So what should be the process? Here I need to write some conditions. i is the like count of that specific video. Before checking anything, let's print the type of i and then break. If I execute it, you can see its class is string, so I need to convert it into a number. Because it is a string, I will first check whether k is in i, and if it is, I will print i. If I execute it, it is not giving any output. Let me see why. Sorry, my bad. If I print i, I have all the values here, and if I print the type of i, it is string by default. So let's print "k in i", because I want to know why it is not working. If I execute it, you can see it is returning False. Okay, let's ignore this method for now, take i directly, and convert it into an integer. You can see it is converting, but once we reach 1.2K it gives a problem. If I print i, this is 1.2K. If I execute this, so this
is K. So the problem here is a very, very silly one: I am searching for k in i, but in the data we have a capital K and I am writing a small k. This is a brilliant example of why I use lower case, as I told you earlier: I wrote a small k, but in the data set I have a capital K. So here I write: if capital K is in i, print it. This returns all the videos where K is written. If K is written, I will slice it. For example, if i is 1.1K, I'll remove the last K using slicing. Before jumping ahead: we are very, very close to the target. I'm really excited; we just need 25 more subscribers. Just hit that subscribe button. Most of you are subscribed already, so share this with your group and your friends, and let's achieve it in the next 10 minutes; let's see if we can do it by 6:10. I am going to make the announcement at that time only, once we reach 500K. So let's jump back to the code. In all these values one pattern repeats: only the last character is K. So I'll remove the last character and get 1.1 as the output. If I write minus 1 in the slice here, all of them now have the K removed. What do we mean by 1.2K? 1.2K means 1.2 multiplied by 1,000, right? 2.1K means 2,100. For that I need to convert it to a float as well, because
otherwise it won't work: this is not the number 1.2, this is "1.2" as a string, as you can see if I check its type. I want to convert it into a float, and once that is done, we have 1.2 as the output. Next I multiply it by 100 to get 1,200 and convert it into an integer. It is 120. Ah, sorry, my bad: I need to multiply it by a thousand. Now if I execute it, you can see the output. Let me convert it into an integer, because the comments cannot be floats; we cannot have 5.5 comments, obviously. Here you can see all the values, but these are specifically the ones where K is written. If K is not there, then what? We might have M as well. Let's find out whether we have some video with a million. So here I check whether M is in the value, and if it is, I print it. Let's execute it, and here you can see we have no videos with a million, but we will get there. We are working really hard on some of the topics; we will definitely be there. So here you can see: if K is written, multiply by a thousand; otherwise the value is taken as it is after converting it into an integer. Now if I execute it, you can see 21, 131, 26, 52, 41. And now we will add the comments. Sorry, it is likes, my bad; these are about likes. So I write likes as an empty list and then likes.append(i). Now if I execute it and look at likes, I have the likes displayed as it
is. That's how we have the output. A viewer is saying: "I think you really need to find a way to retrieve those numbers in a different format, because probably the millions..." Yes, millions can also be there, but I have checked and M is not there. If M is there, then obviously I will add one more condition. It is a very naive approach, I am telling you: if M is there, multiply that number by a million; if K is there, multiply by a thousand; if B is there, multiply by one billion. You only need to write three conditions to retrieve the values, and that is an effort we can make easily, right? Once I have the likes, the next step is to take the data frame, assign the likes to it, and execute. If I print the data frame now, my data is way cleaner than earlier. Why? Because here we have the number of likes in integer format, the number of views in integer format, and the number of comments in integer format. This is the data pre-processing technique we are discussing, and it is a very, very important part, because on the raw data set we cannot perform these operations. How can we find the video with the maximum views if we cannot compare values like 3,000 written with a comma? You cannot, because there is a comma and it is a string. Strings cannot be compared directly as numbers; we need to convert them into some sort of number, integer or float, and only then can we perform operations on them. Simple. So what should be the next step? What do you want me to analyze after this? I am going to jump into the duration part now. This is going to be a
very, very important thing. Okay, so I am goi
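The likes and comments clean-up walked through in the transcript can be sketched as one small helper. This is a sketch under stated assumptions, not the code from the stream: the name `parse_count`, the suffix table and the sample values are illustrative. The K rule follows the stream; the M and B rules are the extension the speaker suggests, even though no M value appeared in this dataset.

```python
# Suffix multipliers: K is used in the stream, M and B are the suggested extension.
SUFFIX_MULTIPLIERS = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(raw):
    """Convert values such as "41", "1.2K" or 3.0 into an integer count."""
    text = str(raw).strip().replace(",", "")
    # Upper-case the last character so both "1.2k" and "1.2K" are handled,
    # which avoids the small-k versus capital-K mistake seen in the stream.
    suffix = text[-1:].upper()
    if suffix in SUFFIX_MULTIPLIERS:
        # Slice off the suffix, convert to float, scale, then round to an int.
        return round(float(text[:-1]) * SUFFIX_MULTIPLIERS[suffix])
    # No suffix: values like "41" or "3.0" are converted as they are.
    return int(float(text))


# Illustrative like counts in the scraped string format.
likes = [parse_count(value) for value in ["21", "131", "1.2K", "2.1K"]]
# Illustrative comment counts stored as floats.
comments = [parse_count(value) for value in [0.0, 3.0, 19.0]]

print(likes)
print(comments)
```

Keeping the multipliers in a dictionary means a new suffix needs one new entry rather than another `if` branch.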
Original Description
This is a special 500k stream where you'll learn the end-to-end process of YouTube analysis.
End-to-end means you'll also learn how to create the dataset of all the videos of a YouTube channel, with fields like video title, thumbnail, video link, likes, views, comments, etc. Once we have the dataset, we'll analyze it.
Our courses : https://practice.geeksforgeeks.org/courses/
Please Like, Comment and Share the Video among your friends.
Install our Android App:
https://play.google.com/store/apps/details?id=free.programming.programming&hl=en
If you wish, translate into the local language and help us reach millions of other geeks:
http://www.youtube.com/timedtext_cs_panel?c=UC0RhatS1pyxInC00YKjjBqQ&tab=2
Follow us on our Social Media Handles -
Twitter- https://twitter.com/geeksforgeeks
LinkedIn- https://www.linkedin.com/company/geeksforgeeks
Facebook- https://www.facebook.com/geeksforgeeks.org
Instagram- https://www.instagram.com/geeks_for_geeks/?hl=en
Reddit- https://www.reddit.com/user/geeksforgeeks
Telegram- https://t.me/s/geeksforgeeks_official
Also, Subscribe if you haven't already! :)
Connect with Ashish -
Github: https://github.com/AshishJangra27
LinkedIn : https://www.linkedin.com/in/ashish-jangra/
Kaggle: https://www.kaggle.com/ashishjangra27
Dataset Link: https://www.kaggle.com/datasets/ashishjangra27/geeksforgeeks-youtube
Web Scraping in Action: https://www.youtube.com/watch?v=f-Z35mTkzWI
#GeeksforGeeks #SpecialGiveaways #Coding