🍔 Live Coding - Food Nutrient Dataset Creation with Python

Rob Mulla · Intermediate · 3y ago

Key Takeaways

The video demonstrates the creation of a food nutrient dataset using Python and Kaggle, covering data scraping, cleaning, and analysis. It utilizes various tools such as Pandas, NumPy, and Matplotlib for data manipulation and visualization.

Full Transcript

no I'm not hello hello hello everyone how's it going it is February 2nd 2023 and we are here to do a live coding stream I hope you're all excited to be with me I hope that my stream is working I can't quite tell let's just double check yeah it looks like everything's up and running so let's go ahead and do it hey we have Skid Row HD gaming welcome to the channel hope you're doing well tonight what's up are you guys ready to do some streaming you guys ready to do some coding let's get to it it's a Thursday night and we're ready to rock hello Roger how are you doing tonight hopefully well cheers to everyone out there so I have a few ideas that we could work on but um I'm gonna let you guys choose so let's see if anyone's in the twitch stream I'm going to switch over here um we have the twitch stream here and I'm gonna go over and create a poll to see what we're gonna be working on tonight what should I work on tonight ice cube kaggle competition um U.S what's this other one called create food data set something else so these are our three choices I'm gonna put this timer on for five minutes um and I'm gonna start this poll so let's go ahead that go to that make this a little bit bigger and you guys can vote on what you want to see hey Mr Gabriel's in the chat um I'm reading a book building a data science applications with fast API very cool very cool um what's that drink by the way so this is a local Brewery made this drink and it's called Uh surrender Dorothy it's the name of the beer it's a really good beer it's one of my favorites not just because it's local to me but just because it's great so the food the food data set just to give you a little bit of a background on it um when I first started streaming a little less than a year ago why is this video all messed up hey welcome everyone into chat uh let's get this normal let's get this a little bit more normal we'll try to be a little bit normal so when I started streaming um a little over a year ago I was going 
for data sets Grand Master and it's still kind of a goal of mine but I did get a little bit um less excited about it after a few months of grinding and not feeling like I got anywhere now pretty cool thing just happened this past week uh if I sort by the most votes I in order to become data sets Grand Master and become a quadruple Grand Master on kaggle I will I would need four no wait five gold and five silver medals I currently have five silver and just recently bumped one up to another gold so I got my third gold in data sets which is from this data set that we created on stream this is um actually we wrote this code I think it's this one no wait this is the code that actually pulled the data set and it pulls from the imf.org website every day impulse exchange rates from around the world so the IMF data set or website is uh I wish I could find where the source of this is uh yeah but this is the code so we actually wrote this code that went and scraped their website and pulls down that data daily and then saves it as a kaggle data set so this has been running every single day updating and I gotta vote on it so now I am I got enough votes on it that now it is gold uh what are people saying I'm reading the future engineering book this week what feature engineering book let me know Ralph uh idea on how to handle unseen categories in inference data you you kind of need to make a unseen feature or you need a lump them all together it's really hard to create a machine learning model that can work on future data that has completely different categories I mean it's impossible to train the model on that data unless you have something similar to it just ignore that feature completely is what I mean that's the best thing I could think of Okay so where is our voting so yeah so what I was going to say is we a while ago started working on this uh uh USDA food data set which basically has all of food that's sold in the U.S and all the calorie information all that stuff I was 
thinking this might be a cool thing to actually make into a kaggle data set and that hence is the uh this one create food data set that I'm proposing here is something that we could work on now the other thing is the ice cube kaggle competition that's gonna be a little bit more of a grind I was looking at it just today and um there hasn't been much movement on the leaderboard so we did look at this competition recently the last stream we did was on this competition so we didn't really do anything other than exploratory data analysis and they actually have as part of this an early sharing prize where they're going to give a five thousand dollar cash price to encourage participants to share information earlier and help make Community make progress submit one or more notebooks before February 2nd at what time is it what time is it UTC so this is passed this is passed submit one notebook by February 2nd and publish them before February 4th so that's the thing we're a few days early on this people are probably going to be sharing things in the coming days um and it I know I thought I saw Tito tweet about this that he's going to be releasing what created hit well there might be people up here releasing what they did to get to this point in the leaderboard so they can get the five thousand dollars what that also means is we could wait and look at this next stream and probably have more to look over instead of trying to grind away on on this stream here although the ice cube kaggle competition did win so do we want to stick with that uh why are you saying wait let's go for food this week yeah I did not realize that there was this delay on the ice cube competition or I actually didn't know anything about the the dates for this uh public sharing so let me close all these Ice Cube overview early sharing prize um submit one or more notebooks by that time has passed and published them before February 4th the delayed publication intended to allow you to participate in early 
sharing prize without being concerned with someone else might resubmit your work that makes sense so a bunch of people are probably submitted their notebooks already they're they got their leaderboard score and they're going to make it public soon and then we'll be able to look at those how often do I stream it really depends I've been um I've been struggling a lot lately lately to keep motivated but I've tried to do it Tuesdays Thursdays and Sundays most weeks I'll try to do at least one maybe two of those days the results was four Ice Cube times three data I always change my vote from food now 3s Cube for oh Mr Gabriel changed his vote okay so I think if if I stream on Sunday which I probably will stream on Sunday we can look at some of the results of this code and try to implement them um one of the cool things okay so we'll look at this uh this competition a little bit before jumping into the food data uh so Gabriel thank you for changing your vote uh that makes things a lot easier but before we jump into this to the food data let's look at at least the best score that was just released it's interesting because it's using something that we had discussed last week in the Stream um let me hmm let me load up what we did last time so add this kaggle ice cube neutrinos we had this Eda PCA by Baseline and we started working on a pi torch one but this is what we looked at so we what I noticed was or what we noticed on the stream and I'm sure other people have noticed that the look at this data set um is that there's a Time component to the data that's coming in that we've gotten so these are all the sensors like if you think of them it's like a 3D representation of every sensor that they have um in their this big Ice device that's measuring neutrinos and when they give us a an event to predict on then they tell us if auxiliary is true if auxiliary is equals true they say that's probably just noise so we could filter out just to the blue dots in this plot and what we 
were seeing was that there are kind of like two or there's kind of one location where most of the events are condensed into or they're they're found around right uh so we're looking at one example I think yeah this is it so if we included these first blue dots these early ones they are kind of not within the line that represents uh the actual result uh that we're trying to predict but if we remove them then we kind of got this this nice path that followed the line uh at least the the ones that fired in that second part uh aligned up with this line better all that's to say that's a long-winded to say way to say uh that this top public notebook now actually honed in on that idea so I think I have it open here twice now uh but what he did so what solver world did was he added to the existing public notebook that just did ordinary least squares in order to find the way the line is fitted he added this time window function at least that's my understanding I don't know the whole history of whose notebook did what parts of this but I believe this is the new part he added and basically what it does is it takes in this time representation of the data like we had up here and it finds a window the best window where are we the window where the most amount of points exist in that window so it's just an optimization function and it looks like his starts in the center he just takes the average minus 1000 to plus one thousand I guess he's only looking around this uh plus or minus one thousand in the center and looking for um the best the best uh set of points with a width of four thousand that would represent you know the that would get the most amount I mean I think there might have been a there might be a different way that we could optimize this to find the best points or maybe he found that this worked the best but it's interesting ly that it kind of is off of what we saw in our Eda last week I mean this is kind of an angle that we were thinking about going down with this data 
set and actually filtering down to the the points that actually matter does that make sense MMD welcome to the thing welcome to the stream the Ice Cube thing is cool thanks squeal you're late you're never late when you're here that's when we start you are the one that makes us on tame time anyway so I think this competition is going to be really interesting um we if you haven't watched this stream that we did last time check that out we talked a little bit about the data set and what it involves and we'll probably take a deeper dive into some of those public notebooks next week once they come out it's 3 A.M for you MMD go to bed take a nap you can watch this later um yeah definitely a very cool competition but we're not going to jump into that tonight because you all voted and re-voted we're going to look at the USDA food data so let's go into the twitch stream projects notebook folder GitHub repo I always hate these IPython notebook checkpoints I have to delete that file uh and then I'm gonna I'm already activated my correct conda environment and let's just start off nice and fresh and clean and look at this USDA data and make a kaggle data set out of it maybe we'll even possibly make it automated kind of like we did with that IMF data set hello joining for the second time curious to see how you do thank you Adrian welcome to the stream cheers to you hope you're all having a good night I hope the sound is okay in the audio's not delayed too much definitely respect for MMD who's hanging out at 3am where are you in the world okay so we found this data set from this food Central data website I'm going to put this into the chat so you guys could see it MMD you want to ask a question ask away let's do it ask away so we're using this website to pull the CSV format of all these historical data tables first are you sharing notebooks second good books third I meant from Portugal nice I love all those things uh number one is I do if I remember if I'm working in a kaggle 
notebook I try to make those public and I share those with you if I'm working locally I try to work in this GitHub repo that I'm going to share with you guys and that repo should have a lot of the stuff that we worked on in the twitch streams so you can see like actually um yeah I don't know this is the very first thing I worked on was this roller coaster data set a year ago and we did this all on live stream um good books uh there are a lot of good books it's just a matter of what you're trying to learn about so I'd say this is a good starting book if you're looking to start is python for data analysis it's completely free um there's an open access HTML version and it'll get you started oh shoot let me make sure I get the free version I'm gonna put this in the chat this is a good book to get started with and your third thing I don't think was a question never used twitch so you're out okay cool um that's just the name of this repo though how to study Cloud for data science it's tough it kind of depends on which Cloud platform you're working with foreign but let's let's look at this food data central website and try to remember what we did before we just did w get so let's make this into a kaggle notebook I actually think we want to make this a kaggle notebook so then we can automate this um and then I definitely will share this so everyone can see it we're gonna we're gonna get a nice emoji of a food let's get a hamburger in there copy this put this into our notebook title so we remember and let's go ahead and do this uh let's create beautiful soup oh yeah and do you guys like the dark theme I think people usually like when I put the Dark theme on even though it's not all right so we're going to scrape usda's Food Central data we're going to use requests in beautiful soup I'm old I prefer light but I'm game for other yeah I think some people watching like they want it to be darker I know that when I'm watching something and my wife's next to me in bed and it gets 
really bright, she's like, "Turn it off." Oh, she doesn't yell at me, but she asks me politely, "Can you turn the light down?" So I'm trying to help all you guys out with that. All right, so how did we do this before? We found all the links. Let's make a function called get_food_links, and we'll put this URL in here, right, and then we'll return the links. get_food_links: let's see if this works. Now we have the links; that should be all of the supporting data, all the data from the USDA. So we're going to go ahead and take this, and we're going to actually make a bunch of... okay, so this is all relative to here. We're going to actually need to download these datasets from the USDA. We've done this before. What's something recently that I've done this with, so I can just copy that code? SEC data. Didn't we do that with SEC data? We kind of did it here, where, yeah, get_links, we have this exact same thing. I wonder if I can just take this and basically fork this notebook, basically copy this. So instead of these links being from the SEC website, it's going to be from this URL. I'll comment this out because I'm not quite sure if that's correct. Base URL: let's make a base URL, which is going to be this, right? Well, I don't know why there are all these weird spaces in here. So this is going to be our base URL plus this, and that's going to give us our zip files. It should give us our zip files, and then we'll see if we can do this step by step. zips is going to be... yes, this is all of the zip files, so we got this correctly. Let's not forget about chat. What are you guys saying? "You have an unzip function, I've seen it on your GitHub." Yeah, I have this. "If you use the Dark Reader extension," that's what I'm using, "set these configs, I think it looks prettier: bright plus 20, contrast minus 20, gray plus 15." Okay, bright plus 20. I love this. Contrast minus 20.
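The link-gathering step described above (a get_food_links helper, a base URL, and a list of zip links) might look roughly like the sketch below. The stream uses requests and Beautiful Soup; this version swaps in the standard library's html.parser and urllib so it runs without extra installs. PAGE_URL and BASE_URL are assumptions, since the exact URLs are never read out on stream, and extract_zip_links is a hypothetical helper name.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

# Assumed values: the stream never spells these URLs out.
PAGE_URL = "https://fdc.nal.usda.gov/download-datasets.html"
BASE_URL = "https://fdc.nal.usda.gov"


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Strip stray whitespace around the link text.
                    self.hrefs.append(value.strip())


def extract_zip_links(html, base_url=BASE_URL):
    """Return absolute URLs for every link in `html` that ends in .zip."""
    parser = _LinkCollector()
    parser.feed(html)
    return [
        urljoin(base_url, href)
        for href in parser.hrefs
        if href.lower().endswith(".zip")
    ]


def get_food_links(page_url=PAGE_URL, base_url=BASE_URL):
    """Fetch the download page and return the absolute zip links on it."""
    with urlopen(page_url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return extract_zip_links(html, base_url)
```

Calling get_food_links() would hit the network; extract_zip_links can be tried on any HTML string without a connection, which makes the relative-link handling easy to check on its own.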
Oh nice, gray plus 15. I like this. Oh, look at this hack. There it is. How does that look, better? "I've never seen web scraping before, used to use twint, uh, sentiments." Yeah, you like this better? I do too. Thank you for the tip. Okay, so this links, it looks like it works, so we can get all the zip files. Is this freezing on me or what? What's the deal? Come on, Kaggle notebook, are you frozen? Don't freeze on me now. Look, it won't even let me scroll. Come on, Kaggle, just when we need you the most. Kaggle, you gotta go, let's go. Sit cross-legged because we know we're serious. We're gonna reset this. "Scrape USDA Central data." Let's add some more info: "In this notebook we will create a pipeline for pulling data from the USDA's website about food," and then we'll say this stuff, this stuffy stuff, "format into folders." There we go. These are our helper functions for doing so. We already have our zips working, so basically this gets a list of all the zip files. Should we do zips.sort()? I guess that doesn't really matter; it'll just pull the supporting data first. Now let's see download_zips. So let's do this again; we're just going to test it out here on the first five. We'll try to download the first five and see how it does. Okay, so tqdm is not defined: from tqdm.auto import tqdm. tqdm gives us our progress bar. urllib is not defined, so let's go to the SEC creation dataset notebook and make sure I have all this stuff imported. Basically we want this, and not what we have imported up here. Let's go ahead and reset this notebook: factory reset. Well, they changed that name; now it's more official. "Hi Rob, how are you doing today?" Swapped, I am doing great. I'm glad I started streaming tonight. I've been trying to get myself motivated because I've been in a little bit of a rut, but now that I've started this I feel a lot better. I feel like we're on to something here tonight. What do you guys think? Random question: "Does GitHub Copilot help with data science?" It's helped me some recently. Any good books
you recommend to learn Python? Yes, I just sent a link to one that I suggested, this Wes McKinney's free book. I'll put it there again for you, my friend. All right, we're trying to download these first five zip files, and then we'll see where we put them locally. So our download_zips basically just takes urllib and then retrieves that zip file, and we have it just putting it in our base folder here. So let's see what this looks like once it's done. It's only taking 20 seconds for five files. What's the length of our zips? 52. So 52 times 30 divided by 60, so it would take 26 minutes, is that right? Oh, because this was five, so it would take five minutes to complete downloading all of them if we did it. But if we do ls -l, this will show us all the files that we downloaded here. We also have this notebook source, but we want to see now, if we try to unzip these files, if it'll work. So it's basically just gonna unzip it here into our local directory. So this code: we do glob on all the local zip files, then we pull out just the file name, which is that last part, and then we use shutil.unpack_archive, which is basically unzip, to unzip these files, and then if delete_zip equals True we'll delete that zip file. We probably should have run a time on that to see how long it will take. Let's see. And I also would like to see how big these files are; I guess I can look at them locally. We could also add a tqdm to here; it might make it nicer in the future so we can see the progress of the unzipping. All right, let's do ls -GFlash on this, which will give us our pretty output. Now we have a folder for each file, so if I do FoodData Central foundation, yeah, look, all of these food files are within each folder. So I'm pretty happy with this. What do you guys think? "Use tqdm on your mom"? What are you talking about? "Have you tried pandas API on Spark?" I haven't, I need to try Spark. Josh said, "What's up, I love your videos, been learning pandas recently, man it's tough." Hey, once
it clicks for you it'll be a lifesaver. "Kaggle Linux based? Is Kaggle Linux based?" Yeah, this is like a Linux environment running Jupyter, is basically what Kaggle notebooks are. Man, I really appreciate the suggestion that you guys gave me to change the dark theme. I like this a lot better. And chat's going a little bit too fast for me, so I hope there's nothing I missed. "It's crazy, right, Python statement, imperative statement in a notebook compared to object-oriented programming." I mean, everything's object oriented in Python. "Hello, what am I doing?" We're scraping data from the FoodData Central dataset. That's a mouthful. But we're about to run this on everything, so this is going to create our final dataset. Basically we want this zips to be everything, so let's debug first: if debug equals True, we'll just make this the first two, just so we know, and then we're gonna ls -GFlash the output. Then we don't need any of this stuff. All right, so let's try this just to make sure it works on two, and then we'll run this and save this for all of them, and I will share this link with you. So surprisingly it seems like unzipping takes longer than it takes to download the files. "What does debug do?" I'm just making a debug thing here so that we can test and make sure that this code runs before running it on all of the files. And you know what it's actually doing? It's probably hanging up because it needs to replace the files that we've already downloaded. I'm guessing that's why it failed. So I'm going to restart the whole session, which will clear all that data that we downloaded into the base directory, and we're going to start from scratch. "No, object-oriented programming is not as declarative. New paradigm to learn." Yeah, I don't know if I'm doing this right, but I'm doing it. We're doing it, people. All right, so this worked, and we have this branded food. Should we also... let's add some logging. Let's add some logging, because I made a short video on logging and
let's just make, like, a very basic logger. Let's set up the basicConfig like this, and then we could say here logging.info and say we're downloading the file name, and then add here logging.info, and let's just add "starting download," right? And then we can put this here, "starting unzip," and then we can say "unzipping," and this is called fn. All right, so this should be a little bit nicer. We'll reset again: factory reset, boom. Oops, "failed to fetch." I'm not sure what that means. "Cannot save." Oh, did I really screw this up? I added a comment. What did your comment say? "Why use logging rather than a print statement?" I don't know, there's a lot of reasons why you'd use a logger. You can set the stream to go to a file and standard out or standard error, or you can set logging levels afterwards. If you're doing multi-threading, print statements don't work very well; it'll work well when you use a logger. It's just a good practice to get into, because you can also add things like the level name, the message itself. What else, other reasons to use a logger? "Logging is used a lot by that guy AI." Oh, I don't know who that is, or maybe I do, I'm just not remembering. Chris: "I've recently gone to your recommendation, love the channel." Thank you, thanks for hanging out with us tonight. Please make sure, if you like the channel, just share stuff with your friends, like the videos, watch them all the way through, hit the like button, write a bunch of comments, everything to help the algorithm out. All right, I'm going to start to just refresh this. It's really not liking me right now. I was mean to it earlier. Yeah, so another thing is, if we had all these logging statements in here and we actually didn't want it to show all the info logging, like those are just too verbose for us, we only have to go here once and change the logging level, and then everything else would, I mean, it would only show at the level that we're logging at. Loguru, is that
better than logger? About this... it really, really doesn't like me right now. "Loguru's Latin for logger." Okay. Brought to you by NordVPN. I'm not sponsored, this is not a sponsored stream. "It's easier to set up the way you want it." Okay, I'm just so used to... oh wait, not Swagger, Loguru. Let's look at this. Oh, I think someone's mentioned this before in stream. from loguru import logger, logger.debug. "Ten times faster than built-in logging," and then they crossed it out. I don't know why they did that. Yeah, this looks pretty handy, like I definitely need to look into this one. Thank you for mentioning Loguru. Oh, so you just do logger.add, a file, give it a format, level, a message. That's pretty cool. It seems similar to the default logger, but a lot of this just seems like a better API, maybe. Okay, where are we on this? Did this run? It did, it ran. Now we're gonna turn... wait, but our logging should have shown, right? Shouldn't we have seen the logging output? Let's rm -r, make sure we removed everything, and then try running this again. Hey, Zees Sean, you didn't need to donate, but I appreciate that, that's really nice of you. "Wanted to say you make very dope and interesting content, keep it up." Thank you very much, I appreciate that. I appreciate you checking out my videos and stuff. "Link the dataset or challenge, please." Sure. I was gonna get this running first, and then I'm gonna save it. I think it's working, so let's see if I can run this, and then I'm going to go to settings here and I'm going to make it public, very, very public, and I'm going to save the changes. I'm going to go to the notebook here, which is currently running, so you guys probably can't see it, but I'm going to put it in the chat so when it is done running... let's run it in an incognito. "Do you think the calculator guy will clone your dataset again?" Oh, maybe the guy who cloned my other dataset, maybe. "I don't want to sound rude, stalker-like, is there any way to chat with me?" MMD,
isn't it like 4 a.m. for you right now? The best way to chat with me is through Discord. Just message in our Discord group. So that doesn't come up, I'm realizing, in the YouTube chat, so that's the link. You can join our Discord and hang out and ask questions there. And then let's look at this version. It's still running, and there are no logs. Wait, are there logs? There should be logs. All right, this reset. Now "loading widget," so it's having trouble with tqdm. I'm not sure what's going on. Like, why is this... to remove all the zip files... could it be that the logger screwed up tqdm, or do I just need to refresh this? "In Kaggle can we import tkinter?" Yeah, you can pip install, like, anything. "tqdm version issue?" Let me print the tqdm version. That's the thing, I imported tqdm from tqdm, so I don't know what version we're running. Let's try debugging. I think it's more probably an issue with the Kaggle notebook right now than anything. So we calculated, we think it'd be about five minutes to download the files and probably longer to actually run them. So another thing I can do is I can actually download this. I can go to File, Download Notebook, boom, go over here to my stream projects, go into our food data. We can remove *.zip files from here, remove those folders, and then we can copy this from our download directory, which is called food data scrape, to here, and then run this locally also, just to make sure it's working. So this is the exact same notebook that we just ran, but a local version, and a lot of times that gets rid of some of the errors that we see with running it in the notebook. "No module named tqdm." Okay, so pip install tqdm. Obviously it's gonna have different versions of some stuff. "Why don't you run the notebook locally in VS Code? Likely faster and more reliable." Recently I have been. It's just more of a feel thing that feels strange to me sometimes, but I have been, to answer your question. IProgress is
not found. Okay, I have this new conda environment I'm working in, so I'm installing everything fresh on purpose. In 2023 it's like, start off the new year with a new you, new conda environment. But that means that I have to install a lot of this stuff from scratch. Okay, so now our logging statements are working, and it's saying that it's downloading all of them. We're going to turn the debug off. And we could see in here, these files aren't that big. I guess we're currently downloading this FoodData Central CSV, this 2021-10-28, and now it's done, so it's about 304 megabytes. If we also go here, they have listed how big each one of them is. So it looks like these branded foods ones are really the big ones, and then the full download of all data types is also big. So we will have this local version that we can work on as well. "Why are you not using a virtual environment?" I like using conda environments instead of virtual environments. I don't know, just personal preference. "For in while loop, explain it a little bit." Okay, so someone did mention using VS Code. I do have VS Code here, and we could open up the notebook in it. Granted, if I'm actually working on a script or something, I will be writing that in VS Code. But yeah, here's the VS Code version. The thing is, the keyboard shortcuts sometimes get stuck for me in this, and also my Python environment needs to be this one. Yeah, it works. It works. "When you create a new environment, do you not use a requirements.txt for all your typical libraries you use?" I actually don't create environments that often, but yeah, if I have a project like a GitHub repo with the requirements.txt, you can just pip install -r. But I purposely wanted to keep this one bare bones and then add on to it. "Hand pink waving." Hey, what's up? "Trusty VS Code." Yeah, so you guys like VS Code better. So now I have this weird situation where we're downloading from one notebook, we're running in VS Code in another one, so I'm like
halfway in between. But let's go ahead and... oh shoot, "Your notebook tried to use more disk space than available." What are you talking about? How much disk space is available in these notebooks? I thought it was a lot. 73 gigs. There's no way this was more than 73 gigs, so I need to figure out what's happening in this notebook. "No space left on device." Something went wrong. It must be, like, infinitely unzipping files or something. I guess we're gonna find out on the local version. All right, so this... hmm, why didn't it delete the zip file? Okay, so it did delete the zip file, and it's 2.7 gigs. Why is it 2.7 gigs? Maybe it's just a lot bigger than I realized, and they compress a lot because the data is so sparse. Should I only be getting these main ones, these FoodData Central full download of all data types? Here's the thing that doesn't make sense to me: does this full download of all data types include everything up here? There's no way it could, because this file is only 304 megabytes and the branded food one for October 2022 is 334.
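One way to check the "they compress a lot" guess without unzipping anything is to read the sizes recorded inside each archive. This is a minimal sketch using the standard library's zipfile module; zip_sizes is a hypothetical helper name and is not something written on stream.

```python
import zipfile


def zip_sizes(zip_path):
    """Return (compressed_bytes, uncompressed_bytes) for one zip archive.

    Reads only the archive's directory of entries, so nothing is extracted.
    """
    with zipfile.ZipFile(zip_path) as zf:
        infos = zf.infolist()
        compressed = sum(info.compress_size for info in infos)
        uncompressed = sum(info.file_size for info in infos)
    return compressed, uncompressed
```

Summing the second value over every downloaded zip gives a rough estimate of how much disk space a full extraction would need, which could be compared against the notebook's disk limit before unpacking.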
So how could the all data types be bigger than... so make sure I'm not down... oh, this has, like, these Access files. Oh shoot, we don't want these Access ones. How big are those? Still not very big. All right, so what I need to do here is two things. I need to stop this. We're going to restart this; we're actually going to shut this down completely so I don't get confused. We're going to shut down our Jupyter kernel. We're going to move completely over to VS Code so I'm only in one spot. Then we are going to go ahead and go here, and we're going to compare some of what we've already downloaded. So I'm going to take 2021-10-28... let's make a directory called test, then I'm going to move 2021*... let's do the latest one, 2022-04, into test. And now if I go into test, these are all the April 2022 files that we have. We have branded food, which is smaller than... this one's actually smaller than the complete one, so I'm hoping this actually has everything in it. Let's unzip the FoodData Central CSV, then go into here, and this could be all we need to download, really. If I do ls | wc -l, that's 39 files-ish. And then let's unzip branded food CSV. Oh, we're also downloading all the JSON versions. There's a lot wrong with what we just did, so we're gonna figure this out. We're gonna figure this out for sure. "By default PyCharm creates a new venv for each project." Yeah, I haven't used PyCharm in a long time. "Do a line count on the extracted file." "You have to change the dtypes, as mentioned in the video, to res..." Yeah, yeah, we're gonna do that for sure. I just want to make sure that in this one base file we have everything we want, like all the branded food. All right, so top here is the branded food unzipped, bottom here is the full food unzipped. So let's head this food.csv, and then head... so these look the same. All right, so in the bottom one we have food.csv, which is 134 megs, and 709 for branded food, so these look like they're the same. Can I
do like a diff on this branded food CSV and our food data central branded food CSV 2022-28 oh we want to if the Branded food with this yes yeah so they're exactly the same they're exactly the same what does that tell us that tells us we need to filter down in here what I am which files I'm actually trying to download because we're going way too far with it we weren't putting those guard rails up that we needed is Panda's really not installed in this environment why is it under squiggly lining my pandas hello from Brazil Pedro what's up what are you working on new to channel we are trying to download data from the USDA food data central this is the data set uh this is a notebook we tried to run but didn't work yet so why don't I run this just on the first two debug equals true and try to save this version um but before I do that let's work locally and try to find out a subset of these zip files that we want to run on so if I look at Zips I think we just want the CSV versions we can use list comprehension in here Z for ZN zips if underscore CSV underscore in z so actually I think they're all going to be this food Central underscored zsv underscore there we go now we only have these files that we're going to actually want to keep hello from Taiwan we got people from Taiwan people from Brazil clever list comprehension yeah for sure so this is actually going to be our Zips that we want subset Zips just to the main data um so we're gonna also go in into here and we're just going to start from scratch just to make sure this is working locally just to make sure this is working locally I'm going to go into here and I'm going to remove start.zip and remove test and remove those checkpoints because those annoy me Levy Nunes welcome to the chat a man all right so how big is this gonna be how long is it going to take these are all things we want to know widget requires us to download supporting from a third-party website enable downloads come on co-pilot so these are all the 
big files and that's why it's taking them a while to download but let's assume that this will work this is a better way for us to do it locally so let's also try to see if we can do it in this notebook and actually get this to work so debug is false we're going to save this version we're going to run it that way you guys can see this version once it's done running hello from Chile nice I love it I think on my twitch stream well okay so you guys are a lot of you guys are on YouTube but on the twitch stream you could see the countries that are represented here so at least on Twitch we got 11 USA two from Japan and two from Canada pretty cool so definitely check us over on Twitch if you haven't already Oh Canada that's right you guys remember when we did that on stream you're an OG if you remember that okay so now this version is actually running we're gonna let this run I'm gonna close this this um yeah we were downloading like all these access files and we're downloading like multiple versions of the same stuff I don't know what fnds is oh these are all like they're really old data sets we'll leave this for another time I don't understand how these are so small so so small all right our local version starting download G Flash we've gotten most away through these and we're also running on kaggle what are we doing we're downloading data from you the FDC FDF FCC you know the Food and Drug central place I'm going to put the note book in here thank you for my videos yeah thanks for hanging out plus one from Brazil on Twitch let's see is Brazil moving up yes Brazil's gone up to five represent Brazil love it I love it all right so this one's running on kaggle ideally we'd have it running on kaggle because that way when they release this every year it looks like looks like twice a year we can have this automatically run every six months and download the latest data and put it all in the same format the next thing as someone else mentioned is we're going to try to grow 
through this data once it's unzipped and actually put it into like a workable format we can also change out a lot of these CSV files for parquet formats and try a handful of other stuff why is this like Ampersand in here let's go Argentina yes can you find who is Russian hey it is it is what it is Brazil is definitely in the house Brazil and Argentina just took over Canada and Japan like it was nothing like it was nothing they were relegated love from Taiwan nice I love I love this Global Community we have going on here oh whoa what is this this community of peeps okay so starting to unzip and then it unzipped them all and then we have this unzip also remove the file so now if we look at this we have all of our files with our data okay so this notebook I think is dunski it's good to go and also let's see what how much disk space it's using now d u Dash H that's 11 gigabytes that's a lot of data if you think about it just being a bunch of food data that seems like a lot to me but it's not 74 so it shouldn't break the kernel The Notebook like it did before hey look this is done version three of three is done running on kaggle so now if you now if we refresh this can someone check this out for me I'm gonna put this hello from India yes what time is it in India you up early are you up late yes you can see it okay so this notebook is public you can see all the code we did to download the data we're going to go to the output and we're gonna actually create a data set from this uh new data set keep data in sync with new notebook versions USDA is it USDA food data central database should we add the burger should we add the burger to the data set name hey freezy just subscribed thank you for subscribing um I usually do this I spin this wheel whenever we get a subscriber spin this wheel and I do whatever it says to do hey we got another Michael thank you you guys don't need to but hey I'm gonna have to scream Kevin here that was for you thank you so much for the subscribing 
and again what's it going to land on tell a dad joke okay so here is here's my dad joke I've been told a dad joke in a while can someone send me a link to a good dad joke where do boats go when they get sick does anyone know where boats go when they get sick to the boat dock that is a rare one I don't usually land on that one but yeah thank you guys for subscribing uh Michael to 20 and freezy all right let's go back to our code I'm gonna rename this to EP ipy notebook and I'm realizing that I got thrown off there we need to finish this so we're making this a data set we're going to create it and then we'll make that public good night guys already 4am here in Morocco whoa thanks for staying up late for me or you might have been staying up late anyways but thanks for hanging out with us late so write that down the best ideas is to say so write that down after any dad joke that's hilarious I like that Sasha that is a plus so write that down that's so dad like okay so I think the data set is being created I hope we can refresh this in a little bit the way it should work is if this notebook ran and I made a data set from it then it should stay in sync with this notebook and I can set up a schedule for to run and we'll have it run like every month let's say how often do you come live I I used to do it three times a week but now it's been like once a week Tuesday Thursdays or Sunday but sometimes more if it if there's all this fun maybe more will the script be available yes I already sent out the link but it's also if you go to kaggle.com it's my account if you just search my name kaggle you go to data sets hey you can upvote a lot of my data sets feel free to do that if you like them but you can also um go to new and actually it wouldn't be in here it would be in code recently run and it's right here okay so I've had this happen before where I try to create a new data set from this but I've already created one it just has a 404 error should I try again maybe just try one 
more time and if it doesn't work create new data set enter data set title USDA food data Central database keep in sync create it's going to error out if it's already creating it something with this name okay so it's actually processing this time that's a good sign hey man come more often this is the kind of live stream all upcoming data scientists need yeah it's fun it's definitely fun we've been doing it for like a year or so no it was loading while after you clicked the button uh so it might still fail let's refresh this yeah that's definitely down uh oh upload fail please try later I don't know why it's saying please try later um let's try food Health Data let's try creating a new name not syncing it maybe it's too many files maybe I need to file a ticket about this not working but if this one fails we'll just go from we'll just keep on working locally so we're gonna make this a notebook file let's just copy most of this in and let's make sure I'm using this latest version wow everything's installed okay so we're gonna do a star on this start at CSV this will show us all of the CSV files I think we want branded files hmm But there are not that many ones that only have branded food in them what what's going on here food input food branded food calorie conversion factor food portion food nutrient source just straight up food Foundation food I would think each folder I would think each folder should have a branded food CSV file but it doesn't everyone have its own food folder no so let's go in and see what's going on here this 2019-1217 does have a food file what's one that doesn't nothing from 2020. 
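To answer "what's one that doesn't" without clicking through folders, the check can be scripted. This is a rough sketch under the assumption that each release was unzipped into its own dated folder under one base directory; the function name and the default file name argument are illustrative, not taken from the notebook.

```python
from pathlib import Path


def folders_missing_file(base_dir, filename="branded_food.csv"):
    """Return names of immediate subfolders that lack `filename` at their top level."""
    base = Path(base_dir)
    return sorted(
        sub.name
        for sub in base.iterdir()
        if sub.is_dir() and not (sub / filename).exists()
    )
```

Any folder this returns either really lacks the file or has it buried one level deeper, which is worth checking next.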
oh it's like a subfolder within that folder why did it do that whoever zipped these files did it differently each year so if I add a star here and then do branded food that's annoying basically I want to take all of these files that are two directories deep and move them up one so how do I do that so move all files two directories deep up one directory so if I do a glob on star slash star slash like this this will give us all the files that are it's almost 5 40 for me here and I will go to sleep and watch the live record yeah you can watch the recording um use from tqdm.contrib.logging import logging_redirect_tqdm when you want to log oh is that what was going wrong MMD why you still up you said it was late for you and you're going to bed I mean I'm not yelling at you I'm just hey man come to your office kind of live streams oh yeah that was from earlier thanks okay so so we can use pathlib we want to use move files I mean I know I could go okay so I think what I've done here is I've created a list of files to move for file in files to move let's see if Copilot can help us up here move the file up one directory let's see what it does here come on give me something Copilot uh this might work so it should take all downloaded table record counts from this directory and have moved it up one directory let's see if that worked it didn't work so it took this and it moved it to the base directory yeah it moved it here that's no bueno it should be in this 2020 10 30 directory there we go so what we wanted this is in split that's this and then the file name zero and we want the new file to be this right want it to be like that so this is and let's see if this works um yeah I had a break here so let's just make sure this last file name that we ran it on worked food nutrient conversion factor yeah now the food nutrient conversion factor is in the right spot so let's go ahead and run this on everything and now let's do this and we see it's all here we could actually remove those
directories but it's fine to have empty directories and now let's now let's look at all branded food so this will be star slash branded_food.csv now we have them all look at that Path dot resolve dot parents up one oh nice so this is the way to use pathlib why isn't this letting me copy copy could have used pathlib something like this but the way we did it worked so now we have all branded we haven't even looked at data so these are all of our branded CSVs see this is the sort of stuff that Copilot doesn't necessarily help that much with it's trying to autocomplete all my stuff into things that don't make sense like making a head statement on that Benjamin what's up welcome to the stream DtypeWarning columns have mixed types so we're going to have to do low_memory equals False what's a food what's a type of food that you guys like type it in chat type the type of food that you like the most I'm talking like pizza soup someone in chat tell me a food tacos here we gotta vote for taco steak spaghetti water water million come on texts spicy someone just said spicy my favorite food is spicy okay let's go ahead and do a low_memory equals False on this Portuguese something of okay first person said Beyond wait Beyond here lies nothing said tacos so we're gonna go with tacos there's not that many columns actually what's the shape of this data so we it's more than one million branded foods and we're only pulling from the first CSV file which is from 2022.
I think that's probably the most recent one Indian curries uh keep the food comments coming I like it brand owner ingredients branded food category let's look at these maybe we could do it in here maybe in the branded food category there's going to be some tacos um so let's just look at the top 20 branded food categories um let's also let's auto format this so it looks a little bit prettier and let's add a title top branded food categories so we do have salsa in here um but let's try to see can we find tacos can we find tacos see this is where I don't like that I can't switch this easily to a markdown cell there we go that works let's see the number of unique ones in here 359 so we should make this astype category let's set this type to be category and let's see if anything contains taco um other thing here is we want to see if it contains taco and fill NA with False so it's not going to contain taco because we need to lower the text because we didn't handle uppercase and lowercase and now we go to branded food category and value counts on this and taco shells is what we have so we do have seven different uh did you use a keyboard shortcut to swap a cell from code to markdown yes I do I have a whole video on this on YouTube so usually I do it oh geez in a normal notebook and if you do Escape M it'll make it a markdown this is markdown if I go from M to Y normally then this is code it's much smoother in the actual notebook oh there's an ignore case nice so you're saying I can do this uh flags re dot IGNORECASE I didn't even need to import re that's weird um so let's do this and let's see DF taco shells let's do a copy of this a lot of Old El Paso here what's the shape of this only seven so it's all Old El Paso it's not going to be that interesting to compare hey bro what's up what's up ultimate saksham hopefully I said that right so maybe let's look at the branded food categories that are a little bit more frequent since the taco
shells aren't really showing up as being that popular let's look at some candy which why isn't cereal in candy bda233 thank you for subscribing we're gonna spin the wheel for you Levy vix DVC maybe data version control scream Kevin again Kevin I did it for you thank you for subscribing you know what dips and salsa would be kind of fun also just looking at sodas that would be fun everyone likes sodas right except for people that don't um it looks like there's a lot of sodas are there other ones that are like drinks fruit and vegetable juice and fruit drinks breads and buns in Texas we called sodas Coke yes and up north or in the Midwest they call it pop right what it says sodas here oh soda soda um let's see the brand name so we should see like Coca-Cola up here at the top okay so it doesn't have Coca-Cola up here well then we need to figure out why we do flags ignore case look Copilot uh automatically did that we're gonna have to fill NA as False uh so Coca-Cola this should be a soda in my opinion am I okay so that's in non-alcoholic beverages ready to drink that should be up here then is that up here does like does every company just get to pick their own branded food category what's going on here let's see if we now if we do this same thing with non-alcoholic beverages ready to drink and see the brand name counts what the what the what looks like it has a weird space in it there we go Polar V8 Welch's Gatorade Coca-Cola these are all the big names that we were expecting to see is Pepsi here Pepsi is here okay so now we got let's instead of saying let's compare sodas let's say let's compare non-alcoholic beverages ready to drink all right but I'm going to call this sodas and I'm going to copy this make sure we have a copy of it um and see how many we have all right so over 5 000 almost 6 000 sodas that's only the top 20 categories yeah yeah this is top 20.
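The move-up-one-directory step can be written with pathlib along the lines chat suggested. What follows is a sketch rather than the exact code from the stream: the function name, the glob pattern default, and the skip-if-exists rule are all assumptions added for illustration.

```python
import shutil
from pathlib import Path


def flatten_one_level(base_dir, pattern="*/*/*.csv"):
    """Move files sitting two folders below base_dir up into the dated folder.

    Example: base/2020-10-30/inner/food.csv -> base/2020-10-30/food.csv
    Returns the list of new paths. Existing targets are left untouched.
    """
    moved = []
    for path in sorted(Path(base_dir).glob(pattern)):
        target = path.parent.parent / path.name  # one directory up
        if target.exists():
            continue  # assumption: never overwrite a file that is already there
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved
```

Building the target from path.parent.parent keeps the file inside its dated folder, which avoids the bug seen on stream where the file landed in the base directory. Once everything is flat, a read such as pd.read_csv(path, low_memory=False) makes pandas infer each column's type from the whole file and avoids the mixed-types warning.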
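The three things that tripped up the category search (letter case, missing values, and the stray space) can be handled in one small helper. This is a sketch assuming pandas is installed; the helper name and the toy rows are invented, and only the branded_food_category column name comes from the file discussed on stream.

```python
import pandas as pd


def rows_matching(df, column, text):
    """Return a copy of the rows whose `column` contains `text`.

    Matching ignores case, treats missing values as non-matches,
    and strips leading or trailing spaces before comparing.
    """
    cleaned = df[column].str.strip()
    mask = cleaned.str.contains(text, case=False, na=False, regex=False)
    return df.loc[mask].copy()


# Toy data for illustration only, not real FoodData Central rows.
example = pd.DataFrame(
    {"branded_food_category": [" Soda", "Taco Shells", None, "soda "]}
)
sodas_example = rows_matching(example, "branded_food_category", "soda")
```

Passing case=False does the same job as the ignore case flag, and na=False replaces the separate fill NA step.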
I was just thinking shouldn't sodas or non-alcoholic beverages ready to drink be in the top 20 because there's so many different of those um let's see if we can find some interesting information like okay serving size uh let's sort values by the serving size let's see what the biggest serving size you can get is ascending equals false we'll look at the top one qtg is 960 milliliters oh that's another thing we should check is what is the serving size unit all right so we have gallons and milliliters we need to take it when it's gallons and convert it to milliliters so they're all in the same uh format convert gallons two milliliters I should we trust copilot on this let's locate where serving I'm getting all screwed up by Copilot let's see where it's gallons it's just a g that value is correct one gallon is that many milliliters okay so let's take every time where this is G which is 95 rows and we're going to take serving size and we're going to multiply it by gal to ml that can't be right because that would be way too many g means means to be gallons or grams oh yeah you're right okay so it's just dividing it by this does that look better still really high right it's correct I don't think it is because then it would be larger than every other serving size we have you know what we're just gonna we're just gonna scrap that for now yeah definitely more reasonable than the gallons one but let's try to let's try to figure this out did they just put the wrong thing in because 240 milliliters would make sense if we look at the serving size for sodas 360 is a more most common and then 240 for milliliters that is and then for this G Unit G Unit key error serving size uh that's right because I'm doing this so this is 243 is the most common one which would seem like it's actually the same unit I think they're just messed up a hundred dollars per year what is co-pilot free for everyone no yeah it it costs money it was free and then they started charging me and I haven't turned 
it off yet I still want to try it out all right so we serving sizes this qtg it has a huge serving size why and what is this beverage FDC ID so we want to pull in see this is the autocomplete that I don't actually like with with Copilot I want to actually autocomplete so let's turn Copilot off disable for Python um survey FNDDS food we want to find what actual FDC ID lines up with so let's actually go here and actually let's go into that directory and I think there's like a PDF file or something that explains it oh I'm in the wrong directory 2022 04 we're going to open this and there are no PDF files in here so we've looked at this before I think to try to get an idea so what I want to do is find FDC ID so this should be a unique identifier for the food in the table yes unique permanent identifier of a food in the food table so do I need to join this with the food table let's see all the places FDC ID this is like our our column that we can join stuff on acquisition date market class treatment and state that's not information I'm interested in branded food this is the file that we're looking in right now the brand owner of the food the list of ingredients the amount of serving all right that's fine food class for internal use only okay that seems like I want to look at it data type description description of the food so this it's in the food just the straight up food file so what was the file name that we pulled in all this from I pulled in this one and now I just want to pull in food probably need low_memory equals False oh no we don't um so now we have all the FDC IDs and then a description of it so if we do sodas merge food how equals left then we're going to validate that it should be a one-to-one mapping I'm pretty sure but we'll find out okay so now this did work just checking to make sure that we have the same number of rows on the left as we did before and after and now we have this description column so this is going to go up here
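The left merge with a one-to-one check might look roughly like this. It assumes pandas is installed and that both tables carry fdc_id and the food table carries description, as described on stream; the helper name is made up.

```python
import pandas as pd


def add_descriptions(branded, food):
    """Left-join food descriptions onto branded rows by fdc_id.

    validate="one_to_one" makes pandas raise MergeError if fdc_id
    is duplicated on either side, instead of silently adding rows.
    """
    return branded.merge(
        food[["fdc_id", "description"]],
        on="fdc_id",
        how="left",
        validate="one_to_one",
    )
```

With how="left" and a passing one-to-one check, the result has exactly as many rows as the left table, which is the before-and-after row count confirmed on stream.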
badowski do you have a license what do you what is bogoski copilot is still pretty amazing to me it is it is nice it is pretty cool certain things that when I want to work fast in a notebook it's like come on copilot just step aside for a second so now if we sort by if we sort by this we have this qtg which is the hugest serving size we've ever seen I want to see what this freaking product is it's called Gatorade Fierce blood what in the world Gatorade fears blood and intense strawberry thirst quencher 32 fluid ounce plastic bottle and the serving size is actually 32 ounces Gatorade what are you doing what are you doing making your serving size I feel like this is an outlier to the extent that it it might not actually be true but it does say 32 fluid ounces um okay so this is just stupid why do they say just plastic bottle that doesn't help us out um so the columns that we're interested in now is let's say this FD CID brand owner brand name okay so the brand owner and brand name are pretty similar package weight modified date available date short description what's a short description just spring water all right so I think we need to go back to the PDF and see what else this FDC ID can be joined on to give us something interesting or can I I think I've done this before we can just take this FDC ID open up a incognito window and search for it so this is not giving us FDC Dash Maybe Gatorade no so I thought maybe if I just search that it would give me the um it would give me like a link to it nope these these are not working so all we know is like this is a bottle of spring water we don't know the cow we want to know like calories nutrients that's the stuff we want to know and then we want to actually link it to the legit name food component gram weight data points Minier acquires that's weird food nutrient all right so let's pull in the nutrients pull in new tree how do you spell nutrients so this should be in food data nutrient.csv is this ID the F DC ID no no no 
this is not it food nutrient and there it looks like there is no file file called food nutrient no there is and there's a conversion factor table is this a big one 1.3 gigs sheesh that's the big one is this gonna load all right what are you guys saying Gatorade is owned by Coca-Cola and Powerade is owned by Pepsi ooh you know what would be really interesting is seeing what company actually owns all these Brands I think that's what this brand owner column is supposed to tell us prism like if Coca-Cola owns it I think the brand owner column should say that but we can look that up what happened to ice data we talked about that earlier imposter engineer um to fill you in they're having like a early sharing competition where people are supposed to share what they've learned just in the first few weeks but they're not releasing that for another few days those probably won't be public for another few days so it seems a little bit strange to actually it seems like we should uh hold off on diving into that until Sunday at least when all those are released because then we can look at those okay I think I see what's going on here I think I see what's going on here all right let's just take something from our sodas that we like something I'm familiar with like query should we do it like this we want the brand owner to be cook Maybe fill in a is false um this is where copilot could be helpful we want it to be Flags [Music] what do we do before Flags equals re ignore case the Coca-Cola Company all right so the Coca-Cola Company Minute Maid sparkling uh Coca-Cola let's actually find where the brand name is Coca-Cola just something we we know this is like the main Coca-Cola F DC ID let's see if we Google this one if anything comes up nope uh we got Coke Zero wait is there Coke zero Coca-Cola Cherry zero okay let's take this one as our example this Coca-Cola Cherry zero with FDIC equals this so we're gonna query the f d c ID equals this which is our Coca-Cola Cherry zero does it 
have seven up I'm sure it does the prophet shot you've been learning stuff from the stream awesome they're from the videos I appreciate that all right so we now have this nutrient ID so this is just one product a very specific product but in this nutrient data set we have the nutrient ID and the amount how is it all zero what nowhere what is this nutrient ID the ID of the nutrient which the food nutrient pertains all the amounts are zero come on Coca-Cola Blue Sky zero cherry vanilla is that gonna give us something jeez let's even go big let's just let's just see if any of the sodas FDC ID uh unique let's go into this and call this soda IDs right let's see if any of these are in soda IDs and yeah okay so now we have some amounts that are not equal to so this is the sodium soda nutrients we'll reset this index drop it and copy this need to figure out why some of these are just straight up zero more decimals Maybe 100 milligram is too little do you think a word cloud of ingredients could be uh yeah potentially let's try to do that ingredients for but I want to find out why the amount could be zero so we don't even know what these nutrient IDs are let's do this soda nutrients nutrient name no nutrient ID value counts so we'll at least find the most common nutrient IDs 1003 but we don't know what the nutrient IDs are because we need to read in we need to read in nutrient CSV that's in nut details so now we should be able to set this index as the ID and take the name hmm unit name why does that have a rank let's just try to merge this onto here soda nutrients soda nutrients merge these nutrients left validate one to one I mean it is cool that copilot will automatically do that stuff for me but should we have a column now that has this name but they're null here oh I know what I did wrong um so we need to do this that's why we shouldn't just trust trust left on nutrient ID and right on it's just going to be straight up ID and then we'll add some suffixes so this is going 
to be underscore nutrient and the left is going to be nothing yep yep yep yep okay so here we go merge not unique in the left data set not many to one so this actually should be a many to one because the nutrients can be repeated in the left side but not in the right uh so let's Group by the name of the nutrient and do the amount uh and sum that let's do that why is my phone buzzing FDC ID isn't unique I think it is oh well it's not going to be unique in the the um nutrient data set correct but and and our food data set above here like this sodas if I look at sodas and I do FDC ID value counts on this they're all one if I do it on our soda nutrients it's repeated depending on the mount uh how do I how do I say this depending on the number of nutrients in that food so I think what I could do is soda nutrients Group by I kind of want to merge this the data information too the food on F DC ID many to one again this should give us the name of the food there we go now if we Group by the name I think we're going to do something cool here and then we'll um then we'll take a little break but here we go the name is the name of the nutrients so we kind of want to group by the F dcid so it's a unique thing take the name and then take the amount and like do a sum and then do like unstack or something fill in a as zero so what is this doing this gives us each unique unit each unique soda in our data set here is on the left side and now we have a column for each nutrient that's available now not many of them don't have anything for the nutrients so keep that in mind so this is like a pivot pivoted version of it and what can we do with this if we sum and sort values here this tells us wait energy is the top what the how is energy how can I highlight an outlier using confident interval both lower limit and upper limit use Seaborn Raj good night MMD thanks for hanging out that's like the fifth time I've said that tonight please sleep it's really important for your health why is 
energy what is energy is energy a nutrient energy equals calories oh so when they list out on the food when they list out the number of calories that includes all these other nutrients right so it's kind of like a combination of everything but maybe that's a good place to look great call there though uh maybe this is where this Food calorie conversion factor needs to come in the multiplication factors to be used when calculating energy from macronutrients for specific food yeah so if we assume the energy is calories yeah this is vs code uh shoot boy so if that's the case we basically we could just filter down to the nutrient the soda nutrients not use the pivot data table where query name equals energy and sort values by energy um no by amount all right so the name of this shouldn't it have the food description merged here oh okay okay um maybe suffixes so we're merging here on the food data set which has the description of the food name like the this description but that's not what we're seeing here we're just seeing plastic bottle aluminum cam V8 beverage vegetable yeah so it's saying that these the really high calorie counts are these just aluminum cans which doesn't make any sense aluminum bottles plastic bottles we also want to we kind of want to merge on the brand name we want to merge on the uh what's it called sodas probably should have just done this to start with just want to figure out what are the most unhealthy drinks that's what we're trying to figure out we're trying to get to the bottom of this this is investigative data science right now trying to expose all of the bad unhealthy companies I'm just kidding we're just having fun you need a query for cholesterol in sodas is there a lot of cholesterol in sodas potentially all right so now we have a mount and we have brand see what columns we have data type branded yeah let's look at that wait they all should be these sodas branded food brand new food category not a significant source of brand name sub 
brand name brand owner these are the things we're trying to look for uh polar beverages is just up here what is polar beverages polar beverages these guys they're just I think they're bad at data entry is what they are because there's no way that their energy for this plastic bottle is like more than eating a Chipotle burrito big soda is bad but sometimes is ice cold Dr Pepper my guilty pleasure I'm with you I mean I drink beers too and it's less healthy so what is this what is going on here let's Let's ignore all these polar all right so then the top thing that comes up is Coca-Cola Cherry 7.5 fluid ounce this makes a little bit more sense but there's no way it's this many calories right um so it's saying that these are the Coca-Cola Cherry Coke ounces 7.5 ounce cans 1333 it's KCAL um let's look at the columns is there a nutrient unit name it is KCAL uh so divide by that oh divide by okay so so we're just saying it's 1.3 calories then oh wait 133 calories right something about the Cherry Coke though who'd have thought I always get the cherry flavor thinking it's tastes so much better but who'da thought that those are gonna be at the top we also have to make sure it's the same size um right if the serving size isn't the same then we're not comparing Apples to Apples especially for drinks I mean I guess you could say well why is the serving size that big to begin with 130 calories for one drink actually isn't that bad we need like a normalized maybe a future thing for this project is to like normalize oh shoot my kernel timed out to normalize it by the size so I bet one thing we could also look up is we can validate some of our pre-conceived Notions about the data by doing something like this what is the most unhealthy soda Fanta pina colada Fanta pina colada that's just number 29. 
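Normalizing by size could start with something like the sketch below. It rests on an assumption the stream never confirms: that each energy amount is kcal per 100 ml and that serving_size is in milliliters. The column names, the helper name, and the example numbers are placeholders rather than values from the data set.

```python
import pandas as pd


def energy_per_serving(df, amount_col="amount", serving_col="serving_size"):
    """Add a kcal_per_serving column and sort by it, highest first.

    Assumption: `amount_col` is kcal per 100 ml and `serving_col` is ml.
    If the amounts are already per serving, this scaling is wrong.
    """
    out = df.copy()
    out["kcal_per_serving"] = out[amount_col] * out[serving_col] / 100.0
    return out.sort_values("kcal_per_serving", ascending=False)


# Made-up numbers for illustration only.
drinks = pd.DataFrame(
    {
        "description": ["drink a", "drink b"],
        "amount": [40.0, 50.0],
        "serving_size": [355.0, 240.0],
    }
)
ranked = energy_per_serving(drinks)
```

Ranking on the raw amount column and on kcal_per_serving side by side would show how much of the "most unhealthy soda" list is an effect of serving size alone.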
Hey Cersei, thank you so much. Um, Cersei, you're the best. Thank you for subscribing with Prime. It's great: take that money out of Jeff Bezos's pocket and give it to me. "I think the main part is that healthy soda are carcinogens, such as fake sugar." Yeah. So it's landed on "sigh into mic", and I'm going to do that for you. Ah. Nice little sigh, there we go. Did our data set creation work? It did not. We might need to filter down those data sets before we create the data set. Um, but let's try to see if we can look up that one that we think is high. Sorry, let's just show five of these. Let's go to this "worst sodas". I don't want to be part of your newsletter. No, stop it. "Turn the ad blocker off", will that help? Oh no, jeez, the site really hates me. Uh, hey mgx, welcome back to the stream, hope you're doing well. Thanks for subscribing. Let's go ahead and spin that wheel. Let's spin the wheel for you. Uh, it's spinning, it's spinning, I promise. Third time's the charm. Come on, it's for you, mgx. Um, so Mountain Dew is supposed to be bad, regular Coke, Pepsi Zero Sugar... oh, okay, the thing is, we're looking at the highest calorie soda. Rockstar Energy Drink has 260. Is Rockstar in this data set? Did we completely pass over Rockstar? Um, we need to... soda nutrients, uh, query, no, description contains... do we have anything relating to Rockstar? Ignore case. So we do have some Rockstar drinks. Let's see what they say the calories are in those and then we'll be done. Let's wrap this up, it's too much time looking at sodas. One too many brackets. All right, Rockstar Pure Zero Silver Ice energy drinks, amount, serving size. So this just says 75 is the amount of energy, so we're doing something wrong. "Hello, how are you doing?" I'm doing good, how are you doing? Guys, we've been streaming for a while now. "What's the shortcut for auto lint Rob just used?" Oh, you would need to look up the lab_black or nb_black extension. I'll show you here, this is it. You add this to your Jupyter cell and it'll automatically run the Black
auto formatting every time you execute a cell. It doesn't work well when you have, like, a lot of, uh, processing that's going to run when you run the cell and then you want to edit it as it's running, because it'll auto replace what you wrote. "Milligram is .075 KCAL." Yeah, I don't know. We gotta learn more about food. So let me just put some links. "What's a good Linux IDE for data science?" I have a whole video on that, so actually, exclamation point YouTube, www.youtube.com, let me just type it in here, youtube.com Rob Mulla. This should bring you to my site, and I have a whole video on my setup, um, which I run Linux on. So VS Code is pretty good. If you click that link you'll come here. Um, Vim's great too. Exclamation point Discord will bring you here to my Discord. Um, so my Twitch stream, if you're not watching on Twitch you should follow me over there, which is here. By the way, we have 13 people from the US. Brazil, you got overtaken by Canada here. There were a lot of Brazilians in at one point but they're all gone. Uh, "Vim ... Emacs ... both are awesome." Yeah, well, you've got to just choose one though. What else do I need to link to? Discord. Oh, you can look at me on Twitter, Rob underscore Mulla I believe is mine. Make sure that's right. Yeah, this is me just chilling. And that's it. So thanks, everyone, for hanging out tonight. I'm going to go ahead and end the stream here. Any other questions before we end? I want to make this into an actual Kaggle data set, because I actually am getting kind of close to becoming a Kaggle Datasets Grandmaster, for whatever that's worth. Um, but yeah, it's been fun hanging out with you guys tonight. Was there a Brazilian... "Are energy drinks a type of soda?" Yeah, they are, apparently, ones that are very high in calories. Um, I'm going to look and see if people are coding in Python. Hey, should I prism, thank you for the subscription, for subscribing. Um, should I spin the wheel for you? What's the verdict? "Is it too late?" No, it's not too, it's never too
late. I'll spin the wheel for you. Watch it land on, like, 20 push-ups or something. "Drink water." Well, I got this. I'm gonna drink this for you. Cheers, and thanks for the sub, I appreciate that. I'm going to find someone to raid. mawixi is doing Codewars Python, I think. Is that Python? Yeah, okay, so we'll do a little bit of a raid. Stick around; if you're gonna leave, go ahead and leave now. If you're on Twitch, um, yeah, we'll raid and be positive and be encouraging. And yeah, so I'm going to load up this raid, and I will see you all next time. Thanks for hanging out with me tonight, it's been a lot of fun. Let's do it again soon. 10 seconds. Any last questions? "Are people getting employment based off of Kaggle alone?" Yeah, sometimes it does happen. All right, we're raiding. Bye, YouTube, have a good time. Thanks.
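The Rockstar lookup near the end of the stream ("description contains ... ignore case") might look something like the sketch below. The frame, its `fdc_id` and `description` columns, and every row are made up for illustration; only the idea of a case-insensitive substring match comes from the stream.

```python
import pandas as pd

# Hypothetical food table; ids and descriptions are invented for illustration.
foods = pd.DataFrame(
    {
        "fdc_id": [1, 2, 3, 4],
        "description": [
            "ROCKSTAR ENERGY DRINK",
            "Cola, can",
            "Rockstar Pure Zero",
            None,  # missing descriptions do occur in messy data
        ],
    }
)

# case=False ignores upper/lower case, regex=False treats the pattern as
# plain text, and na=False makes missing descriptions count as "no match".
mask = foods["description"].str.contains(
    "rockstar", case=False, regex=False, na=False
)
matches = foods[mask]

print(matches)
```

Passing `na=False` matters here because a missing value in the mask would otherwise break the boolean indexing step.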

Original Description

Live Coding in python and Data Science! Notebook: https://www.kaggle.com/code/robikscube/fooddata-central-data-scrape

Playlist

Uploads from Rob Mulla

  1. A Gentle Introduction to Pandas Data Analysis (on Kaggle)
  2. Exploratory Data Analysis with Pandas Python
  3. 7 Python Data Visualization Libraries in 15 minutes
  4. Kaggle competition starter notebook walkthrough
  5. Kaggle Competitions: A Beginner's Guide to Winning
  6. Jupyter Notebook Complete Beginner Guide - From Jupyter to Jupyterlab, Google Colab and Kaggle!
  7. Audio Data Processing in Python
  8. Complete Data Science Project!
  9. Make Your Pandas Code Lightning Fast
  10. Image Processing with OpenCV and Python
  11. Speed Up Your Pandas Dataframes
  12. This INCREDIBLE trick will speed up your data processes.
  13. Complete Guide to Cross Validation
  14. Easy Python Progress Bars with tqdm
  15. Economic Data Analysis Project with Python Pandas - Data scraping, cleaning and exploration!
  16. Python Sentiment Analysis Project with NLTK and 🤗 Transformers. Classify Amazon Reviews!!
  17. Get Started with Machine Learning and AI in 2023
  18. The Trick to Get Unlimited Datasets
  19. Video Data Processing with Python and OpenCV
  20. Object Detection in 10 minutes with YOLOv5 & Python!
  21. Pandas for Data Science #shorts
  22. Object Detection in 60 Seconds using Python and YOLOv5 #shorts
  23. Machine Learning for Facial Recognition in Python in 60 Seconds #shorts
  24. Time Series Forecasting with XGBoost - Use python and machine learning to predict energy consumption
  25. Detect Text in Images with Python - pytesseract vs. easyocr vs keras_ocr
  26. Solving an Impossible Riddle with Code
  27. Do these Pandas Alternatives actually work?
  28. Time Series Forecasting with XGBoost - Advanced Methods
  29. Data Science Uncut - Data Shootout Kaggle Competition (Aug 1 2022 Stream)
  30. Kaggle Dataset Creation from Scratch- Data Science Uncut (Aug 10 2022)
  31. Chess Board Computer Vision AI - Data Science Uncut (Sep 7, 2022)
  32. 25 Nooby Pandas Coding Mistakes You Should NEVER make.
  33. DEFCON Hacking AI CTF Solution on Kaggle - Data Science Uncut Sep 11, 2022
  34. More Chessboard Computer Vision AI - Data Science Uncut - Sep 13
  35. Medallion Data Science Live Stream
  36. Community Kaggle Competition Overview - Corn Classification (
  37. Deep Learning Image Classification - Corn Kernels - Data Science Uncut
  38. OpenAI Whisper Demo: Convert Speech to Text in Python
  39. Yolov7 Custom Object Detection in Python Tutorial - Chess Piece Detection
  40. Live Kaggle Coding - Enzyme Stability Prediction - Data Science Uncut Sep, 27 2022
  41. Finding Chess Cheaters with Python! - Data Science Uncut Livestream
  42. Data Science Uncut - Kaggle Community Competition & Chess Data Analysis - Oct 4, 2022
  43. Flight Delay Dataset Creation (Data Science Uncut)
  44. 5 Reasons to Kaggle #shorts
  45. ♟️ Data Science - Chess Data Analysis
  46. EXTREME PYTHON & DATA SCIENCE LIVE STREAM
  47. What is Clustering in ML?
  48. What is K-Nearest Neighbors?
  49. LIVE CODING: Flight Data Exploration with Pandas & Python
  50. Kaggle Survey vs. Twitter Sentiment
  51. If Top Chess.com Players were STOCKS - Live Coding Data Anaylsis Stream
  52. Data Visualization BATTLE!
  53. LIVE CODING: Stocks & Sentiment Analysis
  54. Progress Bar in Python with TQDM
  55. Flight Cancellation Data Analysis
  56. Synthetic Dataset Creation for Machine Learning - Blender and Python
  57. The Ultimate Coding Setup for Data Science
  58. Dataset Creation SPEED RUN - Live Coding With Python & Pandas
  59. Data Wrangling with Python and Pandas LIVE
  60. Forecasting with the FB Prophet Model

This live stream shows how to build a food nutrient dataset using Python and Kaggle, covering data scraping, cleaning, and analysis. It demonstrates data manipulation with Pandas, including merging food, nutrient, and brand tables, and uses the result to look for the highest-calorie sodas.

Key Takeaways
  1. Create a Kaggle account and set up a new notebook
  2. Scrape data from the USDA Food Data Central website
  3. Clean and preprocess the data
  4. Analyze and visualize the data using Pandas, NumPy, and Matplotlib
  5. Create a dataset and automate data processing using a pipeline
💡 The video highlights the importance of data cleaning and preprocessing in data analysis, and demonstrates how to use various tools and techniques to automate these processes.
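As a rough illustration of the "clean, merge, analyze" steps listed above, the sketch below joins three small tables into one tidy nutrient table and pivots it. Everything here is an assumption made for the example: the table layout (`food`, `nutrient`, `food_nutrient`), the column names, the helper `build_nutrient_table`, and all values. It is not taken from the stream's notebook.

```python
import pandas as pd

# Toy stand-ins for the source tables; ids, names and amounts are invented.
food = pd.DataFrame(
    {"fdc_id": [10, 11], "description": ["Cola, can", "Orange soda, bottle"]}
)
nutrient = pd.DataFrame(
    {"id": [1, 2], "name": ["Energy", "Sodium"], "unit_name": ["KCAL", "MG"]}
)
food_nutrient = pd.DataFrame(
    {"fdc_id": [10, 10, 11], "nutrient_id": [1, 2, 1], "amount": [42.0, 4.0, 48.0]}
)


def build_nutrient_table(food, nutrient, food_nutrient):
    """Attach nutrient names and food descriptions to each measured amount."""
    merged = food_nutrient.merge(
        nutrient,
        left_on="nutrient_id",
        right_on="id",
        how="left",
        validate="many_to_one",  # fail loudly if nutrient ids are duplicated
    ).merge(food, on="fdc_id", how="left", validate="many_to_one")
    return merged[["fdc_id", "description", "name", "unit_name", "amount"]]


table = build_nutrient_table(food, nutrient, food_nutrient)

# One row per food, one column per nutrient.
wide = table.pivot_table(index="description", columns="name", values="amount")
print(wide)
```

The `validate` argument is a cheap guard against the kind of surprise seen on stream, where a merge silently produced descriptions that did not look like food names.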