Dataset Creation SPEED RUN - Live Coding With Python & Pandas

Rob Mulla · Beginner · 🎨 Image & Video AI · 3y ago

Skills: LLM Foundations 80% · ML Pipelines 70% · Prompt Craft 60% · Fine-tuning LLMs 60% · Data Literacy 50%

Key Takeaways

The video walks through building a Kaggle dataset from the SEC's financial statement data sets with Python and pandas: a Kaggle notebook finds the zip-file links with Beautiful Soup, downloads and unzips them, combines the files into Parquet, and tries a pandas dtype-efficiency package to reduce memory use.

Full Transcript

like every breath all right all right let's get to it hello everyone it is Sunday November 13 2022. thanks for being here with me today we are gonna go on a speed run are you guys ready for some speed do you feel the need for speed I hope you feel the need for speed I feel the need for speed I'm pumping the mic Mike we're getting ready it's gonna be exciting night and um you know what the amount of time that the length of this stream depends on one thing and one thing only and that's how long it will take me to finish this task so let's go ahead drink some water and then I'm going to show you the task and we're going to see um if it's possible to do it okay so let's go ahead and switch up here to um my single cam and uh speedrun yes that's for sure by the way if you want to join the chat the best place to join the chat is over on Twitch you can see the link to my twitch is up there and let's go ahead and test the twitch chat yes it is loading twitch chat is the main chat so join us over there on Twitch hope you're doing well tonight hope you all had a great weekend looky here looky here we have a data set okay so this was brought up by imposter engineer long time follower Hangar outer on our Discord by the way if you don't are if you're not part of our Discord I'm gonna go Discord here get the link and I'm going to send this to you all because you should join our Discord uh where it's a fun place just to hang out and chat ask questions and learn more about data science but he was posting on their question about a specific data set that has to do with financial Financial financial statements from the U.S Securities and Exchange Commission also known as the SEC I'm not talking about SEC like Southeastern Conference football we're talking SEC like government um we got f311 in chat welcome from YouTube well I'm glad to have you over here all right so he's talking about this data set and I thought it would be a fun challenge to make this into a kaggle data set um from 
its current state so you can see that there are I haven't looked at this before so I'm coming at this completely from scratch I haven't cheated I don't know if that's considered cheating because like I'm making up the rules but um so what we're trying to do is make this data set into something that's more readable and easy to work with as a kaggle data set I haven't looked at the columns I haven't even looked at what the format of the data set looks like but I think we should just go ahead and start over on kaggle and just get started with it let's what I want to do is make this data set creation part of oh excuse me part of a a notebook on kaggle and the reason why I want to do that is because you can see here at the bottom it says it was modified in November of 2022 now I made it so big that that went away modified September 30th 2022 so I'm assuming this data set may be updated in the future I mean like I I hope it would be otherwise our financial system or the SEC would be completely shut down but it looks like they update it once a quarter so I want code to run out there that'll pull this data down for us and it will uh create a data set whenever the new data is formed so but we're going to figure out how to do that SEC data set creation so I did look on kaggle and this is probably a good thing to do first is uh is to look to see if there are already data sets existing so it looks like someone's created these back in 2020 so there are a few this is a month old but it only goes to 2020.
what we're looking to do is have something that goes the complete length of this data set as far back as it goes so it looks like back to 2009 and the update on its own every quarter um but we're gonna make this work so let's do import first of all let me know if the volume is okay if I'm not too loud and stuff and also let me know if you can hear and see things clearly I'm going to go ahead and put this dark theme on because I think it makes it easier to read my screen when I'm coding uh but feel free to ask any questions look we got a Spam account we already have someone in YouTube chat who's a Spam account that means that I've hit it big people if people are spamming us already and now I can hear myself oh and they've already taken it down okay um so we're gonna import pandas as PD and we're going to import numpy as NP and then we are going to maybe import beautiful soup uh that sounds like a good idea what beautiful soup will allow us to do is to search this site this website so let's go ahead and um let's let's back up a second goal of this notebook let's make some goals let's make some life goals uh pull all the most recent data from the SEC website which is this number two download well pull links to the most recent data number two is download the zip files from the urls unzip and format data in a readable data into workable format let's just say workable that's vague enough that I can just skirt my way around it if I don't know how to do that and number four will be to save as CSV and parquet format this these were my favorite formats well my favorite format is to save as parquet because then we can easily read and write the data um and yeah we're gonna do this okay so let's go ahead and check out this website in beautiful soup all right so beautiful soup for uh pull all zip files from URL I've done this before but we're gonna just look it up and and do it again okay so um so this is an example where they have they want to get all the data from a URL so 
we're gonna we're gonna open import beautiful soup uh uh wait no is it from bs4 import beautiful soup there we go and then we're also going to import requests requests will let us actually pull from the URL so we we're giving it this come on come on Notebook don't fail on me here um so we are not going to use this URL we're going to use our URL um what is this doing this is getting this URL uh it's going to use beautiful soup to get the content from this URL using the HTML parser that's our soup and then I don't know what this is so let's go soup on this let's go soup to you know what so this is that whole web page this is this URL we have pulled it and we have like the raw HTML here um but that's not what we want we want the links to these zip files so how will we get that we could do link dot find all so this is their solution they're also using we gotta import re for regular expressions they're saying for link in soup find all and a is I think for I don't know I'm not sure um oh this actually has them opening it so for link in soup dot find all let's try to do find all attributes is I don't know if this is gonna work like star.zip oh print link that did not work is kaggle like a Jupyter notebook exactly yes it is just like what's up pineapple welcome what's up databasics okay so how do I get this how do I get these uh zip files bs4 for beautiful soup 4 um find all zip files download all zip files from Google parent search they're saying to do it like this is super old response if they're printing without a quote around it this is that means it's python 2 solution how old is this this is from 2015.
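The link-finding step discussed above can be sketched roughly as follows. This is a minimal illustration, not the function used on the stream: `get_zip_links`, the sample HTML, and the `example.com` address are all hypothetical, and on a real page the HTML would come from something like `requests.get(url).content`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def get_zip_links(html, base_url):
    """Return absolute URLs for every <a> tag whose href ends in .zip."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips anchor tags that have no href attribute at all.
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if href.lower().endswith(".zip"):
            # Relative hrefs are resolved against the page they came from.
            links.append(urljoin(base_url, href))
    return links


# Hypothetical page content standing in for the real download page.
SAMPLE_HTML = """
<html><body>
<a href="/files/2022q3.zip">2022 Q3</a>
<a href="/files/2022q2.ZIP">2022 Q2</a>
<a href="/about.html">About</a>
<a name="no-href">anchor only</a>
</body></html>
"""

links = get_zip_links(SAMPLE_HTML, "https://example.com/data/index.html")
print(links)
```

Resolving each href with `urljoin` avoids gluing the site's domain onto relative links by hand, which is easy to get wrong when a page mixes relative and absolute links.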
what am I doing here we need something at least in this decade okay this is a more recent soup select what the crap is all this elems soup select bullet list left oh geez how do you comment a block of code all at once um someone's asking that in chat but okay so can control backslash super easy way to comment it all out all right so can we give a soup dot find all string equals zip okay so there's an a for they're doing list comprehension let's see this and zip in text href nope let's see where I did this before I did this before on I'm gonna open up my terminal here and I'm going to go up a directory into my actually into my repos directory I'm going to go into my twitch twitch stream projects um actually I think it was in this the chess cheating analysis and then I had some scripts one of these had to be download pgns this is how I did it before get links all right so this is how I get links find all okay so basically this is the script that I need um let's just open this up in Sublime Text this function that I wrote and then I'm going to copy this in here and it's we're gonna just modify this all right so now we're going to get links and see what it gives us back okay now we have the links to all of the files we did it people we did it we're we're not halfway there we're halfway to doing number one we're halfway to doing number one uh big Larry good to see you too we got etsw in chat we got Yo Doo what you trying to do Diplomat we are trying to get some SEC filings data I really should um put that link there oops I'm accidentally clicking everywhere look I'm downloading files by accident let's do this okay so we're gonna we got these links the thing is this link should be it's gonna be sec.gov plus this so I think that's kind of what I just deleted so it's going to be um www.sec.gov plus now the link should actually work did it download did that work if I click on this what does it do let's try to come on I click on it it makes it just disappear that's
not correct all right there it goes it downloads it all right now I need to remember from this code how I downloaded the zip files um man this is just like all the code I've already done before this is like cheating all right so we're gonna put all the zip files just in the local directory Yeah so basically all we do is we pull in the file as its name we're just making everything into the local directory not into this data Zips directory this is pretty easy so we do get links and then this returns our Zips and then we download Zips come on I want to go to bed early let's let's do this and then unzip files this should be this should all work just Name download is not okay so download Zips is it should be correct also uh let's get rid of this print statement also put this stuff up here I see you all chatting there I'm going to get to you in a second just yell if like you can't hear what I'm saying urllib dot request all right urllib dot request let's just focus here it's downloading yes it's downloading people all right now that that's downloading let's check out the Chatters the chat what's up chat um I'm going all the way back to where databasics said heyo pineapple it's more kaggle site with data sets yeah yeah it's kaggle's more than just notebooks but it's got that good evening etsw said um no need to reinvent the wheel that's true I tend to do that more than I should all the time you're not cheating if you're using your own yeah but we're all we're all using other people's code because I'm using pandas and I'm using requests and I'm using beautiful soup and I didn't write any of that code no Nanda what does unzip files mean oh yeah this is gonna work this is gonna break um so basically what I need to do is I need to stop this and then I need to actually just run it on like the first let's say five so I stopped that and then I'm going to do ls -l to see if the files are here they are downloaded so we have all of these so far downloaded and let's also add a
little bit of tqdm uh from tqdm.notebook we're going to import tqdm and then we're going to go ahead and wrap our Zips here in tqdm so when it's running we get to see that and also we don't want to print every one um let's go ahead and delete this stuff to clean it up we can reuse this on a lot of sites basically anything that has a bunch of zip files uh we'll be able to make a data set from okay so our data sets here to my beautiful slash version so I can see the size so some are like biggest ones like 50 Megs I don't see this one's 53. so they're not that big but we're gonna just um let's remove everything that'll remove everything that's a zip file now if we LSG flash color this we will see nothing oh we'll see this virtual documents I don't know what that is and then we're going to retry this bad boy so it's going to run just for five it got all the Zips now it's going to get all five of those and then let's see if it unzips the files correctly um this shouldn't be pgn dir glob is not defined okay so I need an import glob and what I think we could do okay we don't have to actually run this uh do I even know what this data set is I don't know that's what I'm realizing what are we going to even call this data set the financial statement data sets below provide numeric information from the face financials of all financial statements well that means nothing to me from the face financials of all financial statements this data is extracted from exhibits to corporate Financial reports filed with the commission so is it just everything yeah all right so module object is not callable oh yeah cause from glob I need to import glob yes broseph let's also clean this up um we're gonna yeah beautiful soup should go down there maybe with tqdm this makes a little more sense I don't know why it makes sense to me um I don't even need to do this anymore and now we're gonna run it for these five I think if I go into maybe it's in the console no this does show my working
directory what's in here I think I can refresh this and see what's what's there shutil is not defined so I need to import shutil hey guys just a reminder if you want to join the chat come over to us on Twitch the link is up there to my twitch and we'd love to have you um people were asking about uh uh heart rate monitors and stock market stuff I don't know yeah I don't know what to do next with the stock market stuff I think I need to download all that uh sentiment data offline but gonna be honest I just haven't been able to okay so here's the problem oh here's the problem folks we've run into our first real problem of the night and I keep on clipping the mic but um it looks like each of these files when we unzip them will have the same file names in it um so let's see these files that we have yeah I think these are all just the files that were unzipped from these last ones so if I do PD ah sorry back to this if we cat the num.txt and we just do a head command on this we should just see like the first few rows okay so there's a few ways we can deal with this what are the best ways to deal with this can we unzip them each to a folder I don't know so there's num there's pre uh there's a readme that might be a good thing to check out was this Microsoft Word no no I don't want to see this Microsoft Word give me a markdown file people come on sec the SEC won't let me be or let me be me anyone get that reference all right let's read csv pre.txt I think we're gonna have to give it a special separator right uh that's gonna be a tab separator okay that parses okay um yeah so like I was saying we could we could unzip each file to a specific directory then deal with it later probably that's a good way to do it or we could just unzip the files read them in as a CSV then concatenate them with whatever else we're working with and then save it off altogether but I think let's try the first way let's try the first way so so I think what we need is the extractor which is we can
remove this from here but basically we're going to change this unzip files function to take all the Zips file names in the zip folder so basically if we do this right so this is like this ah ack double Act these are all the files we just wanted to unzip into a folder which is this file name ah FN split of this on this like like it's going to be one and then replace this that'll be like our out file name and that'll be our extract directory so now if we do unzip files they tried to shut me down on Twitch but it feels empty without a pitch that's a nice try etsw I know what you're trying to do if they don't know what you're trying to do that's fine but at least I know what you're trying to do okay we unzipped files let's LSL on this now we have directories that's good because now we can work in directories so now if I LS into this directory Dash L we have a readme HTM and all these num pre and all this stuff for just this so 2013 quarter four make that backslash T separator um let's Okay so let's just talk about what we've completed so far we have completed and of course the kaggle notebook like doesn't want to show me the first cell that I wrote which is above here it says oh yeah like I got I hacked around it by creating empty cells above it all right so we pulled the links to the most recent data we pulled all the zip files so we have those if this notebook was to run in two years it should be able to see all of the latest um let's clean this up all the latest zip files it would actually pull in the latest ones not just the ones that we're looking at now and then we we should be able to uh have an updated data set every time it runs we can set it to run every quarter or even every month just to be sure and then yeah um that's pretty cool we did number one number two is download the zip files from the URL so we parse the URLs to zip files we download them and we even unzip them into a format that's workable so this is where we're on we're on number three because we 
unzip them into folders but I don't want to actually have it do all of that now I can actually save this version now so that can run in the background but it's just running for the first five let's figure out those first five what we're gonna actually do with the data and then we can move on from that by the way if you want to join the chat twitch Medallion stallion right up there feel free to join us so we've gotten this far we'll let's um let that run in the background so I've committed that and it's running by the way let's see what chat is saying nothing really all right so I guess we're doing really good or really bad one of the two now I want to see what the mem memory uses is memory usage on this I'm doing really good thank you XC x u t w uh xcode loves it too nice how big is this file I think we can reduce the file size of this and I wanted to so I was on the on I'll run the other day thinking to myself you know I made that YouTube video about how you can reduce memory usage and this is just a commonly known thing you can reduce memory usage of your pandas data frames by really like casting the data types for each column correctly like using smart um and a smart approach to uh casting your data types this is only like 216 megabytes though it's kind of small um so I was thinking why don't I just make a package that automates that process right um and then sure enough I went back to my desk and I Googled it and there is a package that exists that already does that so let's try that out to note trying to learn tqdms up I just saw from you maybe I can apply it to my project hey etsw check out that's a good plug for my YouTube channel uh let's go to your Channel I have a whole video on tqdm although you could just read the docs but also check out my video if you haven't checked out my YouTube yet go to exclamation point YouTube on Twitch oh wait sorry on here and then um yeah subscribe because that's the best thing you could do for me also follow on Twitch that 
would be nice thank you all who have followed um Okay so the tqd one t cutie Cutie Pie it's somewhere here you just gotta search I think if you just search tqdm it'll pop up um so pandas data data frame memory use package automatically reduce python python package uh efficient pandas data frame pip and there we go pandas dtype efficiency it only took me like 45 minutes to find it I don't know if it's been maintained recently two years ago so hopefully this still works uh let's try it out let's just paste this efficiency package we're gonna pip install this should have done this Yusuf welcome to chat Diplomat no such thing progress bar tutorial okay cool you saw it uh uh run pip in the root blah blah blah okay so now we have this so apparently all we need to do is okay here's the actual example this is what we're supposed to all we're supposed to have to do all right so now we have a checker object and we can do identify possible improvements float columns if reduced Precision requested checking integer columns to see what whether smaller size can be used checking string columns all right what what did it find all right so then it says potential improvements I like the way that they wrote this though it's like very pythonic uh so they say cast this as a float 16 ddate that should be a date okay so we got a problem here clearly ddate why did they do it like this so they have a date column called ddate where it looks like they have like it's not going to convert into a date time correctly I need to I need to cast it correctly so pd to_datetime so just if I read in this ddate column it doesn't work correctly because it's looking for it sees this as an integer and then it just adds it to 1970 which is like the very first date to work from uh yearfirst should be true let's see if that works and then UTC I saw a joke on the internet the the girl and guy goes into a date and restaurant and a girl asks Guy what's your favorite date be like and the guy I was interested
oh day month year that's that's great uh you only saw the link to discard okay so my um twitch is sorry should put that here click that link to get to the twitch chat people twitch chat is where it at so how do we parse these dates does anyone have any ideas any I know there's a trick to it all right so should we do this it looks like from reading these docs that there's this format right and we can set format to be okay so like this would be day month year which is not the way we have it we have it as year month day with no slashes yes maybe we don't even need a year first there we go that was so easy okay so there's that okay so this seems like a lot of steps but let's see what this does so we use this Checker cast data frame to lower memory version it really didn't save that much space also this seems like a lot bigger size than I thought it says it's we're using nine gigs of RAM what are we using nine gigs of RAM for I thought we looked at this in the data size was much smaller what's going on here my head is exploding maybe because I wasn't reading the num file before see here we saw this it was DF info was 216 megabytes am I going crazy here now this says it's I gotta just delete all this junk I'm all over the place okay so we gotta we gotta go back here this is eventually gonna be like our do everything sell which we'll download I don't know why my cursor keeps on getting thrown to the bottom here but um all right so this is step one download files and unzip it's kind of step one and two so that's done here getting the links downloading links unzipping files now we're on to step three which is read and format data step one and two one two three into the four we got some people joining in chat year month day is better whoa don't click the link what link click every link no viruses out there so I kind of think our our package that's supposed to reduce our efficiency might be a little bit overkill for this why is it taking forever to read this this is one thing 
I've noticed about the the kaggle notebooks lately it's like they get hung up or something like reading in this file should not take forever and then when I hit stop that's when it kind of sometimes will actually finish um so I know I can parse dates in this when I read in the CSV so day for uh so I need a date parser object do I need to pass that into this let's try to parse the dates that's this ddate column okay so it doesn't have this parse dates column come on don't take forever to tell me just the list of columns oh ddate but it's going to do it incorrectly I have a statistics joke but it's not significant nothing significant this is driving me nuts like CPU what are you doing right now I think the only thing that's running right now is you're trying to give me the um just head this data frame comment that out um what do we think should we use this pandas data frame efficiency package thumbs up or thumbs down what does chat think thumbs up or thumbs down I think it thumbs down for now yeah yeah no one no one wants to say anything so I'm going to assume that means thumbs down look I stopped it and started again then it displayed this um the dates the ddate didn't format correctly it did it did parse it correctly why are you crying etsw let's delete this I just don't want to go through this you couldn't find thumbs down okay crying face is your thumbs down I think a lot of these can be converted to um I guess not accrued investment income version quarters uom uom 1700 for data frame that's shape is please don't take forever to tell me the shape of this data frame come on Notebook I'm rooting for you but all right so the data frame is 3 million rows with 1000 I think we can do this I think we could do this like astype category all right so I'm gonna do this df.info right now it's 216 megabytes that's like nothing let's not worry about it um to do what the heck even is this data so let's list what's in one of these directories and we need to take a
step back here and actually get a handle on what the files are in one of these directories all right so let's start listing it out we have num.txt we have pre.txt we have so I'm ignoring the readme sub.txt and we have tag.txt now I'm also gonna go ahead and let's just go to my downloads directory and then we're gonna unzip this 2013 quarter four version and we're gonna uh open this readme all right this is the readme that I didn't want to look at earlier but it has all the definitions of what we're looking at so when we create our kaggle data set we'll probably want to just copy all this junk in there so people know what we're talking about but this will explain what things are all right so sub this is submissions so let's start in their order they start with sub submissions the submissions data set contains summary information about the entire EDGAR submission what's a EDGAR submission EDGAR electronic data gathering analysis and retrieval system so it's like everything that you have to file with the SEC I think goes through EDGAR or Edgar Edgar Edgar Allan Poe has anyone ever seen that uh YouTube video where it's like it's Jackie Chan it's a game where everything's Jackie Chan oh or maybe it says don't say Jackie Chan Edgar Allan Poe and then someone in that video says Edgar Allan Poe the submissions is that okay while other columns of data were sourced from the this also get an idea of the size of these files so sub is a smallish file it looks like read in sub example we're going to call this sub uh let's go with 2015 quarter three might as well separator is a tab let's change our pandas options this should be display all right so there we go there we go there we go there we go and then we have this uh we're gonna read in this as sub all right cik adsh accession number the 20 character string formed from the 18 digit hmm how can we make this cleaner I can't just paste this over into markdown very easily yeah this is going to look ugly can I
try to use like LibreOffice there we go LibreOffice it kind of gets it then I think I can copy and paste in here undo what I just did oh it's going to try to paste in it copy what it's going to paste like this what is going on here and that is not what I want I'm at kaggle days Paris and I get no sleep any recommendations for tomorrow you're gonna do awesome what's going on there I wish I could go to a kaggle days I was supposed to speak at a kaggle days in California guess what it was supposed to be I think either yeah it must have been April 2020 right before COVID hit so it did not happen all right here's a question why is FY like this should be fiscal year right this should be able to be an INT but there must be some null values how many of these are null someone was naughty and they didn't input the fiscal year and FP uh values I don't even know what to do with this okay so sub has the submissions data set contains summary information about entire EDGAR submission we're just gonna have to go with it that's what that is all right let's look up num all right next is tag the tag data contains all standard taxonomy tags not just those appearing to date and also includes all custom taxonomy tags defined in the submissions source wow definitely lawyers wrote this where's the interesting information num numbers all right this this might be the juicy stuff numbers I think numbers is where I have my eyes out for some juicy information being released um because if this has like numeric data then I guess we should have like earnings shouldn't we have like Financial earnings of different companies like should we be able to see twitter's um Financial earnings free to look up Twitter or something on this trying to make this exciting as exciting as I can people I'm trying here give me at least that all right pre What is pre prep presentation of statements I'm sure this stuff will be super interesting once I just figure it out once I figure out like what to make
interesting out of it but I don't even understand what it is enough uh the pre data set contains one row for each line of the financial statements tagged by the filer the source data set is as filed blah blah blah blah by the way let's see if my one version of this from earlier actually ran okay this broke I'm gonna have to figure that out okay um so I'm what I'm thinking about is can I just do all right so PD no let's do glob star dot sub dot txt this should give us all the sub files but does sub in it somewhere have the year that it's filing for I wonder if there'll be duplicates from the quarters this is the date that it was filed yeah we've we've done this before we're gonna do backslash on this and then we're gonna assign file name is f we're going to append these data frames we're going to call it subs now we have a Subs which should be in a shape should be pretty large okay I guess Subs are not that big how big of a sub we have a foot long here is this a foot long sog non-data what's up in chat all right so now we have a file name so at least we link to what file name it originally came from Five Dollar Foot long five dollar foot long all right so this does all the subs and then we're gonna save it as a parquet file and that in theory does our number four on our check it checklist saves it as a parquet file we just need to do this to all four of these so tag we're going to call this tag um this is how we got the sub files right so let's delete these tag files this is going to be tag this should be good and what is the tag actually what is this what is this what is this tags where are the descriptions for it not here it is in the tag section which is just tags it's just call tags so we're going to call this tags combined so that bad boy didn't take too long then we're gonna do num numbers combined num files nums I have an idea I have a um Inkling um feeling that these are going to be there's going to be an issue with this because some of these files are a lot
bigger than the other ones and haven't we just been looking at the small one so far yeah I think num is the really big one so maybe we'll have to save each of these off as their own parquet files Mr Gabriel welcome to chat how are you doing there we also have someone from Portugal if you want to join the twitch chat that's the best place to hang out with us please do join us okay so numbers didn't it wasn't that bad yeah hey what was that Mr Gabriel thank you for subscribing let's go ahead and spin the wheel every time we get a new subscriber on Twitch Spin the Wheel by the way if you haven't checked out my YouTube yet he knows it he or she Mr Gabriel I'm assuming oh my favorite and it landed on play of sweet bass lick I'll be right back I have a special one for you [Music] are you doing all right you guys did not come here to hear me play horrible Bass but that's for you thank you for subscribing you can subscribe uh Royal loyal is in chat welcome you're trying to learn bass but missing the bass well that's that's kind of a requirement unfortunately um okay so we're almost there we got the number files we just need these pre files now we're also making an assumption that they're always going to be this same format okay so this is presentation I could call this POS that that might not be cool all right so this should be good do I think log normal stuff is BS what does that mean log normal what is he talking about foreign these songs are definitely copyrighted I'm realizing I came here you said now I need to go I just got here to leave my Prime oh thanks thanks for leaving your Prime Mr Gabriel I appreciate it pre files in for loop yes you caught it you caught it thank you like when do you differentiate with log to make it stationary not quite following log normal stuff is BS pretend you have a Time series that isn't normal distributed okay so like yeah so then you can apply like a uh exponential transform to it to make it normally distributed you can always
try transforms all right here we go um I think up until this point if I save this we've achieved our goal technically technically that standard practice and you hate it oh you hate doing transforms it makes it harder to like um it makes it harder to interpret so I feel you there I like how they're disclaiming that this data might not be correct so I want to just see if we can figure out one interesting thing from this data set like if we could pull in first first what I'm doing is I'm submitting this data set as a new notebook to see if it'll run all the way through and just do what we what we've coded it to do so far which is um number one find all the data sets zip files number two download them and number three is uh save them concatenate them together and save them as a parquet file now keep in mind we're doing that just to the first five I've subsetted it here just for like testing I think once again once it gets bigger it might not be small enough to save or this might cause like an issue with memory when we're doing this but I don't know for sure I guess if we think about it if that one file was like 250 megabytes and we have going back to 2009 so we have like 13 years and we have four quarters per year um yeah that's gonna be pretty big right that's gonna be pretty big so maybe what we need to do all right did this work this ran and then the data output is all here as our combined parquet files hmm let's wrap these in tqdm it's going to be too big right like look we did five files in the amount of ram that it used is seven gigabytes um I'm trying I'm starting to think that I'm literally and catching Edgar data and doing an analysis for my job so funny hey that that uh zq then you can explain to me okay Kent do you think from this data that we could pull out like Twitter's SEC filing every year or Tesla and try to get some information like make a plot with it for sure okay could you maybe tell me where to look in this for that um and then let's also remember 
the small files here so num is big and pre is Big the other ones are small so I think when we get to hear num and pre we might need to save it off by file and not do this concatenating it might be good to concatenate by year so it starts with 2009 right this data starts in 2009 and let's have it go to like 20 30 or something crazy so if we're gonna do for a year in range right all these years so I guess it goes to 20 29 but not like it matters because there are nothing for those years and then we're going to get the number files for that given year so this would be like year has to be in here so this is going to look like this it'll be the year and then the files all right so 2009 has no num files oh because we haven't downloaded them in here but it should concat then we should be able to concatenate by year and then we'll at least have a combined by year that makes sense no objects to concatenate uh if length of num files equals zero then it's just going to continue so it'll just skip there and then let's also do the tqdm here I don't think we need to do the tqdm in here yeah I think this is good and then let's just do the same thing with the pre files presentation files here we go presentation this is where I totally uh pre-files all right all right so it's going to do a tqdm for all these years and the last two years are going to like be super fast or search the reports for keywords uh thanks Muhammad in the YouTube chat come over to Twitch join the chat hey guys congratulations we hit 100 messages in our chat today okay that that you've been helping me out 10q is the easiest I don't know what that means even get the balance sheet and you can do analysis on it is that what you mean the 8 Q or 10K 10q is what I use all right is that in one of these files okay so um before I get too far I think this is working should we let it off the leash and try to do it remove this this uh debugging thing where we only had five files and try to run this on all the files see 
how it does I think we're gonna we're gonna be testing the limits um so what is that in like where do I find those are egar filings this this should have all the egar hey we gotta what what just happened on Twitch that you subscribe with Prime thank you so much let's go ahead and spin the wheel for you you might get your own bass lick you might get me to do a hamstring stretch which would I like oh this is my favorite I have to yell scr Kevin as loud as I can Kevin wake up everyone in the house that's for you though add you Wu noise what is UW noise can I show the entire code um yeah so the just to recap here a little bit ah you know what I can do actually is I can take this notebook I can edit no not edit I'm already editing I could take this notebook output settings make it public for you all save changes go to the note notebook and then paste this in the chat now you should be able to go click that link and you can see what I've worked on here so far this should load up for you give me a thumbs up if you can re if you can click that link uh then you should be able to follow along y'all okay so now that we know that works we are going to try to find what my boy that squiz is telling about 8 Q 10q what the heck is 8q 10K 10 Q SEC filing okay so these are like specific filing types let's turn off the Dark theme for this site it doesn't need it 10K requires companies to provide financial information ongoing periodic statements so can I find 10K in here is it going to tell me yeah here we go look at this document says form interactive data attachments to forms it should include stuff from 10ks I'm getting excited but so then how do I find a 10K uh uh we now need to like figure out what these columns mean but that's boring but we have to but it's boring but we have to but it's boring but we have to okay so we're getting into it we're getting into the the meat of this I think maybe sub s should tell me this Subs data set should tell me the submission the filing made by 
company X and then maybe it links to this adsh the accession number is the 20 character string formed from the 18 digit number assigned by the SEC to each EDGAR submission so do the other filings or sorry data sets then have this like as a common key that we can then look up I thought I saw adsh yeah so that's also in num so if you go to name here Wells Fargo has 22. okay so now we're finding like companies that we've heard of let's try to find Twitter so contains Twitter maybe I need to lower it so if I sum this we need to at least get one then we know we have a hit cik is the right thing by the way what is cik also we have someone I should probably open this in a new window but someone's writing in a different language I want to translate it good morning everyone someone said in Russian hello welcome on YouTube come over to Twitch that's where we're having the chat okay so there's no Twitter which I'm not surprised by because I didn't lower it now when I lower it there are five filings that have Twitter in the name alright so here are all the filings from Twitter we've only run it so far in this notebook the other one that's running in the background um oh shoot the current version has an error so we might need to fix that click here to see the latest one run let me see the current version with the error what is this saying an error for I I don't see what the error is come on you got to give me like show me the error show me the money with the errors I'm not surprised it errored certainly think maybe I should do this year thing with all the other stuff like the tags even though these are super small files 22 megabytes anyways we're gonna we have a new goal which is to find Twitter filings all right you know what I'm gonna do I'm gonna do like another debug where I only filter down to the 2020s filings all right so this is like if I sort this all right there we have 2020 all the way to quarter three of 2022 that'll give us some good recent data to work with
which will be a little bit interesting and then we'll be able to test to make sure our script is running for the 2020s data so it's a little bit more manageable 11 files what's up Vikes fan are you a Vikings fan are you like a Kirk Cousins Vikings fan is that what the Vikes is all about because that game today was insane absolutely insane Bills Vikings game today um I'm also a Washington fan so seeing Kirk Cousins doing well makes me happy but also a little bit jealous because we let him go from our we let him go from our team for free because they didn't want to pay up and he's turning out to be a pretty decent quarterback hey Arash welcome so these have all downloaded and now I think it's unzipping them we can go to this files directory to see what we got here oh it's gonna actually uh it's actually gonna run for all these other like ones that we had on earlier that's okay so let's look at the 2022 quarter three filing folder yeah so at least the format stays consistent between 2009 and 2023 which 2022 which is pretty impressive so let's try to remove 20 20 uh 20 201 star so we uh now if we ls it shouldn't have any of the 2019 zip files but it does have the directories so we have to remove these directories recursively I think I can just do this yes now we've removed everything except for the 2020s very good and now we're going to save this version and see if that's at least stays off and try to run these bad boys oh pretty fast Snappy look at that we got an error where did it error all right so we might need to set low_memory=False let's try to get through it that that was in the what file the sub file we run that one also rerun this is the one that's going to take a while the num data oh look but it just zoomed through what's my job in specifically I work it I work um for a company that consults and we do data science analytics stuff and we do a lot of first like sports mainly the NFL so Sports data analytics for health and safety but I have
worked previously um in Pharmaceuticals I've worked for in the hospitality industry I've worked for a power company I've worked for the federal government the U.S federal government uh oh you did sports betting stuff and now you're a Quant nice I am not allowed to do anything related or even close to sports betting stuff that would get me fired so I don't do anything with bedding but I do like sports stuff non-betting related if you know what I mean all right data how you doing data go to viewer is this running in the background yeah it's still running here okay so let's think about it all right now we got it we got our submissions now we have all the Twitter submissions it's a lot more than what we had before so what's the shape of this I'm in love with the shape of you 56 filings they must each be different filings so let's check out what column is going to tell us okay so here we go form and that was saying you were saying to look at which one we look at the 10K uh form what do we got here all right so mostly 8K filings then they also have their 10K filings what is that like the annual one do I use Pike Pi spark in my last job I did not in this one uh govey rep yeah for sure so which one should we use 10K I use Q so 10q so let's see where the form equals 10q and then let's get these adsh all right so this should get us the a d s H's for Twitter now let's go into nums query well let's just pull this first one I'm trying to think if I do num query adsh equals we have to make sure we're pulling the correct year yeah so this filing is not in this year so we could do adsh is in all right so these are all the 10q Twitter filings and then now that we have all of that in this nums I'm guessing we need to look at like the interesting tags because this value is probably going to be related to this tag cash equivalent restricted cash and restricted cash equivalents cost and expenses that's the balance sheet so if we do like and query tag equals costs and expenses are these 
all for the same this is what's confusing to me okay so this is for each quarter within this date uppercase m year month year I always forget I think month is lower case what even is the ddate they say ddate is the period end date so let's set the index as the ddate value plot kind is bar plot we also should just do astype datetime64 and as like day like I don't know why this is going to the nanosecond so this should be costs and expenses at Twitter does that that look right so 2.5 million wait nine zeros 2 billion you didn't know you didn't know hello friend I'm looking for flake uh check the links yeah exclamation YouTube people are looking for um other videos by the way if you're watching this later and you haven't subscribed on YouTube please do that you can go to my site if you go to live you'll see all the live streams so this one I worked on flight data this one I also worked on flight delay data so I'll paste these in since people were asking about them there we go my normal streaming schedule is either Tuesday Thursday or Sunday and I don't do it every time though every one of those days it's just usually if I am going to stream it's going to be those dates and you kind of got to catch me when you catch me that's why I'd say uh check out my Discord server which I will also paste here uh definitely join on Twitch and then uh you can also check me out on Twitter that's where I hang out do I ever work in the command line dude all the time that's what we're doing here so I just released a video this week where like I was talking about my setup and how I use all my setup and I do have like a kind of a clickbaity type title for it ultimate coding setup for data science but I mean it's just my setup for data science so I say that at the very beginning I kind of get past the whole clickbaity thing but I talk about what I use I use Ubuntu I use tmux a lot I use uh JupyterLab and Python of course so this is the costs and expenses at Twitter does that seem
right Twitter costs and expenses SEC filing let's turn the Dark theme off on this so this is their form for the year of 2020. Twitter 10K report 2022 uh so this is actually from Twitter's website 2021 annual report uh let's do that annual report should be good enough look how cool this is they have a really cool annual report I wish I could work on this I like how it's like super cool first two pages makes me really excited gonna it's like cool magazine that we're gonna read and then we go down and oh it's just it's just an SEC filing this is super boring those first two pages get you so excited for some crazy awesome annual report uh uh costs and expenses cost and expenses where are we right now in the three months ended in December 31st 2020 ad engagements decreased 12 through the three months the following table sets forth the consolidated statements of operations data for each period presented in thousands cost and expenses so does this mean 1.7 billion because it's in thousands confused and there's also a footnote costs and expenses include stock-based compensation as follows below in thousands uh someone on YouTube is asking if I work for a company yeah we just talked about that I do work um do you work for a company this is just my fun nighttime fun stuff I do this for you I do this for the people there's no way I make this like negative money streaming and making YouTube videos I have no idea why I do it I think it's fun and I like to give back that's what I tell myself but no I definitely do not make money doing this it wasn't a stupid question I don't mean to respond to it like like it's stupid questions totally legit question but the answer is no the answer is no okay let's see by the way I did paste this oh the last run of your kernel had an error again I thought we figured all this out where did it error why did it error it looks good oh click here to see the current version okay the current version expected bytes but got a float type object okay okay
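An error of this shape (expected bytes, got a float) is commonly raised when a DataFrame column that mixes strings and floats is written to parquet. By default pandas parses large files in chunks, which can leave a column with mixed inferred types. Below is a minimal sketch of one way to guard against it, assuming the SEC text files are tab-separated. The helper names and the sample values are hypothetical and are not taken from the stream's notebook.

```python
import io

import pandas as pd


def read_sec_table(source):
    """Read one tab-separated table (assumed format), inferring each
    column's dtype from the whole file instead of chunk by chunk."""
    return pd.read_csv(source, sep="\t", low_memory=False)


def mixed_type_columns(df):
    """Return names of object columns that hold more than one Python type."""
    mixed = []
    for col in df.select_dtypes(include="object").columns:
        kinds = {type(value) for value in df[col].dropna()}
        if len(kinds) > 1:
            mixed.append(col)
    return mixed


def stringify_columns(df, columns):
    """Return a copy with the given columns cast to pandas' string dtype."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype("string")
    return out


# Hypothetical frame with one column that mixes a string and a float.
demo = pd.DataFrame(
    {
        "adsh": ["0000000000-22-000001", "0000000000-22-000002"],
        "zipba": ["94103", 94103.0],
    }
)
bad_columns = mixed_type_columns(demo)
fixed = stringify_columns(demo, bad_columns)

# The reader also accepts in-memory text, which is handy for quick checks.
tiny = read_sec_table(io.StringIO("adsh\tform\nA-1\t10-Q\n"))
```

Passing an explicit `dtype` mapping to `pd.read_csv` is another option when the column types are known in advance.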
I've seen this before I'm surprised it didn't error for us here then submissions combined why didn't it error for us here is it because this doesn't have low_memory=False yeah it doesn't have low_memory=False when here it does is that because let's make sure version 5 failed on that same thing yeah what's the what's the deal low_memory=False I have it right here let's save this version all right so these are the filings for 2020 22 21. uh let's sort_index also one before we plot this because it's a very deceiving plot otherwise I guess it's deceiving no matter what so expenses and costs went way up let's see what unique tags we have that might be interesting Revenue property plant and equipment and finance lease right of use before depreciation oh my gosh that's super long common stock shares authorized cash cash equivalents restricted cash restricted equival
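The lookup walked through in the stream (find a company's filings in the submissions table, collect their adsh keys, then pull one tag from the num table and sort by period end date) can be sketched roughly as follows. This is a sketch under assumptions: the two small frames are made-up stand-ins for the real tables, the tag spelling `CostsAndExpenses` is assumed, `ddate` is assumed to be stored as a yyyymmdd integer, and the function names are hypothetical.

```python
import pandas as pd

# Made-up stand-ins for the real sub and num tables (illustrative values only).
sub = pd.DataFrame(
    {
        "adsh": ["A-1", "A-2", "A-3", "B-1"],
        "name": ["TWITTER, INC.", "TWITTER, INC.", "TWITTER, INC.", "WELLS FARGO & CO"],
        "form": ["10-Q", "10-K", "10-Q", "10-Q"],
    }
)
num = pd.DataFrame(
    {
        "adsh": ["A-3", "A-1", "A-2", "B-1", "A-1"],
        "tag": ["CostsAndExpenses"] * 4 + ["Revenues"],
        "ddate": [20210630, 20200630, 20201231, 20200630, 20200630],
        "value": [1200.0, 900.0, 3500.0, 50.0, 700.0],
    }
)


def filing_ids(sub, name_fragment, form):
    """adsh keys of filings whose company name contains the fragment
    (case-insensitive) and whose form type matches exactly."""
    name_hit = sub["name"].str.lower().str.contains(name_fragment.lower(), regex=False)
    return sub.loc[name_hit & (sub["form"] == form), "adsh"]


def tag_by_period(num, adsh_values, tag):
    """Values of one tag for the given filings, indexed by period end date."""
    rows = num[num["adsh"].isin(adsh_values) & (num["tag"] == tag)].copy()
    rows["ddate"] = pd.to_datetime(rows["ddate"].astype(str), format="%Y%m%d")
    return rows.set_index("ddate")["value"].sort_index()


twitter_10q = filing_ids(sub, "twitter", "10-Q")
costs = tag_by_period(num, twitter_10q, "CostsAndExpenses")
```

Calling `costs.plot(kind="bar")` would then draw the bar chart in date order. The real num table may hold several rows for the same filing, tag, and date, so a real analysis would likely need further filtering before plotting.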

Original Description

Dataset creation on kaggle of SEC Filing data. Using kaggle notebooks.
The notebook: https://www.kaggle.com/code/robikscube/sec-dataset-creation/
Watch live on twitch: https://www.twitch.tv/medallionstallion_
My other videos:
Speed Up Your Pandas Code: https://www.youtube.com/watch?v=SAFmrTnEHLg
Intro to Pandas video: https://www.youtube.com/watch?v=_Eb0utIRdkw
Exploratory Data Analysis Video: https://www.youtube.com/watch?v=xi0vhXFPegw
Working with Audio data in Python: https://www.youtube.com/watch?v=ZqpSb5p1xQo
Efficient Pandas Dataframes: https://www.youtube.com/watch?v=u4_c2LDi4b8


Playlist

Uploads from Rob Mulla · Rob Mulla · 58 of 60

A Gentle Introduction to Pandas Data Analysis (on Kaggle)
Exploratory Data Analysis with Pandas Python
7 Python Data Visualization Libraries in 15 minutes
Kaggle competition starter notebook walkthrough
Kaggle Competitions: A Beginner's Guide to Winning
Jupyter Notebook Complete Beginner Guide - From Jupyter to Jupyterlab, Google Colab and Kaggle!
Audio Data Processing in Python
Complete Data Science Project!
Make Your Pandas Code Lightning Fast
Image Processing with OpenCV and Python
Speed Up Your Pandas Dataframes
This INCREDIBLE trick will speed up your data processes.
Complete Guide to Cross Validation
Easy Python Progress Bars with tqdm
Economic Data Analysis Project with Python Pandas - Data scraping, cleaning and exploration!
Python Sentiment Analysis Project with NLTK and 🤗 Transformers. Classify Amazon Reviews!!
Get Started with Machine Learning and AI in 2023
The Trick to Get Unlimited Datasets
Video Data Processing with Python and OpenCV
Object Detection in 10 minutes with YOLOv5 & Python!
Pandas for Data Science #shorts
Object Detection in 60 Seconds using Python and YOLOv5 #shorts
Machine Learning for Facial Recognition in Python in 60 Seconds #shorts
Time Series Forecasting with XGBoost - Use python and machine learning to predict energy consumption
Detect Text in Images with Python - pytesseract vs. easyocr vs keras_ocr
Solving an Impossible Riddle with Code
Do these Pandas Alternatives actually work?
Time Series Forecasting with XGBoost - Advanced Methods
Data Science Uncut - Data Shootout Kaggle Competition (Aug 1 2022 Stream)
Kaggle Dataset Creation from Scratch- Data Science Uncut (Aug 10 2022)
Chess Board Computer Vision AI - Data Science Uncut (Sep 7, 2022)
25 Nooby Pandas Coding Mistakes You Should NEVER make.
DEFCON Hacking AI CTF Solution on Kaggle - Data Science Uncut Sep 11, 2022
More Chessboard Computer Vision AI - Data Science Uncut - Sep 13
Medallion Data Science Live Stream
Community Kaggle Competition Overview - Corn Classification (
Deep Learning Image Classification - Corn Kernels - Data Science Uncut
OpenAI Whisper Demo: Convert Speech to Text in Python
Yolov7 Custom Object Detection in Python Tutorial - Chess Piece Detection
Live Kaggle Coding - Enzyme Stability Prediction - Data Science Uncut Sep, 27 2022
Finding Chess Cheaters with Python! - Data Science Uncut Livestream
Data Science Uncut - Kaggle Community Competition & Chess Data Analysis - Oct 4, 2022
Flight Delay Dataset Creation (Data Science Uncut)
5 Reasons to Kaggle #shorts
♟️ Data Science - Chess Data Analysis
EXTREME PYTHON & DATA SCIENCE LIVE STREAM
What is Clustering in ML?
What is K-Nearest Neighbors?
LIVE CODING: Flight Data Exploration with Pandas & Python
Kaggle Survey vs. Twitter Sentiment
If Top Chess.com Players were STOCKS - Live Coding Data Anaylsis Stream
Data Visualization BATTLE!
LIVE CODING: Stocks & Sentiment Analysis
Progress Bar in Python with TQDM
Flight Cancellation Data Analysis
Synthetic Dataset Creation for Machine Learning - Blender and Python
The Ultimate Coding Setup for Data Science
Dataset Creation SPEED RUN - Live Coding With Python & Pandas
Data Wrangling with Python and Pandas LIVE
Forecasting with the FB Prophet Model

This video teaches how to create a dataset using Python and Pandas, specifically for SEC Filing data on Kaggle, and demonstrates various techniques for data retrieval, formatting, and analysis. The video also covers tools such as Beautiful Soup, Pandas data frame efficiency package, and tqdm. By following this video, viewers can learn how to build efficient data pipelines and perform data analysis tasks.
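As a rough illustration of the retrieval step, the sketch below collects every `.zip` link from a page of HTML. The stream used Beautiful Soup for this; the sketch swaps in the standard library's `HTMLParser` so that it runs without extra installs (with Beautiful Soup, the links would typically come from `soup.find_all("a", href=True)`). The sample HTML, the base URL, and the function and class names are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ZipLinkParser(HTMLParser):
    """Collect absolute URLs of <a href="..."> links that end in .zip."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".zip"):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def find_zip_links(html, base_url):
    parser = ZipLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Made-up page content standing in for the real download page.
SAMPLE_HTML = """
<table>
  <tr><td><a href="/files/example/2022q3.zip">2022 Q3</a></td></tr>
  <tr><td><a href="2022q2.zip">2022 Q2</a></td></tr>
  <tr><td><a href="/about.html">About</a></td></tr>
</table>
"""

zip_links = find_zip_links(SAMPLE_HTML, "https://example.com/data/")
```

Each resulting URL could then be downloaded and unzipped into its own folder before the files are read with pandas.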

Key Takeaways

Import necessary libraries
Create a Kaggle notebook
Use Beautiful Soup for web scraping
Download and unzip files
Read and format data
Use Pandas for data manipulation
Optimize data processing
Use data frame efficiency package
Visualize data
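The read, combine, and optimize steps above can be sketched along the lines of what the stream settled on for the large tables: combine files one year at a time, rather than all at once, so that memory stays bounded. By the stream's own rough estimate (about 250 MB for one quarterly file, 13 years, 4 quarters per year), everything together would be on the order of 13 GB of raw text. The sketch assumes each unzipped quarter lives in a folder named like `2020q1` containing tab-separated files such as `num.txt`. The function names and output file names are hypothetical.

```python
from pathlib import Path

import pandas as pd


def files_for_year(root, year, table):
    """Paths of one table's quarterly files for a given year
    (assumed layout: <root>/<year>q<n>/<table>.txt)."""
    return sorted(Path(root).glob(f"{year}q*/{table}.txt"))


def combine_year(root, year, table):
    """Concatenate one year's quarterly files, or return None if there are none."""
    paths = files_for_year(root, year, table)
    if len(paths) == 0:
        # pd.concat raises "No objects to concatenate" on an empty list,
        # so years without files are skipped instead.
        return None
    frames = [pd.read_csv(path, sep="\t", low_memory=False) for path in paths]
    return pd.concat(frames, ignore_index=True)


def combine_all(root, out_dir, table, years=range(2009, 2030)):
    """Write one parquet file per year that has data; return the paths written."""
    written = []
    for year in years:
        combined = combine_year(root, year, table)
        if combined is None:
            continue
        target = Path(out_dir) / f"{table}_{year}.parquet"
        combined.to_parquet(target)  # requires pyarrow or fastparquet
        written.append(target)
    return written
```

A call such as `combine_all("files", "output", "num")` would then produce one parquet file per year. Wrapping `years` in `tqdm(...)` adds the progress bar used in the stream.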

💡 Using the right tools and techniques can significantly improve the efficiency of data processing and analysis tasks.

