Resampling - p.9 Data Analysis with Python and Pandas Tutorial
Key Takeaways
This video tutorial demonstrates resampling techniques using Python and Pandas to smooth out data by removing noise, with a focus on time series analysis and data visualization. The tutorial covers various methods of resampling, including increasing or decreasing granularity, averaging or summing data, and calculating open, high, low, close values.
Full Transcript
What is going on, everybody? Welcome to part nine of our data analysis with Python and Pandas tutorial series. In this part, what we're going to be talking about is resampling. The idea of resampling is that it lets you change the sample rate of the data that you're looking at. So you can either increase the sampling or decrease the sampling, or basically increase granularity or decrease it. Now, if you decrease granularity, that's fine. If you increase granularity, you're not going to retrieve the old data that you used to have. I'll show you what I mean by that, if not in this video then probably in a few videos, when we actually utilize resampling in our data.
The idea is, let's say, like with stock prices. Stock prices come in intra-second. I mean, these are millisecond trades that occur, but not everybody needs access to millisecond data. And not only that, what if you try to graph five years of millisecond data? You're going to crash pretty much every system you have your hands on. So instead, what you do is you resample that data. You can either write your own program to do it or use something like Pandas.
Let's say you've got millisecond data, but you resample it to one day. What it's going to do is take all the prices in that day, all those milliseconds, add them together, and average them, usually. You can do other things; I'll talk about that in a moment. Then it'll give you that price, so that'll be the one-day price based on all the prices that occurred that day. Now, you can also resample and, instead of doing an average of everything in that time period, you can do a sum, and you can also do an open, high, low, close value.
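To make the mean, sum, and open/high/low/close options concrete, here is a minimal sketch on made-up data. The minute-by-minute `prices` series is a hypothetical stand-in for real trade data, not anything from the video, and the method-style calls (`.mean()`, `.sum()`, `.ohlc()`) are the form that recent pandas releases expect.

```python
import numpy as np
import pandas as pd

# Hypothetical intraday prices: one value per minute over two days (made-up numbers).
idx = pd.date_range("2024-01-01", periods=2 * 24 * 60, freq="min")
prices = pd.Series(np.arange(len(idx), dtype=float), index=idx, name="price")

# Bucket the minutes by calendar day; "D" is the daily alias.
daily = prices.resample("D")

daily_mean = daily.mean()   # average of every value in the day
daily_sum = daily.sum()     # total of every value in the day
daily_ohlc = daily.ohlc()   # first, highest, lowest, and last value of the day

print(daily_mean)
print(daily_ohlc)
```

Each result has one row per day, so 2,880 minute readings collapse to two rows.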
An open, high, low, close value tells you, in that period, what was the highest price, the lowest price, the starting price of that period, and the ending price of that period.
Anyway, let's go ahead and jump in. This is our starting code, just the same code from the previous tutorial, so you should already have this code pretty much typed out. If you don't have it, go and hit up the tutorial, part nine, here. There should be a link to it in the description. Head there and you can copy and paste the starting code. So if you're just trying to figure out resampling now, check that out.
So, we've got this HPI data. We're going to go ahead and leave the correlation stuff alone. I mean, we can keep the correlation there, but we don't really need that right now. And in fact, we don't need this either. We are going to graph, so we can leave the benchmark; we're plotting the benchmark here. Let's just uncomment basically all this graph stuff. We've got HPI data. We don't really need the benchmark. We can keep plotting HPI data. And in fact, actually, we're not going to plot all of the HPI data. Let's plot HPI data for the great state of Texas.
Okay, so we'll plot that data. But then what we're going to do is define a new column here. See, this data is sampled once a month: at the end of the month it's grabbing the housing price index. So what if we wanted to do something like this? We could say TX1yr equals HPI_data TX, and then we do .resample, and in here you throw the valuation for the resample. For example, if you wanted to resample hourly, you could put an H, and actually it would be a capital H. If you wanted to do daily, capital D, and so on. Although our data is monthly, so even if you did daily, it would just be a bunch of repeating numbers. So for yearly, we can do A for annual.
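The annual resample being typed out here can be sketched as follows. This is a rough illustration under a few assumptions: the `HPI_data` frame below is made-up data standing in for the tutorial's pickle, the names `HPI_data` and `TX1yr` are taken from the spoken description, and a `pd.offsets.YearEnd()` object is used as the rule in place of the string alias "A", since the spelling of the annual alias has changed in newer pandas releases. The pandas version in the video applies a mean by default; newer releases expect the aggregation to be called explicitly, as `.mean()` is here.

```python
import numpy as np
import pandas as pd

# Stand-in for the tutorial's HPI data: three years of month-end values (made up).
idx = pd.date_range("2000-01-31", periods=36, freq=pd.offsets.MonthEnd())
HPI_data = pd.DataFrame({"TX": np.arange(1.0, 37.0)}, index=idx)

# Resample the monthly Texas column to one value per year (labelled at year end),
# taking the mean of the twelve monthly values in each year.
TX1yr = HPI_data["TX"].resample(pd.offsets.YearEnd()).mean()

print(TX1yr.head())
```

The 36 monthly rows become three yearly rows, each stamped December 31 of its year.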
Now, you might be like, okay, well, how do we know all of these? Well, it just so happens there is a list here in the Pandas documentation, although this is really hard to find. I actually had to use Google; I couldn't even find it on my own looking through their documentation. I should have a link to this either in the description or in the text version of this tutorial. But these are all of the shorthand versions. As you can see, there's quite a bit, right? You've got the month end. You could do the beginning of the month, so with MS you could do that. We're doing yearly, so this is year end, and this is business year end. But then you could also do the start of the year if you wanted, and so on. You have a lot of options for how you want to resample data.
Now, we're just going to resample annually. The default for a resample is to use the mean, but if you wanted to change that, the keyword for that is how. Then you could do an equals, and for now we'll say mean. But again, you don't have to add that if you want to resample by the mean, because that's the default.
Anyway, we've got TX1yr. That's fine. We'll print out the head of that just to see it: TX1yr.head(). Now, we're plotting this data here. Let's go ahead and plot basically the same thing, only we're going to plot TX1yr instead. We're going to graph both of these just so we can see the difference. So let's go ahead and save and run that. That should pop up a graph for us. Sure enough, there it is. This was resampled by year.
Now, the red line would be the yearly one. We could have added labels and stuff here. If you do df.plot, it'll automatically add a legend and stuff for you, but if you plot one by one, it's not going to do that. So what you can do is, see how we have this line here with .plot? You can add a label. So we can say label equals monthly TX HPI.
And then down here we can say the label equals yearly TX HPI. And then we remove the legend. Actually, let's not remove the legend; we'll just add a legend, basically. And there you go. Although it's in our way. Location 4 is usually the bottom right, I believe, so you can do loc equals 4 for the legend. That'll move it out of the way, hopefully. Yeah. Okay, so you can see yearly is that red line, and the blue line is the monthly.
Now, an interesting thing. First of all, you might be like, wow, if we resample by year it looks like we have a predictive assessment here, right? Not quite. This is by year, and I thought the year was probably being marked right at the beginning of the year, like this is January 1986, but actually it's year end. Anyway, there's a lot of bias when you resample like that: the current year would be lagging, basically, and it is. So just keep that in mind. The point gets plotted only after the year is completely over, so it won't actually be a leading indicator. So this is kind of misleading. But as you can see, it follows very closely, because it is just a mean.
Now, one thing to note is you don't get any of these little squiggles. These squiggles are very interesting; they'll tell us something interesting about the housing market. For example, let's zoom into some of these squiggles. If you hover over the peaks, you can look down at the lower right, right down here, and see when these are occurring. That's occurring in June 2007, June 2008, June 2009, and so on. This pattern continues: pretty much every peak occurs in June. Can you guess where the troughs are occurring? Anyone? Right, January or December. Generally it's between December and January. So these are what we would call a cycle, for sure.
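The hover-and-read check described above can also be done in code. A rough sketch, assuming a monthly series with a datetime index: the `hpi` series below is synthetic, built with a yearly cycle on purpose so there is something to find, and it is not the tutorial's data.

```python
import numpy as np
import pandas as pd

# Synthetic monthly index: slow upward trend plus a yearly cycle that is
# highest in June and lowest in December (made-up data).
idx = pd.date_range("2005-01-31", periods=60, freq=pd.offsets.MonthEnd())
month = np.asarray(idx.month)
trend = np.linspace(100.0, 102.0, len(idx))
cycle = 5.0 * np.cos(2 * np.pi * (month - 6) / 12)
hpi = pd.Series(trend + cycle, index=idx)

# For each calendar year, find the date of the highest and lowest value,
# then keep only the month number of that date.
by_year = hpi.groupby(hpi.index.year)
peak_month = by_year.idxmax().dt.month
trough_month = by_year.idxmin().dt.month

print(peak_month.tolist())
print(trough_month.tolist())
```

On real data the months will not line up this neatly, but counting how often each month shows up as a peak or trough gives a quick read on whether a yearly cycle is present.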
And it's very interesting. I mean, the entire housing market, every state, follows that rule pretty much every year: it's at its peak in June and at its cheapest in December. Now, why might that be? First of all, you've got the winter blues; probably no one's really thinking about moving. But the biggest thing is most likely that if you're moving and you've got kids, it's very difficult to move them in the middle of the school year. Whereas in the summertime, we usually have a nice break and we can get people moved. Anyway, it's very interesting to see that that's happening. It could also be for tax purposes. Buying a new house in December represents a little bit more of a risk sometimes. There are a lot of reasons for that, and I'm just guessing at a few. I'm sure there's plenty online. This is clearly a trend, and I'm sure you can find a lot of information on it.
Anyway, closing that out. That's resampling. I do want to show you guys some of the other resampling that we can do, just real quick. Instead of annual with how mean, you can do how OHLC. And then, let's see, we could plot it. I think that plot will work out, actually. The label is going to throw us for a loop. We'll see if that plots. I'm not sure, but I think it should. Yeah, it did. The plot's pretty ugly, but basically it's plotting the open, the high, the low, and the close for that exact period. And sure enough, you can see that's occurring.
But what's most interesting is here, right? This is the new resampling, and these are the opens. These are all percent changes. So here, the lowest percent change for that year is 52. The highest is 75. That's a significant change. The low and the close is that 97. So really the high and the close are identical. That just means that for the year, prices went up, as opposed to a year where the high was much higher than the close.
In that case it was like, well, during the year, probably in June, the price was as high as it's ever been. Anyway, that's resampling.
Here we were resampling kind of up. Generally, you're always going to resample to a time frame that's larger than what you currently have, because if you resample down to a smaller time frame, or what some people want to call supersampling, you're not going to somehow create more data. It's not going to be able to just magically create more data for you, or somehow guess what the data would have been in between. So when you resample, generally you're increasing the time frame.
But as you can see by looking here, we've got resampling down to nanoseconds. Okay, microseconds, sure, but nanoseconds? I mean, that's crazy. So depending on what kind of sensors you're using, or what kind of data you're dealing with, you might actually find that you resample down there. Also, if you need your data to be perfectly uniform, i.e. a measurement for every, let's say, nanosecond, this is a way that you could kind of cheat the system and do that. You can also use filling and stuff like that. But it is a way to take unstructured data that you really want to be structured in a timely manner and make that happen. So that's another way that you can use it, besides simply trying to decrease the size of your data without forfeiting too much of the quality.
Anyway, that's it for resampling. We'll wind up using resampling as a sort of hack in a couple of scenarios here, so you'll see what I mean later on. Resampling is really useful for all kinds of reasons, so it's definitely something you must know for doing data analysis work with Pandas. If you've got any questions or comments, leave them below. Otherwise, as always, thanks for watching. Thanks for all the support and subscriptions. Till next time.
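The point about not creating data when moving to a smaller time frame can be seen in a short sketch. It assumes three made-up month-end readings; in recent pandas releases, resampling these to daily leaves the in-between days empty (NaN) unless a fill is requested, and a forward fill is what produces the run of repeating numbers.

```python
import pandas as pd

# Three made-up month-end readings.
idx = pd.to_datetime(["2000-01-31", "2000-02-29", "2000-03-31"])
monthly = pd.Series([100.0, 102.0, 101.0], index=idx)

# A daily grid adds rows, but no new information.
daily_raw = monthly.resample("D").asfreq()    # days in between are NaN
daily_filled = monthly.resample("D").ffill()  # repeat the last known value

print(daily_raw.head(3))
print(daily_filled.head(3))
```

The filled version gives a perfectly uniform daily series, which is the "cheat" mentioned above, but every value in it still comes from one of the three original readings.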
Original Description
Welcome to another data analysis with Python and Pandas tutorial. In this tutorial, we're going to be talking about smoothing out data by removing noise. There are two main methods to do this. The most popular method used is what is called resampling, though it might take many other names. This is where we have some data that is sampled at a certain rate. For us, we have the Housing Price Index sampled at a one-month rate. We could sample the HPI every week, every day, every minute, or more often, but we could also resample to every year, every 10 years, and so on.
Another environment where resampling almost always occurs is with stock prices, for example. Stock prices are intra-second. What winds up happening though, is usually stock prices are resampled to minute data at the lowest for free data. You can buy access to live data, however. On a long-term scale, usually the data will be sampled daily, or even every 3-5 days. This is often done to keep the size of the data being transferred low. For example, over the course of, say, one year, intra-second data is usually in the multiples of gigabytes, and transferring all of that at once is unreasonable and people would be waiting minutes or hours for pages to load.
Using our current data, which is currently sampled at once a month, how might we sample it instead to once every 6 months, or 2 years? Try to think about how you might personally write a function to perform that task. It's a fairly challenging one, but it can be done. That said, it's a fairly computationally inefficient job, but Pandas has our backs and does it very fast.
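One possible answer to that exercise, as a minimal sketch in plain Python rather than the way Pandas does it internally. The function name `resample_mean` and the sample values are hypothetical. It groups (date, value) pairs into fixed blocks of months, averages each block, and labels each block by its first month, which differs from the period-end labels Pandas uses by default.

```python
from collections import defaultdict
from datetime import date


def resample_mean(points, months_per_bucket):
    """Average (date, value) pairs over blocks of `months_per_bucket` months."""
    buckets = defaultdict(list)
    for day, value in points:
        # Count months since year 0, then integer-divide to get a block number.
        month_number = day.year * 12 + (day.month - 1)
        buckets[month_number // months_per_bucket].append(value)

    result = {}
    for bucket, values in sorted(buckets.items()):
        # Convert the block number back to the first day of its first month.
        first_month = bucket * months_per_bucket
        start = date(first_month // 12, first_month % 12 + 1, 1)
        result[start] = sum(values) / len(values)
    return result


# Made-up monthly values 1..24 covering 2000 and 2001.
monthly = [(date(2000 + i // 12, i % 12 + 1, 28), float(i + 1)) for i in range(24)]

print(resample_mean(monthly, 6))   # one value per half-year
print(resample_mean(monthly, 24))  # one value per two-year block
```

Blocks are counted from year 0, so two-year blocks start on even years; a fuller version would let the caller choose the starting point.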
Sample code and text tutorial for this video: http://pythonprogramming.net/resample-data-analysis-python-pandas-tutorial/
http://pythonprogramming.net
https://twitter.com/sentdex