Applied Data Science With Python Full Course 2026 [Free] | Python For Data Science | Simplilearn

Simplilearn · Intermediate ·📊 Data Analytics & Business Intelligence ·2mo ago

Key Takeaways

Covers applied data science with Python for data analytics

Full Transcript

Hey everyone, welcome to this course on applied data science with Python. Today data is everywhere, from shopping apps and social media to healthcare, finance, and even entertainment. Data helps businesses make smarter decisions every single day. But raw data alone is not enough. What really matters is knowing how to clean it, analyze it, visualize it, and use it to solve real-world problems. And that is exactly what this course is all about. This course is designed to help you move beyond basic Python and start using it in practical data science workflows. You will learn how to work with data sets, create meaningful visualizations, build machine learning models, process text data, and understand network relationships using Python tools that are widely used in the industry. First, we will learn how to clean, organize, and prepare data using Python libraries like pandas and NumPy. Next, we will explore data visualization using Matplotlib and Seaborn to turn numbers into clear insights. Then we'll move into machine learning with scikit-learn and understand how predictive models are built and evaluated. After that, we will look into text mining using NLTK to work with unstructured text data. And finally, we'll explore social network analysis with NetworkX to understand connections and relationships in data. Also, if you are interested in mastering the future of technology, do not forget to check out the Professional Certificate Course in Generative AI and Machine Learning, which is the perfect choice for you. This is offered in collaboration with the E&ICT Academy, IIT Kanpur, and it's an 11-month live interactive program providing you hands-on expertise in cutting-edge areas like generative AI and machine learning, with tools like ChatGPT, DALL-E 2, and even Hugging Face. You will gain practical experience through 15+ projects, integrated labs, and live master classes delivered by esteemed IIT Kanpur faculty. Alongside, you'll earn a prestigious certificate from IIT Kanpur.
You'll also receive official Microsoft badges for Azure AI courses and career support through Simplilearn's Job Assist program. So what are you waiting for? Hurry up and enroll now. The course link is mentioned below. Now before we get started, here's a quick quiz question for you. Which Python library is commonly used for data cleaning and manipulation? Your options are pandas, Seaborn, NetworkX, NLTK. Let me know your answers in the comment section below. >> So what is data science? If we had to put a definition to it, it is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to, and this is the big part of it, derive meaningful insights from structured and unstructured data. Obviously data is in the name data science. So our goal with data science is to derive meaningful insights from data. That can involve many things. That can involve building models, which we will learn how to do later on when we get into machine learning. That can involve building out visualizations. That can involve doing hypothesis testing. That can involve many different things. But we're deriving some type of insight, or finding out something interesting about the data that we have, through different methods. I always like to say it's the science of working with data, which is kind of a weird way of saying it because that's obviously the name, data science, but it truly is the science of extracting insight from data. There's a lot involved in it that blends a lot of different disciplines together. That's why we say it's multidisciplinary: we bring together multiple aspects of math and stats, computer science, and domain expertise to be able to derive those insights.
So um there's you know things there there's ideas and concepts borrowed from uh uh science like like doing hypothesis testing. There's um obviously math and stats a lot borrowed from there. Um there's a lot borrowed from visualization and analysis that we that we use in data science but we kind of blend that with technology right with computer science. So being able to use Python is a big deal. That will be our our language of choice for doing anything data science. That's why we kind of reviewed it in the beginning. So we'll certainly use Python. Um we'll certainly use different tools within Python to process data. Those are data processing tools. Um so we kind of blend those together to form the field of data science. So th those scientific methods along with um tools working with data in from technology like Python blend those together we kind of get data science. So where do we see data science today? Um there's a a lot of different um applications and I'll try to describe some of them to you which is uh for example like wearable devices. Uh so think about like Fitbits or Apple watches. They're always crunching data from their sensors, right? So there's some type of biometrics that are captured and sent over uh the internet essentially um to uh basically um allow us to do some some type of uh analysis or some type of derivation of insight from that data. uh and then we can kind of visualize that with some graphs or some meaningful metrics um so that the person wearing that device can make some sort of decision. So there's some type of insight derived some and usually some type of algorithm being applied to that. Maybe a model's being built working with the data that's captured from the wearable device or we're just plotting that data or we're summarizing it in some way but making it really useful to the to the end user to make a decision off of. So we're deriving insights there um from all of the data collected in the the wearable device. 
So that's kind of one application. Um, search engines use data science to uh personalize results or offer recommendations as people type in their their queries. So um essentially like you have suggestions, right? like and these could be based on your previous browser history. Um all kinds of data like your cookies or your um what's trending in the in the world or like your region. Um though you probably have noticed this, right? When you're searching on on search engines, you get these suggestions. Those recommendations are kind of powered by data science, right? So data is going into that to derive some insight that hey this is what we should s should suggest and that's uh it gets surfaced there. One of the things that we'll look into down the road when we get into machine learning which is the next course after this will be recommendation systems. So how do we actually build those models that do recommendation? But in order to build those you really need data. Um uh so that's where we got to start is working with data to to um get in a position to model off of it. Uh finance we see um usage of of data for instance um there could be like some type of model that's built to uh determine um a loan decision. So there could be the application and then there can be additional data that's gathered um via the details in the application um and then that all that data can be brought together and kind of analyzed through the use of a model that could predict yes we should give this loan or no we should deny this loan. Um so again you're deriving some type of insight from that data um to make a decision right to make some sort of loan decision in finance or it like we see a lot of um fraud decisions made as well like is this transaction fraud or not fraud that's another big use case of data science so based on the data of transactions are they fraudulent or not fraudulent Okay, a lot of other applications. Those are just a few. 
Um, and as we go along throughout this course, we'll really study a bunch more and get into some more use cases as we go along. This is just a preview. Um, but what I want to go to next is kind of the uh process by which we attack data science problems. So we know that um we should be deriving insights from data. That's what data science does. And the question is like how do we go about doing that? What is a good systematic way of doing it? So we're going to describe that in the process. Uh good question. Differences between data science versus data analytics versus data engineering. >> Yeah. So um the difference with data engineering, let me start with data engineering. Data engineering is mostly centered around building like pipelines and building systems that can uh collect data. So it's more about um engineering systems to like collect data and house it and store it and facilitate data um transfer and data availability. So it's more about working with the data to make it available and collect it. Um, and it's it's a lot more engineering heavy as the name suggests because you're building pipelines and building systems that will um like scrape data from sources or pull data from APIs or do those kind of things that um will collect data for you. Um that's a little bit different than data science. Data science remember is going to be generally like building models or doing visualization or both derive some sort of insight find some sort of pattern or find something interesting about the data that's not already not already or hasn't is not readily known um just by looking at the data. Um so with that being said there's a lot of overlap of data science data analytics but I would say analytics is a little bit more um static in the in the sense of a lot of analytics are just doing basic rule systems or basic visualizations. Uh data science usually goes a step further and builds more sophisticated models to make some sort of prediction make some sort of decision off of. 
So data science generally involves more modeling than you would see in data analytics, whereas data analytics goes a little bit deeper on things like visualizations, experimentation, and statistics gathering. Does all that make sense? Data science leans more heavily on the modeling side to derive insights. That's a big part of how you derive insights: by building a model. Yeah, that's a great question. I know there are a lot of overlapping terms. Any other questions? Okay. All right. Let's talk about the process. So, the process is outlined roughly here, but I'm going to take us through every single step. If I had to break this down into a handful of key steps, it mostly starts here, with the initial problem formulation, which involves collecting data. So usually you have problem formulation. You need to know: do we want to build a recommendation system for movies? Do we want to figure out if transactions are fraud or not fraud? We're coming up with some problem that we want to solve and then going out and collecting data that might help us analyze that problem further. We'll talk about that. Then comes what I would call the data preparation phase. Data prep means that we are preparing our data for modeling, with the goal in mind of actually building some sort of model to tell us something, to help us make a decision or derive some insight. Right? We will do a couple of steps of data preparation in order to get ready for modeling; I'll describe what they mean in a moment. Then comes the actual modeling phase, which involves building, training, evaluating, all of that.
So there comes the modeling phase, and then comes the actual deployment of the model, meaning that we use it in the real world: we integrate it into an actual system so that those decisions can actually be made and used by an end user. Okay. I believe you can call it a data science process even if no model is built at step four, five. You can. I think that's fair to say, even if you're not building a model, so you're thinking just steps one, two, three, four. Yeah, that's true. Or even just one, two, three. I think that's fair. But most of the time in this course, or in this program, we're really going to be building models. So I want us to have this in mind: even if we stop here and don't do any model building, what we have our eyes on is the ability to model if we want to. Okay. But it's a fair point that you could imagine just getting your data together as a kind of data science process. Is feature engineering the feature that will use the... So, I'm going to describe what feature engineering is in a moment. I'm going to go into that in further detail in the next couple of slides. So just hang on when we say feature engineering. We'll talk about it. I was thinking for A/B testing and prior, there's no model. Would you call this a data science project? That's up for debate. A/B testing sometimes falls under analytics and sometimes falls under data science. Yeah, sure.
We'll talk about doing A/B testing even when there is no model, so I think it's fair to say it's part of data science, which would kind of just be steps one, two, three, four. So yeah, I see where you're coming from. Traditionally A/B testing is more like experimentation and analytics. I agree it's not traditional data science, but we will cover it, because it's an important aspect of data science: the part of deploying a model and seeing what the results are would be like an experiment, right? We think of it as an experiment where I build some model and see what the results are on a different group than a baseline control group that's not exposed to that model. So it's important for us to know what it is. You could argue the actual process of doing A/B testing is not really data science, but it becomes relevant once you add in the context that people are usually interested in using it in conjunction with building a model and exposing it to users as an experiment. Does that make sense? Yep. Okay. Very good. All right. Let's go into each of these steps and talk a little bit more about them. Okay. So, excuse me. Usually things start with a problem definition, which is a goal or a question that will be addressed through collecting and analyzing and deriving insight from data. That's the very first step, and this is actually something that you would usually work on together with other colleagues. You may come up with this with a product manager or with another engineering group, something like that, to come up with a problem like: hey, we need to build a recommendation system for our users to get better recommendations for their movies, or for their shows, or for their products on our website.
Um so so there usually is going to be a question or a goal you know or hey we want to come up with a sales forecast for the next um three quarters uh given the data we already have for this quarter um or the previous quarters. So you know usually there's you have to start with a problem. So you have to start with a problem definition um that you want to you want to address. Um and and honestly that goes hand inhand with the next step which is once you have a problem in mind you then have to collect data around that problem. So you have to gather the relevant data sets um which could also involve working with external um partners to do that. So maybe you have to work with data engineers to help you go out and collect that data or make that data accessible. Um you may have to work with uh you know you may have to do it yourself and go and gather historical data that can help answer that problem. So it sometimes this is up to you. You have to uh go and and go out and collect that historical data or that data that's relevant to your problem. um that is totally possible but sometimes you're also working with external partners like a data engineer to make that possible to to go and collect the data but of course the key word here is relevant right the data needs to be relevant to the problem that you're solving and that can be a challenge so I know these are listed as the first two steps and they seem pretty straightforward but they can be the most challenging at times is defining the right problem and collecting the right data getting that available um is not always trivial. But in this course, we'll usually assume that we have these two things. We'll usually assume that yes, there's a problem we we know we want to address um for practice purposes. We have that. And then we also have data already available for us. We don't have to go out and fish for it and collect it and scrape it from somewhere. 
We'll assume that's already been done and we're just using the data that we have available. So in this uh in this course um this these two steps will usually be done for us uh mainly just for um so that we can practice all the other steps. But but in the real world that would usually be you know someone you're working together to formulate a problem with with external stakeholders and you're also gathering data either by yourself or with with the help of um maybe like a data engineer or someone like that. Okay. So again uh these two things we'll usually have uh um in this course which leads us to the next phase which is that data preparation phase. So this is where I said um we need to clean the data and explore it a little bit. So usually this process can take some time. So this can be a very timeconsuming process. But the typical tasks we're doing here are getting a handle on any missing data. So figuring out a strategy to handle missing data, we'll talk about that. Um how to identify and handle outliers, we'll talk about that. U maybe duplicate data, inconsistent data that that um doesn't make sense, we can identify that. We can get rid of it or have some strategy to handle it. Um, so getting a handle on cleaning up our data is going to be an important task and that might take some time. That might take some time to work with our data and kind of do some we have to learn how to do the proper code, clean it up, get it to a good state. And then once we do that, we can start to explore it a bit to gain insights. And this is where we'll usually use um visualization at our at our disposal. So maybe we'll build some graphs to quickly visualize, get some patterns, see what see what the data looks like, um figure out if there's any relationships in the data. Um that's where we'll learn about visualization that will really help us. So when we say explore, we're mostly talking about building um graphs to help us kind of tell a story about what we see in that data. 
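The cleaning tasks named above (missing data, duplicates, outliers) can be sketched in pandas as follows. This is a minimal illustration, not the course's own code: the column names and values are hypothetical, filling with the median is only one possible missing-data strategy, and the 1.5 × IQR rule is a common rule of thumb rather than something the instructor prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one missing value, one duplicate row, one extreme value.
raw = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 4, 5],
    "weekly_sales": [120.0, 135.0, 135.0, np.nan, 128.0, 9000.0],
})

# 1. Duplicates: drop rows that repeat exactly.
deduped = raw.drop_duplicates()

# 2. Missing data: one possible strategy is to fill with the column median.
median_sales = deduped["weekly_sales"].median()
filled = deduped.fillna({"weekly_sales": median_sales})

# 3. Outliers: keep only values within 1.5 * IQR of the quartiles (assumed rule of thumb).
q1 = filled["weekly_sales"].quantile(0.25)
q3 = filled["weekly_sales"].quantile(0.75)
iqr = q3 - q1
in_range = filled["weekly_sales"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = filled[in_range].reset_index(drop=True)

print(clean)
```

Whether to drop, fill, or keep a suspicious row depends on the problem, so each of these three steps is a decision to revisit rather than a fixed recipe.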
Um assuming that it's been cleaned up, right? Assuming that we have it cleaned up, we've removed all our missing values, outliers, inconsistent uh values. Uh we now have a clean set of data, we can start to explore it. So usually um doing visualizations. It could also be it's not only visualization, but it could also be summaries. So maybe it's really useful to tell a story like what is the average for all these users or what is the average sales for the last few weeks? um what is the median sales like those kind of statistics might be really useful summaries to tell a story about the data. Okay. Okay. So we have our steps here. Problem definition, data collection, cleaning exploration. First three steps. Any questions about those first three steps? Good. Okay. All right. And again, we're going to like as we go along, we'll we will we will deal with uh we will deal with these um problems. We'll do have actual examples that will deal with this process. So we'll see this process from end to end many times as we get into our examples later on. Uh bronze, silver, gold. So usually those are like data engineering terms to refer to uh this the basically how clean the data is. Um so like bronze is kind of like the rawest form. Um it it's usually like data that has not been aggregated in any way. It's usually pretty raw. hasn't been cleaned up in any way. Silver might be cleaned up but not really aggregated. So silver might be like we've removed missing values, we've done some um we've removed outliers. Uh so silver is a bit cleaned up but it's still kind of raw. And then gold would usually be like our final transformations have been applied to it. Like maybe we've done some averaging um we've done some transformations. So usually th those are terms that you see to refer to the different like stages and of quality. Gold being like the highest quality like the final data set. Yeah. Yeah. Model will run on gold. Yep. Yes, that's true. 
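The bronze, silver, and gold stages described above, together with the mean and median summaries, can be sketched with a small pandas example. The data and column names are hypothetical, and dropping rows with missing values is just one assumed way of getting from bronze to silver.

```python
import numpy as np
import pandas as pd

# Bronze: raw, unaggregated records (hypothetical values).
bronze = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b", "b"],
    "amount": [10.0, 20.0, np.nan, 5.0, 5.0, 20.0],
})

# Silver: cleaned but still row-level (here, rows with missing amounts are dropped).
silver = bronze.dropna(subset=["amount"])

# Gold: final transformations applied, e.g. per-user averages and medians.
gold = silver.groupby("user")["amount"].agg(["mean", "median"])

print(gold)
```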
We don't really, by the way, we don't really use those terms too often. I think you see those terms a lot in like data engineering. Don't really see them that much here just because the assumption is that we're always going to clean up like our models aren't going to be good unless we get to a gold state, right? Unless we clean up our data, unless we do the right transformations, that makes sense. Like our models really aren't going to be useful until we reach that state. So, our we're always going to be pushing to like clean up and get to a good state. Okay. So, we have one, two, three. Let's look at the next few steps. So once our data is clean and we've explored it a little bit, this is step four which is feature engineering. So this is the other kind of data preparation step that I talked about. So feature engineering, what is that? It is a um creation or transformation of new features. So you might ask what is a feature? A feature is just a variable that is an input to a model. Okay, a feature is just an input. So I want you to think of like in an Excel spreadsheet a feature is something like a column. It's like a it's a independent variable like a column that we would use to build a model off of like that will be one input. We would call it a feature that's going into the model uh as an input. Okay. So feature engineering is the process of kind of building new features and those can be by doing simple transformations like maybe we do some scaling like dividing by 10, multiplying by 10. Those are simple scaling we can do to features. Um we could do more complex transformations like doing a linear transformation to it. Um we could take a square root, we could take a logarithm, we can take an exponential. um many different transforms we can do to our data to create new features. Um and and sometimes that makes sense to do that. 
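The transformations listed above (scaling, a linear transform, square root, logarithm, exponential) might look like this in pandas and NumPy. This is only a sketch: the `income` column and the constants are hypothetical, chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical input column; the name "income" is only for illustration.
df = pd.DataFrame({"income": [100.0, 400.0, 900.0, 10000.0]})

df["income_div10"] = df["income"] / 10            # simple scaling
df["income_linear"] = 2.0 * df["income"] + 1.0    # linear transform a*x + b
df["income_sqrt"] = np.sqrt(df["income"])         # square root
df["income_log"] = np.log(df["income"])           # natural logarithm (needs x > 0)
df["income_exp"] = np.exp(df["income"] / 10000)   # exponential of a rescaled value

print(df)
```

Each new column is a candidate feature; whether it helps is something to check once a model is trained on it.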
Sometimes we don't need to do that much feature engineering, but that's something we will get a feel for as we start to do examples is like when does it make sense to do feature engineering and when when do we not have to. Um so feature engineering will be more of an art than a science honestly. And um we're going to do plenty of examples where we do feature engineering to see what kinds of transformations we typically will do. Uh it's part of the current data. Yeah. So all everything we're talking about will be part of the current data we're working with. Is it always performed? No. uh we don't necessarily always do feature engineering. Um but it is a step like we can and we should evaluate if we should. Okay. So will we always do it? No. But we have the ability to do it and it is a it's a step worth calling out because it can be very valuable to do. So again, we haven't learned how to do that yet. So it may not make that much sense to us, but I'm calling it out as a very important step in the process. And as we start to do examples later down the road, um we're going to come back and spend some time on feature engineering because it is important step. It is an important step often to do it. Okay. Yeah, it's I I see how it can be a little confusing. Yeah, feature is in data science, feature is just a variable. It's like an independent variable that's uh an input to the model. Yeah, it's a little bit different of a meaning than like a feature like a which is kind of a um a new piece of software you're adding to existing software, right? little bit of different um terminology there. Okay, so this was step four and this was um data prep. So draw a line here because this is all like the data prep and then these couple of steps here are all modeling oriented. 
So once we have our data prepared, we've cleaned it up, we've done exploration to determine what we should keep, we've um done some engineering to do some scaling or transform the features that we have, we can now build a model. And so this will all be um something we will learn in our next uh we will learn how to build models in our next course on ML. But just calling it out as you know that would be the next natural step is once we have our data cleaned up to derive an insight from it we may want to build a model off of it. And so that will involve um kind of defining a model training it on this data that we've prepared um as as step number five. And then step number six will be kind of an evaluation step of um determining if we uh have a good enough model by evaluating it. You know this, by the way, this process isn't necessarily linear in the sense that we may iterate here and go back and um you know, repeat these steps. So, we may go back and forth and keep repeating these um by building a new model, determining if it's good enough, repeating and repeating and repeating that process. Um and we may even go all the way back up to here and build some new features if if the model isn't doing that well either. um that may be possible. So by no means is this process always linear. We may repeat especially this model training and building and evaluation steps. We definitely can repeat these back and forth um until we kind of converge on a good enough model. That's something we'll discuss when we get into modeling. I don't want to go into details now but that is suffice to say like we can iterate those uh steps quite a bit and we can spend a lot of time on it for sure as a as a data scientist. Okay. And then oh by the way there there's one more. So there should be a step here on deployment which is kind of mentioned here but I would argue it's its own it's its own step but once we have the model um done then we can kind of deploy it. 
Um now deployment means a lot of different things. There's a lot of different ways to do that. Um we won't get into that till much much later in in the program. Um it's it's not really going to be a focus for us at the moment. We're mainly going to focus in on all of the data preparation steps and then all of the modeling and then leave this part till kind of the very end of the program. Okay. But it it is important part like we need to make you once we build a model we do want to have it be useful in the real world. So we need a way for it to be integrated into existing you know existing systems which can be different there different ways to do that. Um, what is the percentage of the model has to see for Oh, no. That's not a silly question at all. Um, it it's really like your tolerance for uh if you think it's good enough. So it I've for instance I've worked with people um where we've deployed models that have been like 70% accurate and that's okay because we just want to get something out there to to test with and to um have results. So yeah like the the target percentage is is different depending on the problem. Um that by the way that's something we'll talk about when we get into our next course on models and evaluation because it turns out there's different ways to evaluate um the accuracy. Like sometimes we care more about it's not about overall accuracy but more about like are we limiting our false positives versus false negatives because sometimes a false positive is going to be more costly than a false negative or vice versa. So like a false negative may be way more costly than a false positive. So we're going to have different ways to evaluate uh our models and that can lead us to different different thresholds. But like I I it doesn't have to be perfect. I've seen a lot of models deployed at the like 70% range 60% range. Um that's okay. it. Mainly the reason to do that, by the way, is to get something out there so that you can iterate on it, right? 
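To make the point about accuracy versus false positives and false negatives concrete, here is a small plain-Python sketch. The labels are made up for illustration (1 = fraud, 0 = not fraud); the course covers proper evaluation metrics later.

```python
# Hypothetical labels: 1 = fraud, 0 = not fraud.
y_true = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud correctly flagged
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate correctly passed
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate flagged as fraud
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud that was missed

accuracy = (tp + tn) / len(pairs)
print(f"accuracy={accuracy:.0%}, false positives={fp}, false negatives={fn}")
```

Two models with the same overall accuracy can split their errors very differently between false positives and false negatives, which is why a single accuracy number is not always enough to decide whether a model is good enough to deploy.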
You get something into production and then you can iterate on it after the fact. So you don't want to spend forever waiting for it to be perfect. Okay. All right. Any questions about that process? I know we haven't covered all the details of it, but just remember for now: we formulate our problem, gather data, do data prep, and then we do modeling. The data prep is going to be something we will focus heavily on in this course. For the moment, data prep is something we'll really key in on in our last couple of lessons. Things like data cleaning, things like feature engineering. Any questions? Will we cover precision and recall? Yeah, we will, just not now. We'll cover that later in machine learning when we get to model evaluation. There's really no need to cover it now. We're not building any models yet. Okay. All right. So, let's circle back around to Python. I know we spent some time on it already, but just to reiterate, Python will be our friend here when we're doing data science. It's going to be the preferred programming language for anything data science, and that's true in the industry. Python is widely used mainly because it has so many great packages to help us work with data, namely NumPy and pandas, which are the first two we will look at. Then it has many others to help us build models, like scikit-learn, which we'll get familiar with, and others to do visualization that we'll study. So it basically has packages that do most of the tasks we're interested in doing. That's why we'll stick with Python. It's really great for data science. We've talked about this before. People prefer to use Python because it's open source and interpreted, and it has so many great packages that are oriented toward data science and can help us do data science really easily.
A lot of people used to use R to do data science, but there's been a shift over towards Python because of its flexibility. Python can integrate with other systems pretty easily, whereas R is more difficult to use. R is another language, a scientific analysis language. It's used a lot in statistics, and a lot of statistics people like using R, but in my experience data science is almost exclusively done in Python. You don't see R too often. I've really never seen it. I've only seen Python in the industry, so no worries about R. Historically, R has been around for a while, but Python is by far the most used data science language I've seen, for sure. Okay. So, I want to briefly tell you about some of the Python packages that we are going to study in our course and use to do data science. I just want to briefly talk about them, and then of course we're going to have a couple of lessons dedicated to going into those, like NumPy and pandas and all the visualization libraries. So the first one is NumPy. NumPy is short for Numerical Python. That's why it's called NumPy. It is a Python package for scientific computing using the array structures that NumPy provides, and so many things are built off of NumPy arrays and the ability to operate on them. NumPy came around and created these multi-dimensional arrays, which are essentially like matrices, along with a lot of different computing tools around those matrices and arrays that so many other packages are built off of. So, we're going to learn pandas. Pandas is built off of NumPy. So is Matplotlib, which is for plotting, and so many other packages are built off of NumPy.
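As a small illustration of the multi-dimensional arrays described above, here is a minimal NumPy sketch. The values are arbitrary and chosen only to show the array shape, element-wise operations, and a matrix product.

```python
import numpy as np

# A 2-D array behaves like a matrix: 2 rows, 3 columns.
a = np.array([[1, 2, 3],
              [4, 5, 6]])

print(a.shape)        # (2, 3)
print(a * 10)         # element-wise scaling, no explicit loop
print(a.sum(axis=0))  # column sums: [5 7 9]
print(a @ a.T)        # matrix product with the transpose, shape (2, 2)
```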
So it's a really foundational package for working with data, because data will be stored in NumPy arrays. The NumPy array is the foundational data type of NumPy, and so many things work with NumPy arrays. Okay. We're going to have a whole lesson dedicated to NumPy coming up next. NumPy is the first place we're going to start, just because it's so important for working with data. Its multi-dimensional arrays are so useful for storing and manipulating data, so it's pretty important. What is a Fourier transform? It is a transformation of data, basically a signal transformation. You go from something like a time series into a frequency series. It's used in signal processing. Analog to digital? It comes up in that kind of work, yeah; it's used in signal processing. Okay. So we'll start with NumPy, and we'll start with that today. Right after this lesson, we'll dive right into NumPy and start working with examples of NumPy arrays. Right after that is the second package we will study, the pandas library, which we'll spend a lot of time with. Pandas is a library built off of NumPy. It depends on NumPy, and it comes around and provides more structure for manipulating data. Like I said earlier, if you're familiar with Excel, pandas has a lot of functionality that mimics what you would do with a spreadsheet. Structured row-column data is what pandas excels at. So pandas is going to be a really fundamental package for us to manipulate data that's structured in a row-column matrix format. It uses NumPy under the hood to do the manipulation, but pandas provides its own data structures to put data into almost a spreadsheet format that we can manipulate. Okay.
So really, pandas is going to be really powerful for us to manipulate data, and we'll use it all the time. If anything, coming out of this course you will be pandas experts. Of course you'll learn more than that, but I think you'll come away as really good users of pandas, and NumPy for that matter, but certainly pandas. We're going to study NumPy first, and then we'll have a lesson dedicated to pandas right after NumPy. We'll have a lot more to say about it, but I just wanted to preview that it's a really important package in the data science ecosystem because it helps us manipulate structured data that's in a row-column format, like a table. Okay. Then another package is SciPy, which is short for scientific Python. It is another open-source library that's built on top of NumPy, so it uses NumPy arrays as its underlying data structures to do the manipulations. SciPy contains a lot of scientific formulas and scientific computing tools that we'll use, especially when we get into hypothesis testing. It contains things like z-tests, t-tests, distributions, things like that, so it's tailored for that. It also has things like the Fourier transform, and different linear algebra manipulations as well. So SciPy will be really useful when we get into our hypothesis testing and A/B testing. It has the distributions we'll need to do our tests, like a Student's t-test or a z-test; those are the kinds of things we'll use SciPy for. So it's a really important package that we'll see later when we do hypothesis testing. Another one that is going to be useful from time to time is the statsmodels package. It has a lot of statistics-oriented things, and it has some basic models in there like linear regression or logistic regression.
We will generally favor a different package for those kinds of models, but I'm just calling out that statsmodels does have some useful stuff when it comes to statistical testing. There are some tests, like chi-square tests or ANOVA tests, that we will borrow from statsmodels, so we will use it when we get into hypothesis testing as well. So these last two, SciPy and statsmodels, are two packages that we'll use when we get into A/B hypothesis testing. Okay, so that brings us to scikit-learn. This is going to be our primary package for doing machine learning. It's the one we'll build all of our models off of when we get into our machine learning course. We won't really use scikit-learn in this current course, but when we get into machine learning, it will be our go-to package. It is a fantastic library that's been developed over years to contain all the basic models that we would ever want to build. Scikit-learn is really awesome. It can build models for so many different use cases, and it's a really easy package to use, with a really nice interface. We will see that later on when we get into our next course on machine learning, but I'm just calling it out because scikit-learn is a very popular data science library. So when we get into our modeling, we will use scikit-learn. When we do our data prep manipulations, we'll be using NumPy and pandas. Finally, for visualization, we will be using a library called Matplotlib. It is the foundational Python plotting library, and it borrows inspiration from MATLAB. If you have ever used MATLAB plotting, it's very much inspired by that, hence the name Matplotlib, from MATLAB. It's going to be our main tool for building graphs.
So Matplotlib is a foundational library for building graphs, and many other libraries that do visualizations are built off of it. When we get into our visualization course, we will come back and talk a lot about Matplotlib and practice with it quite a bit. Excuse me. What is that course called? Machine learning. The next course is called machine learning. Okay. Another visualization library that we will lean on heavily is the Seaborn library. This is one that is built on top of Matplotlib. Matplotlib is kind of like NumPy: it's the foundation, and then a lot of things are built on top of it, Seaborn being one of them. Excuse me. Seaborn has better aesthetics than basic Matplotlib, and it also has more scientific and more interesting plots than the regular ones you get out of the box with Matplotlib. It has really interesting histograms, violin plots, and heat maps, and it can draw statistical error bars like confidence intervals. So it just builds better plots than basic Matplotlib. Matplotlib is very basic. It's really easy to use, and you can build a lot of plots with it, as we're going to learn. But Seaborn is really nice. It makes things aesthetically pleasing, so we'll also use Seaborn from time to time. It's another plotting library that we'll get some practice with when we get into visualization. And another one is Plotly. So we have Seaborn, which is built off of Matplotlib, and Plotly, which is its own separate library. Plotly's specialty is building interactive graphs. When you build a Plotly graph, it'll pop up in your web browser, kind of like Jupyter notebooks do, and you can click around in the graph and mark down points. You can zoom in, you can zoom out. Excuse me, just getting over a cold here.
So don't mind the coughs. But you can zoom in, you can zoom out, you can do a lot of interactions with Plotly. If you want to build an interactive graph, Plotly is a good package. It's a separate library rather than something built on Matplotlib, and we'll get some practice with it. So these three are what we're going to practice for visualization. Matplotlib is the basic foundation, Seaborn builds off of it, and Plotly covers interactive graphs. We'll get some practice with all three when we get into visualization. The rest of these slides just go through some plots that we will be building later on, so I wanted to briefly go through them to show you some of the different types of plots we'll do. The easiest kind is a line plot that connects different points. This would be for plotting something over time, like a stock price, a sales value over different quarters or weeks, or temperatures over time. It's a basic kind of plot, and we'll be able to build it no problem. We can even mark different points on it. That'll be easy to do with Matplotlib, Seaborn, or Plotly. Again, we'll show you how to build these with code later on when we get into visualizations; I'm just showing you the possibilities right now. Scatter plots: we'll do these, which have different points scattered across two axes. This is usually helpful to figure out whether the data is clustered together, or whether there are relationships between two variables, like if they tend to trend in the same direction, in the opposite direction, or if they're just distributed all over the place. So we'll be able to build scatter plots. That'll be helpful. Area plots show cumulative areas on top of each other. We'll be able to show that.
That'll be pretty easy to graph, maybe for tracking total sales over successive quarters, or showing the contributions of different categories. So we'll be able to do area plots and basic bar plots. All of these examples were built using Matplotlib, but they have equivalent versions in Plotly and Seaborn. We can even put grids in the background to assist the viewing, to give an idea of where the different points sit. That will be easy to do. Histograms: now, histograms are going to be extremely useful for us. We'll build histograms a lot because they help us visualize how data is distributed, which is extremely important to know. Is it distributed like in this picture, which is kind of like a bell curve, or is it flat? Does it have two peaks to it? Knowing this distribution will be extremely useful to us, so we will often build a histogram that looks like this. We can build pie graphs, which show different percentages. There may be certain situations where, when we're telling a story with our data, it makes sense to use a pie graph. That'll be easy to do. Again, these are all just examples of what's possible. We have to show you how to build these, and we will when we get into the data visualization lesson, which is what this note at the bottom says. Once we get into that lesson, we will show you the code to build these. Okay. So just to wrap up this first introductory lesson, we have shown you what data science is, which is the extraction of insight, deriving insight from data. And we have a bunch of different packages that are going to help us do that.
We also have a process that is going to help us do that, which is usually defining a problem, collecting data, doing data preparation, and then doing modeling after that. So those are the basic foundations. At this point, what we're going to do is go into NumPy. We're going to start with NumPy and then go into pandas after that. So we're going to start studying the packages that will help us do some of these different tasks in data science. Any questions at this point? Okay, I'm going to open up our next lesson then. If you notice, the next lesson, lesson three, is broken into several different notebooks, so we're going to be transitioning into notebooks for some of these. Do you have those notebooks, the lesson three notebooks? I can try to share them with you if you don't, or does anybody have the folder and want to share it for those that don't have it? It should be a collection of several notebooks. Yeah, let me download it. Okay, give me a moment. I will upload it. I have it right here. Okay, I just uploaded it, so you should have it. You'll want to open those notebooks. Again, you can open them wherever you want. You could do it in your own local Jupyter. You could do it in the lab environment. You could do it in Colab. I recommend Colab just because it's so easy to work with, so that's what I'm going to be using. All right. Can everybody see the screen? I'm on the first notebook. The 3.01 notebook is the one we're going to start with, the introduction to NumPy. If you have a moment, open that one up. Again, if you're working in Colab, you can upload the notebook.
So once you've extracted that folder, you can upload the notebook into Colab using File, then Upload notebook. That should work. Or if you have Google Drive, you can just upload that folder into your Google Drive, launch it from there, and it should open in Colab. That works too. Are people able to open the notebook? Again, it doesn't matter where you open it, just as long as you can run some of the cells, because we're going to be running them. Good. Okay. All right. So let's talk about NumPy. Remember, NumPy is the open-source library that is used for doing math and scientific computing on arrays. So the NumPy array object is the first thing we'll look at. The NumPy array object behaves very similarly to a list. We learned about lists in our previous course, and the NumPy array is very similar to a list. We can slice it like a list. We can access elements like a list. It's ordered like a list. But it's a lot faster to do mathematics with the array, and it comes with a bunch of built-in functions like mean and median, all these special things that we don't get with a list. For instance, Python lists do not have a notion of an average. You can't calculate the average of a list without doing a manual calculation. But a NumPy array has a mean function built in, and NumPy has a median function we can call on an array as well. So arrays are really advantageous to work with. Let's take a look at some examples of a NumPy array. In the first cell here, I want to point your attention to two things. One is that in order to use NumPy, we import it.
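To make the list-versus-array comparison concrete, here is a minimal sketch. The sample numbers are made up for illustration and are not from the course notebook; the import line at the top is the one the lecture walks through next.

```python
import numpy as np

values = [4, 8, 15, 16, 23, 42]  # made-up sample data, not from the notebook

# A plain list has no built-in average, so we compute it by hand.
list_mean = sum(values) / len(values)

# The same data as a NumPy array has the calculation built in.
arr = np.array(values)
array_mean = arr.mean()
array_median = np.median(arr)

# Slicing and indexing work the same way they do on a list.
first_three = arr[:3]
last_item = arr[-1]

print(list_mean, array_mean, array_median)
print(first_three, last_item)
```

Both averages come out the same; the difference is that the array carries the operation with it.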
Do you see how we import numpy and we do this thing called "as np"? This is called an alias. We alias numpy as np. We basically shorthand it to np, which is an industry standard. So anytime you're looking at code and you see np-dot-something, that is short for NumPy. In the industry, everyone is going to shorthand numpy to np. That's just what people do. So "as" is the way to alias an import, so that when we use NumPy in our code we don't have to type out the full word numpy; we can just type np. That's why you see np here, and really throughout our code. You see it all over the place. We are importing this package, meaning that we are going to use it in our code, but we are aliasing it to np. Most of these packages have a nice alias: pandas will have an alias, Matplotlib will have an alias, just to make it shorter. Do you have to import in VS Code? If you're using VS Code to run your notebooks, yes, you have to open it there. You want to open it in VS Code? Yeah, you're going to have to open the folder where it exists. But that's only if you're using VS Code. You don't have to, but you can if you want to. Okay. All right. So the next thing I want to point our attention to is building a NumPy array. Notice that we can build this NumPy array by calling np.array. So np.array builds a NumPy array object, and what we're passing in is just a list of data. We have a list of integers that we pass into np.array, which will build a NumPy array out of this list. NumPy arrays can be built out of lists, they can be built out of tuples, and they can be built out of other NumPy arrays.
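As a small sketch of the import alias and the three ways of building an array just mentioned (the variable names and values here are illustrative, not the notebook's own):

```python
import numpy as np  # "np" is the conventional alias for numpy

# Build arrays from a list, from a tuple, and from another array.
from_list = np.array([1, 2, 3, 4, 5])
from_tuple = np.array((1, 2, 3, 4, 5))
from_array = np.array(from_list)  # makes a copy of the existing array by default

print(from_list)
print(np.array_equal(from_list, from_tuple))  # same contents either way
```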
There are many ways to build a NumPy array, but the most common is to pass in a list, to convert a list into a NumPy array. So here we are building a NumPy array from a list, which is pretty typical. By the way, remember I said that I'm going to be writing a lot of comments. I'll share these notebooks in our Slack after the classes, but I encourage you to do the same thing: write comments in your notebooks to outline what the code is actually doing. How do you install NumPy? Inside of a cell, you can run a command like pip install numpy. Try running that inside of a Jupyter cell. Yeah, this command is not going to work for you because this is a generic Windows command. Are you inside of a notebook, Mariel? If you're inside of a notebook, just run this inside of a cell. It should install it. Yeah, that works too. You can open your terminal and do pip install. If you do that, you'll probably have to restart your kernel. No, Colab comes with NumPy already installed. That's another advantage of Colab: it already has that installed, so we don't need to worry about it. If it says "requirement already satisfied", which is what this is going to say if I run it, that means I already have it installed. Yeah, you already have it installed. Yep. So in Colab it already exists. Just one new cell and this command. If you can't get it to work in your VS Code, I really encourage you to use Colab as much as you can, just to get something that works, because NumPy is already installed in Colab. There's really nothing extra you need to do. Thanks, Tim. That'd be great. Okay. All right.
So going back to this, this builds a NumPy array off of a list. If we run this code, what's happening is we are building an array and storing it in this array variable, and we can print the array. Now look at what the array looks like. It kind of looks like a list when we print it, except the way we can tell this is a NumPy array is that when we print it, it does not have the commas. Notice that the data in there does not have the commas. That's because it's being treated as a NumPy array. At that point it's an array, not a list, so it looks slightly different. And you can even see, when we print out the type, that this array is actually a NumPy n-dimensional array, numpy.ndarray, which is the foundational data type of NumPy. So we have created a NumPy array, and now you can see what its type is: ndarray. Okay. Were you able to run this first cell? If you run it, it's not going to do anything but show this. You should see the array, and then you should see it printing out the type. You should see those two things. And do we see how this is creating a NumPy array? np.array is the function we use to build a NumPy array, and we're passing in a list of data to build that array. All right. Yeah, it might take a moment to start up the kernel. Any questions on this so far? What was ndarray? It's short for n-dimensional array. It's the NumPy array object type. So that is the data type we are working with now: a NumPy n-dimensional array. It's the generic NumPy array data type. You can see that because when we call type on it, we get numpy.ndarray as the type of the thing we created. So ndarray is short for n-dimensional array.
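Here is a minimal sketch of what that first cell does, using made-up values in place of the notebook's data, so you can compare how a list and an array print:

```python
import numpy as np

data = [10, 20, 30, 40]   # illustrative values, not the notebook's
array = np.array(data)

print(data)         # [10, 20, 30, 40]   (a list prints with commas)
print(array)        # [10 20 30 40]      (an array prints without them)
print(type(array))  # <class 'numpy.ndarray'>
```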
Okay, what I want to do is go to the next cell and talk about how we can create matrices, essentially multi-dimensional arrays. The array that we've created so far is just a one-dimensional array, because it only has one dimension to it. It basically has one list of data. But of course, a lot of times we're interested in working with multi-dimensional data, because that's typically what a spreadsheet has, right? Rows and columns. Just to give you an example, NumPy actually supports zero dimensions, which is basically a constant. A single number, a single value, is considered a zero-dimensional array. So if we built a NumPy array and just passed in a single integer, or it doesn't have to be an integer, it could be a float like 24.6, whatever it is, that would be considered a zero-dimensional array. But we've already built a 1D array, which is just a single flat list of values. So we've already seen that. It is a scalar? Yeah, we would call the zero-dimensional one a scalar. Yes, that's true. Scalar. Perfect. So a single list of values is going to be a one-dimensional array. Now what gets interesting is when we use a list that has lists as its elements. A list of lists is a 2D array. I want you to see how we're building a NumPy array out of a list, but look at what the elements of the list are. They're actually lists themselves. See how within this overall list, the first element is a list that is 1, 1, 1? That basically mimics a row. So you can think of each list as a row in a matrix. This two-dimensional array is really like a matrix, right?
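The zero-, one-, and two-dimensional cases can be sketched side by side. The values are illustrative, and the ndim and shape attributes used to inspect each array are the ones the lecture introduces a bit later.

```python
import numpy as np

scalar = np.array(24.6)            # 0-D: a single value
vector = np.array([1, 2, 3])       # 1-D: one flat list of values
matrix = np.array([[1, 1, 1],
                   [2, 2, 2]])     # 2-D: a list of lists, each inner list is a row

for name, a in [("scalar", scalar), ("vector", vector), ("matrix", matrix)]:
    print(name, a.ndim, a.shape)
```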
So this is an interesting use case where, of course, we could have more than just these two. We could have a third list here, like 4, 5, 6, and that would be valid as well. This would be a matrix that has three rows, and each row has three items in it, so we would think of it as having three columns, right? It's a 3x3 matrix, but it is two dimensions. It's a two-dimensional array, and it has rows and columns at this point. Let me ask you, do we see how this has two dimensions to it? It's a list of lists, so it has two dimensions. Do all the lists need to have the same length? What do you think? Let's try it. Let's try making this one shorter. Do you think this is going to be allowed? Is this what you're asking, can this one be shorter? Let's see. So yeah, this gives us an error. You're exactly right. This gives us an error that the dimensions do not match, so this is not allowed. But if I add the six back in, this should now be okay. And there it is. No more error. So it's going to be an error if the shapes do not match. Again, if we made this one shorter, that's going to be an error, and it's going to tell us that we have one dimension that is inhomogeneous, meaning it doesn't match. The rows aren't the same length, so the shape isn't consistent. We should correct that and make sure it matches. Okay, so that is a two-dimensional array, a list of lists. And by the way, it doesn't have to stop there. We could keep going. So now we have a three-dimensional array, which is a list of lists of lists. It basically has one of these matrices as each element.
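A quick sketch of the matching and non-matching cases, with illustrative values. This assumes a recent NumPy (1.24 or later), where mismatched row lengths raise a ValueError mentioning an "inhomogeneous shape"; older versions only warn and build an array of list objects instead.

```python
import numpy as np

# Three rows of three items each: a valid 3x3 two-dimensional array.
good = np.array([[1, 1, 1],
                 [2, 2, 2],
                 [4, 5, 6]])
print(good.shape)

# Drop one item from the last row and the rows no longer line up.
try:
    bad = np.array([[1, 1, 1],
                    [2, 2, 2],
                    [4, 5]])
except ValueError as err:
    print("ValueError:", err)
```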
So see how this has an overall list where each element is a 2D array, right? Each element is a matrix. Here is one of those matrices as the first element, and then here is another 2D matrix as the next element. This forms a three-dimensional array. If we print this out, we can see that we get this 3D array where this matrix is the first item, this matrix is the second item, and on and on. I didn't understand how 2D is different from 3D. Okay. Do you see how, with the 2D, we take this whole thing and that's just one element of the 3D? This 2D matrix is one element, and then we have another 2D matrix as the next element. So matrices are now the elements of the 3D array. Whereas look at what the elements of the 2D array are. They're just lists. It doesn't have to be two of them. That's just the example we have. By the way, one thing that gives away the dimensions is how many of these opening brackets we have. See how this one has two brackets and this one has three brackets? Oh, you got it, Roberto. Perfect. You got it with the brackets. So what's an example of using a 3D array? Something that uses a 3D array would be a batch of images. Let me give you an example. An image is like a 2D array because it has pixels, right? An image is broken down into pixels at whatever resolution. So if we have a collection of those, that's a 3D array. If we have a hundred of them, it's a 3D array. Does that example make sense? A collection of images would be a 3D array because every image is a two-dimensional matrix of pixels. Yeah. A 3D array is a collection of matrices. Exactly. Does this example make sense though?
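To see the "list of matrices" idea and the image-batch example in code, here is a hedged sketch. The 28x28 image size and the count of 100 are assumptions chosen for illustration, and the images are just zeros rather than real pixel data.

```python
import numpy as np

# A 3-D array: an outer list whose elements are 2-D matrices (three opening brackets).
cube = np.array([
    [[1, 1, 1], [2, 2, 2]],   # first matrix
    [[3, 3, 3], [4, 4, 4]],   # second matrix
])
print(cube.ndim, cube.shape)

# A batch of images as a 3-D array: 100 hypothetical grayscale images, 28x28 pixels each.
batch = np.zeros((100, 28, 28))
first_image = batch[0]        # each element of the batch is a 2-D pixel matrix
print(batch.shape, first_image.shape)
```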
This is a good one. I'm glad you asked, because I think it's a good example for thinking about what a 3D array is. Every element is a 2D matrix. What is the maximum 2D element in a 3D array? I'm not sure what you mean by maximum 2D element. Oh, how many can you fit in a 3D array? As many as your memory will allow. You can have as many matrices in a 3D array as you want until you run out of memory, essentially. Okay, perfect. So just to recap, we have the NumPy array. We are able to build NumPy arrays using np.array, and we're able to take a list of data and populate it into an array. What we're going to do next is build off of this to learn how to manipulate that array and do different things with it. All right. So let's go to the next notebook, 3.02. Well, before I do that, any other questions about building an array? Okay, perfect. So let's go to our next notebook. Do you have the 3.02 notebook up? That's the next one we're going to do, so take a moment to pull that one up. Yep, I see some thumbs up. Nice. Okay, so we're going to build off of that NumPy array by taking a look at some attributes of an array. Assuming we have an array, no matter how many dimensions it has, what are some attributes of this array that are useful to us? Let's take a look at an example where, again, we import numpy, and initially we make a 2D array. Then what we're going to do is print out a bunch of the attributes of this array, and we're going to explain what they do. The first is, if we ever want to know how many dimensions an array has, there's actually an attribute for that, which is called ndim.
So we just do ndim. By the way, let me call this out here: we access attributes of an object by using the syntax object.attribute. In this case our array is our object, so, for example, array.shape: shape is an attribute of the array, and array is our object here. That's our syntax to access different attributes. So the first attribute we're going to learn about is called ndim, which gives us the number of dimensions. When we print that out, you can see what it's going to be. It's going to be two, and that makes sense. It's a two-dimensional array. Could the 2D array be? Yeah, that's fine. You could use that. That's still two-dimensional. Okay. So the first is ndim, which gives us the number of dimensions, which equals two in this case. And that makes sense. We know it's a two-dimensional array based on the fact that our elements are 1D lists. We have a list of lists, so it's going to be two dimensions. Now, shape gives us the quantity along each dimension, basically the number of rows and columns. In this case, what this is saying is we have two rows and three elements in each row. So shape is giving us an idea of how many elements we actually have, or what the shape of this matrix is. That's why it's called shape. In this case we have a 2D array where the first row is 1, 2, 3 and then there's a second row of three values. So we have two rows and we have three columns. Hence we get a shape of 2x3. Does that make sense? 2 by 3. Yes, that should be true. Okay, so shape is going to be incredibly useful as we go forward, because there are a lot of times where, if we have an array, we just want to know how many rows and columns it has, which is the shape.
So the shape is a good attribute to know. Does the bracket define it as a 2D versus a 3D array? It's the fact that it has two opening brackets that defines it as a 2D array. It has two brackets here. It's a list of lists. This is 2D. Can a 2D array have more than two rows? It can. So, sorry. Yes, a 2D array can have more than two rows. What I meant to say is that in the shape we're going to see two entries. Let me correct that. The number of dimensions will match how many entries we have in the shape. We have two entries here because it's two-dimensional. So sorry, Roberto, I was wrong on what I told you. It's 2D because there are two entries in the shape, not because the first entry happens to be a two. By the way, how can we get a third row? We could just add in another list here, right? If we add in another list that has three elements, like 7, 8, 9, this is still a two-dimensional array. But what I want you to notice is that the shape is going to change and the size is going to change, while it should still be 2D. See how this shape went from 2x3 to 3x3? Yeah, the size is going to be the product. That's true. I was going to get to that next. What is the shape with three rows? It's just 3x3. Do you see that now, Marielle? Now that this has a third row, it's 3x3. Three rows with three elements each in a 2D array. Perfect. So now I want to talk about the size. The size is the total number of elements in the array. In this case we have 3x3, so we have nine total elements. This will always be the product of the entries in the shape, right? Lots of questions. Let me see. I'm trying to keep up.
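The ndim, shape, and size discussion above can be checked with a short sketch. The row values are illustrative stand-ins for the notebook's data.

```python
import numpy as np

array = np.array([[1, 2, 3],
                  [4, 5, 6]])                 # two rows, three columns
print(array.ndim, array.shape, array.size)    # 2 (2, 3) 6

# Adding a third row changes the shape and size, but it is still 2-D.
bigger = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
print(bigger.ndim, bigger.shape, bigger.size)  # 2 (3, 3) 9

# ndim is the number of entries in the shape; size is the product of those entries.
print(len(bigger.shape) == bigger.ndim)
print(bigger.shape[0] * bigger.shape[1] == bigger.size)
```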
Yes, you can have as many rows and columns as you want; there's no limitation. Do I have a real-world example of a 3D array? Yes, it's the one I gave earlier: a batch of images. We're going to work with images quite a bit when we get into deep learning. Each image is a 2D pixel matrix, so a collection of images is an array whose elements are matrices, and that's a 3D array. Do all the arrays along the same dimension need to be of equal shape? Yes, they do. We saw that earlier: even in the 2D case, if we got rid of one element in a row, we'd get an error, because the rows need to match in length to build the array. Okay, do we see what size is? Size is the total number of elements, which is just the product of the shape: three rows, three items in each row, so a 3x3 has nine elements. Now, dtype tells us the data type of each element. Maybe not that interesting, but in this case they're all integers, stored as int64. We can also grab how many bytes each one takes up with the itemsize attribute. That's each element's memory footprint, which is eight bytes here. And, although it's very rare that we would need it, we can also get the memory reference for the array data.
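A short sketch of the dtype and itemsize attributes. It assumes int64 is requested explicitly, since the default integer type can differ by platform; `nbytes` is an extra attribute not mentioned in the session, shown only to connect size and itemsize.

```python
import numpy as np

# dtype is set explicitly here so the result does not depend on the platform default.
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.int64)

print(arr.dtype)     # data type shared by every element
print(arr.itemsize)  # bytes used by one element
print(arr.nbytes)    # total bytes: size * itemsize
```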
That's array.data. We won't really ever need to worry about it, but it matters for pandas later on, because everything in pandas is built on NumPy, and it often needs to work with the raw memory to do calculations on the data. We generally never need to know what that memory address is. Any questions about these attributes? I think the one we'll use the most is shape. Okay, perfect. One question: must all elements in a 2D array have the same data type? Within a NumPy array, yes: the array has a single dtype. You can pass in mixed values, just like with a list, but NumPy will convert them to one common type. Try it out for yourself. All right, I want to show you a couple of functions for manipulating the shape of an array. The first is array.reshape. We call array.reshape and pass in a tuple with the new shape. Here we're passing in (4, 3), which says we want to take this existing one-dimensional array and turn it into a two-dimensional array that is 4x3, meaning four rows and three columns. But first, let me ask you: do you think this is even possible? What needs to be true in order to reshape this properly? If I want to take a flat 1D array and put it into something that's 4x3, how many elements do I need? Perfect, you're right on top of it: 12.
I need 12 elements. So what happens if I don't have 12? Let's try it. Error. And look at what the error tells us: cannot reshape an array of size 11 into shape 4x3. So that's the prerequisite for reshape: the total number of elements in the new shape has to match the number of elements you start with. As long as it does, you can reshape into any shape you want. We could reshape this into 3x4, because that totals 12. We could do 2x6. But could we do 2x7? No, that shape needs 14 elements and we only have 12. What about 2x3x2? Sure, that's three-dimensional: we have two matrices and each one is 3x2. So reshape takes an existing array and moves it into a new shape, assuming the element counts line up. That's incredibly useful, and we'll use it from time to time. We can also do the reverse: the flatten function takes an array of any shape and gives us a 1D array. You can see it takes this three-dimensional array and flattens it into a one-dimensional array. So flatten always returns a 1D array.
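The reshape and flatten behavior described above can be sketched as follows. The 12-element array built with `np.arange` is an assumption for illustration; the notebook's actual values are not shown in the transcript.

```python
import numpy as np

flat = np.arange(1, 13)         # 12 elements in a 1D array

grid = flat.reshape((4, 3))     # 4 rows, 3 columns
cube = flat.reshape((2, 3, 2))  # two matrices, each 3x2
back = cube.flatten()           # always a 1D array again

print(grid.shape, cube.shape, back.shape)

# With 11 elements the counts no longer match, so reshape raises ValueError.
try:
    np.arange(11).reshape((4, 3))
except ValueError as err:
    print("reshape failed:", err)
```

The only requirement is that the product of the new shape equals the number of elements in the original array.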
Pretty straightforward. Are there any benefits? Yes. Sometimes we need data in a particular shape before we manipulate it. We'll see that later when we get into deep learning: especially when we work with images or text, it's going to be important to reshape things from time to time. Okay, one more and then we'll take a break: there is a transpose function. It swaps rows and columns, so the row 1, 2, 3 becomes a column. This was a 2x3 matrix and it transposes into a 3x2 matrix. That's going to be useful down the road for different algebra calculations. Any questions on these functions? Hopefully they're not too bad. We have reshape, flatten, and transpose, and they all just change the shape of the array. Next, we're going to do some arithmetic operations. When you have data in NumPy arrays, you can do element-wise operations, meaning operations that go element by element, match the elements up by position, and apply a mathematical operation between them: addition, subtraction, multiplication, division. For instance, we have these two arrays of the same shape, and we can add them together.
That means this position is added to this position, this one to this one, and this one to this one. When we do that, we get 40 in every slot, because 30 + 10 is 40, 20 + 20 is 40, and 10 + 30 is also 40. But look at the syntax; there are two ways to do it. One way is np.add, passing in your two arrays, a and b, and storing the result. Notice that the result is an array of the same shape, with each element being the sum of the elements at that position in the original arrays. The alternative is to use standard arithmetic: result = a + b, then print the result. It doesn't really matter which one you use; I've seen both, and both give the same array. Either np.add or a + b does that element-wise addition. We have the same thing for subtract, multiply, and divide. Now we have a 2D array, but the exact same thing happens. np.subtract is the same as a minus b: it takes a and subtracts b from it. So we have 30 minus 10 is 20, 40 minus 20 is 20, 60 minus 30 is 30, and then 50 minus 40 is 10.
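A small sketch of the two equivalent spellings for element-wise addition and subtraction. The 1D values follow the 30/20/10 example described above; the names `a` and `b` match the ones used in the session.

```python
import numpy as np

a = np.array([30, 20, 10])
b = np.array([10, 20, 30])

added_fn = np.add(a, b)  # function form
added_op = a + b         # operator form, same result

diff_fn = np.subtract(a, b)
diff_op = a - b

print(added_fn, added_op)
print(diff_fn, diff_op)
```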
So we're subtracting the elements in the same positions, and the result is a 2D array. Again, you could do a minus b or np.subtract; either way works. Could you do 2D with 3D? Well, let's try it out. This one is 2D. Let's build a 3D array: we'll copy this matrix, then add another one with different numbers, say 10, 15, 20, 25, 30, 45. Now let's do np.subtract(a, b) and see what we get. We actually do get a result, and the reason is something called broadcasting. When the shapes are compatible, NumPy stretches the smaller array so the shapes line up for the operation, but you may get consequences you didn't intend. Here the 2D array is applied to each matrix in the 3D array. For the first matrix we get zeros, because it's 30 minus 30, 40 minus 40, 60 minus 60. For the second we get 30 minus 10 is 20, 40 minus 15 is 25, and so on. So the dimensions don't have to be the same, but be careful: NumPy will make the shapes fit by applying the 2D matrix across the 3D array, and you might get results that are valid in NumPy but don't really make sense for your data. And if you subtract a bigger value from a smaller one? It'll just give you a negative.
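A sketch of the 2D-with-3D subtraction tried above. Some values (the second row of `a`) are assumed, since the transcript does not state the full arrays.

```python
import numpy as np

a = np.array([[30, 40, 60],
              [50, 70, 90]])            # shape (2, 3); second row is assumed

b = np.array([[[30, 40, 60],
               [50, 70, 90]],           # a copy of a
              [[10, 15, 20],
               [25, 30, 45]]])          # shape (2, 2, 3)

# Broadcasting applies the 2D array to each matrix inside the 3D array.
result = np.subtract(a, b)
print(result.shape)
print(result)
```

The result takes the larger shape, (2, 2, 3): the first matrix is all zeros and the second holds `a` minus the second matrix of `b`.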
For example, what would happen here if we put b first? You get negatives, so it's still possible. Okay, multiplication works the same way: 30 * 10, 20 * 20, 10 * 30, which gives us the array 300, 400, 300. These are just element-wise multiplications. Again, the alternative is result = a * b, which is the same thing, but we can use np.multiply to make the operation more explicit. This is not to be confused with matrix multiplication. Traditional matrix multiplication will be covered later; it requires the matrices to be compatible, and it's a completely different operation from element-wise multiplication, which is just a * b or np.multiply. We can also do division. Notice that this is again a scenario where the shapes are not the same: we have a 2D array and we're dividing it by a 1D array. What happens is that we take the first row and divide it by the 1D array, then take the second row and divide it by the 1D array, and that gives us the second row of the result. So the result matches the larger of the two shapes. The first array is two-dimensional with a 2x3 shape, and the second is a 1D array with just three elements. So when we do the division, NumPy does that broadcasting I mentioned earlier and tries to force this to be able to divide by this.
And the way NumPy does that is to say: which is the bigger shape? This one, because it has more dimensions. So it takes each row and divides it by the 1D array. What do you think happens if I reduce the 1D array to only two elements? Do you think this division would work? Let's try it. Error. It doesn't broadcast that way; the message even says it tried to broadcast but couldn't. So even with broadcasting, the shapes still need to line up: the smaller array has to be a valid shape to broadcast across the dimensions of the larger one. Can we divide b by a? Yes. Again it matches the shapes and does this divided by this, element by element. But remember that a / b is not the same as b / a, so you get a different result. Can you get a division by zero? Try making one of these zeros. Say b was all zeros and we did a divided by b. Do you think that'll work? It technically returns a result, but the values are all infinity, and we get a warning that we're dividing by zero. So it's kind of like an error. All right, let's go to exponents. We can do element-wise exponents, where every element in a is raised to the power of the corresponding element in b. For instance, 2 squared, 2 cubed, 2 to the 4th, 2 to the 5th, 2 to the 6th, and you can see what each of those comes out to. Each element in a gets raised to the exponent of the corresponding element in b.
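A sketch of element-wise multiplication and broadcast division, including the two failure cases discussed above. The arrays `m` and `v` use made-up values; the warning is silenced with `np.errstate` only to keep the output clean (by default NumPy emits a RuntimeWarning).

```python
import numpy as np

a = np.array([30, 20, 10])
b = np.array([10, 20, 30])
product = np.multiply(a, b)     # same as a * b, element by element

m = np.array([[10.0, 20.0, 30.0],
              [40.0, 50.0, 60.0]])  # shape (2, 3)
v = np.array([10.0, 5.0, 2.0])      # shape (3,)
quotient = m / v                    # v is broadcast across both rows

# A two-element divisor cannot be broadcast against rows of three.
try:
    m / np.array([10.0, 5.0])
except ValueError as err:
    print("broadcast failed:", err)

# Dividing non-zero values by zero yields inf rather than raising.
with np.errstate(divide="ignore"):
    by_zero = m / np.zeros(3)

print(product)
print(quotient)
print(by_zero)
```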
So we get 2 squared, 2 cubed, 2 to the 4th, 5th, and 6th. That may be useful from time to time, when we have a reason to raise the elements of one array to the powers in another; there's a power function for that. Before we move on to the statistics functions, which are going to be really interesting, any questions about the arithmetic? Hopefully it's very much like regular arithmetic. Can you build arrays from a set of arrays? You mean like a Python set? Oh, from arrays: yes, you can build arrays from arrays. We haven't done that yet, but it's not hard, so let me show you a quick example. Inside a list we can have np.array with 1, 2, 3, and then another np.array with 4, 5, 6, and pass that list to np.array. If I display x, this makes a 2D array. Does that make sense? I'm using arrays as the input elements to build a new array. Good question. Any others before we go on to the statistics functions? All right. What's great about NumPy is that it can easily compute statistics on arrays: the median, the average, the standard deviation, the variance. I know we haven't technically defined what each of those is, but that's okay.
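A brief sketch of the power function and of building an array from existing arrays, following the examples just described. The names `base` and `exponents` are illustrative; `x` is the name used in the session.

```python
import numpy as np

base = np.array([2, 2, 2, 2, 2])
exponents = np.array([2, 3, 4, 5, 6])
powers = np.power(base, exponents)  # each base raised to the matching exponent
print(powers)

# Two 1D arrays inside a list become the rows of a new 2D array.
x = np.array([np.array([1, 2, 3]), np.array([4, 5, 6])])
print(x.shape)
print(x)
```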
It's useful to know that NumPy can easily compute these statistical functions on data inside an array. For instance, if we have this 2D array, we can compute the median of all its elements with the built-in NumPy function np.median. We'll get to what the median is in our statistics overview later on, but it's the 50th percentile, the middle element: we order the values from least to greatest and find the middle one. In this case the median is four. We can also take the average, which is the mean. np.mean averages all the elements together, and here we get 6.3333. This seems really straightforward, but I want you to see how powerful it is: lists do not have this ability. There is no built-in mean function for a list. NumPy has it, and it's really optimized; it's fast to find the mean and fast to find the median. If we wanted the average of a list, we'd have to do it manually: total up all the elements and divide by the size of the list. Not that it's hard to do, but it's something we would have to define ourselves. It does not automatically exist the way it does here with np.mean.
It's a simple function that we can apply to NumPy arrays to compute the average. Same thing with standard deviation and variance. Those are more advanced statistical functions that measure the spread of the data away from the average. Again, we're going to learn about these later on, but np.std computes the standard deviation and np.var computes the variance. The standard deviation is the square root of the variance, so if you took the standard deviation and raised it to the second power, you would get the variance. But you can compute them separately this way. We're going to be using these quite a bit, especially when we explore our data and want to find an average or a standard deviation. Any questions about these? I know we haven't defined exactly what they are; we will later, so no worries if you're wondering how to calculate them by hand. Okay, very good. Let's talk about percentile. NumPy also has a percentile function, so we can take an array and compute the 50th percentile, which is the median. We haven't learned what a percentile is, but think about ordering all the elements: the median is at the 50th percentile, and half the data is below it. In the same way, there's a value that 25% of the data is below, a value that 75% of the data is below, and a value that 95% of the data is below.
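A sketch of the four statistics functions covered above, next to the manual list-based average. The array values are made up for illustration and are not the ones from the notebook.

```python
import numpy as np

values = np.array([[2, 4, 6],
                   [8, 10, 12]])  # hypothetical data

median = np.median(values)  # middle of the sorted values
mean = np.mean(values)      # average of all elements
std = np.std(values)        # standard deviation
var = np.var(values)        # variance, the square of std

# With a plain list the average has to be computed by hand.
as_list = [2, 4, 6, 8, 10, 12]
manual_mean = sum(as_list) / len(as_list)

print(median, mean, std, var, manual_mean)
```

By default these functions use every element of the array, regardless of its shape.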
So the percentile ranges from 0 to 100. The 99th percentile means that 99% of the data falls below that value if we sorted it. We can compute any percentile we want by passing in the array and a number between 0 and 100; it doesn't have to be a whole number. For instance, we can compute the 99th percentile. Oops, I have to actually run this. There we go. So 99% of the values are below 22.8, which makes sense because most of the numbers are pretty low. There's really only one number above that value, which is the 24. If I did the 100th percentile, that would be the max, the 24. And the 50th percentile is the median: half the values are below it and half are above. That matches the median we found earlier, which was four. Any questions about percentile? All right. Finally, I wanted to mention that you can manipulate strings in NumPy. This is used less often, because NumPy data is usually numerical, hence the name, numerical Python. But it is possible to work with strings and do different string manipulations. For instance, say we have a NumPy array with two strings, Hello and World, and another array with Welcome and Learners. These are two 1D arrays. We can concatenate the strings element-wise by using the np.char (character) module.
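A sketch of np.percentile on a small made-up array with one large value, to mirror the behavior described above. The data is hypothetical, so the numbers differ from the 22.8 seen in the session.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 24])  # hypothetical values

p50 = np.percentile(data, 50)     # same as the median
p99 = np.percentile(data, 99)     # falls between the two largest values
p100 = np.percentile(data, 100)   # the maximum
p99_5 = np.percentile(data, 99.5) # fractional percentiles are allowed

print(p50, p99, p100, p99_5)
```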
The reason it's inside the character module is that instead of numerical addition, it's doing string addition, which is concatenation. So Hello and Welcome end up concatenated together, and World and Learners get concatenated together. That's element-wise string concatenation from the np.char module within NumPy. It's very rare that we would have to do this; I'm just pointing out that it exists if for some reason we need to manipulate strings. We can also replace substrings with new strings. If we have this original string, Hello, how are you?, we can use the character module's replace to replace Hello with Hi. When we print the new string we get Hi, how are you? The replacement happens only if the substring is found. Let's test that by putting in something that doesn't exist in the string. We're trying to replace "something" with Hi, but "something" isn't in there, so nothing is replaced and we just get the original string back. Hello does exist, so it gets replaced with Hi. If we want to, we can also uppercase everything or lowercase everything.
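A sketch of the np.char operations mentioned above, using the strings from the examples. The variable names are my own, and the strings are wrapped in one-element arrays so each result can be indexed directly.

```python
import numpy as np

first = np.array(["Hello", "World"])
second = np.array(["Welcome", "Learners"])
joined = np.char.add(first, second)  # element-wise concatenation

original = np.array(["Hello, how are you?"])
replaced = np.char.replace(original, "Hello", "Hi")
untouched = np.char.replace(original, "something", "Hi")  # substring not found

shouted = np.char.upper(np.array(["hello"]))
quiet = np.char.lower(np.array(["HELLO"]))

print(joined, replaced, untouched, shouted, quiet)
```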
You can see how those work: this string is all lowercase, and if we call upper and pass it in, it uppercases everything. This one is all uppercase, and we can lowercase everything. Again, it's very rare that we'd need to manipulate strings, because most of our data inside a NumPy array is going to be numerical, but if we do, the character module can help. Any questions there? All right, just to recap. We have our arithmetic operations between NumPy arrays: np.add or a + b, np.subtract or a - b, and likewise multiply and divide. We also have a power function, which raises elements to exponents. And the ones we'll use all the time going forward are the statistical functions, like average, standard deviation, variance, and median, which we can compute really easily on an array. So let's go to our next notebook and continue working with NumPy. Go to the 3.04 notebook. The whole point of this one is to practice accessing data within a NumPy array. And truly, this is going to be great, because it works the same way as a list: we access elements by their position and slice just as we did with lists. The one difference is that NumPy arrays can have multiple dimensions. That's where we need to be careful: if we want to access an element in, say, the second row, third column position, how do we do that?
And it's actually really easy if we think of the index as a coordinate that says exactly which element we want to access. Take a look at this picture; it's a great way to break down a 2D NumPy array. Imagine a 2D array whose shape is two rows by three columns, a 2x3 shape, with elements 1, 2, 3 and 4, 5, 6. Two rows, three elements in each row. Now, this element right here is in the first row, so its row index is zero. There are two rows, so the row coordinate is either index zero or index one. Which column is it in? The first column, so its column index is zero. The coordinate for this element is 0, 0: if we pass in 0 comma 0 as our index, we access that element, because it signals row zero, column zero. So with multi-dimensional arrays, elements are accessed by their index coordinate rather than by a single index like we saw with lists. With a list, we could grab something at index zero, index one, index two, maybe index minus one. With a 2D array we grab things at their coordinate, so row 0, column 0 should give me the element one. Let's try another. This element is in the same row, row zero, but it's now at column one.
So we should be able to access that two sitting at the row zero, column one index. And lastly, from this same example, the three: we're still in row zero, but now at column index two, and that should equal three. So in a 2D array, the first entry is the row we want to go to and the second is the column. Imagine scanning over the grid. What row do we want? Row zero, the first row. What column? Column two, the third and last column. Does that make sense, thinking about it like a grid with a coordinate for each element? Any questions on that? Are the i and j just random? In this example the sizes are arbitrary; it's just a 2x3. But they represent how many rows and columns we have: i for the rows, j for the columns. And yes, they could be any letters. We're using i and j because those are the traditional indices for row and column. Okay, let's see how this works in code. Let's create some arrays to practice with: a 1D array, a 2D array, and a 3D array, so we can practice accessing certain elements. If we look at the 1D array, this behaves just like a list. In fact, I'm going to write that down: this behaves just like a list.
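The grid-coordinate idea from the picture can be sketched like this, assuming the 2x3 array of 1 through 6 described above. The name `grid` and the loop are illustrative additions.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(grid[0, 0])  # row 0, column 0
print(grid[0, 1])  # row 0, column 1
print(grid[0, 2])  # row 0, column 2

# Walk every (i, j) coordinate: i indexes rows, j indexes columns.
rows, cols = grid.shape
for i in range(rows):
    for j in range(cols):
        print((i, j), grid[i, j])
```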
With the 1D array we can access the first element or the third element; in this case we're accessing the fourth element, which is at index three, and that returns the four. We can also get the very last element at the minus one index, and that is the six. So a 1D array behaves just like a list when it comes to accessing elements. We don't need to worry about coordinates because there are no rows and columns. And for instance, we can add two elements together: this adds the elements at position one and position zero, which is 2 + 1, so we get three. All right, here's another picture of what we just drew, where we think of each element's position as a coordinate. The element in the first row, first column is at coordinate 0, 0. Next to it are 0, 1 and 0, 2. If we go down to the next row, we have row one, column zero; row one, column one; row one, column two; and so on for however many rows and columns we have. So in a 2D array we access elements by their row and column index. For example, let's get the element in the first row, index zero, and the third column, index two. If we go back to our 2D array, the first row, third column should be a three. And it is: if we print that out, we get a three. How do we feel about that example?
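A sketch of the 1D and 2D lookups just walked through, assuming the 1D array holds 1 through 6 and the 2D array is the 2x3 example. The variable names are illustrative.

```python
import numpy as np

arr_1d = np.array([1, 2, 3, 4, 5, 6])
fourth = arr_1d[3]               # index three is the fourth element
last = arr_1d[-1]                # negative index counts from the end
total = arr_1d[1] + arr_1d[0]    # add the elements at positions one and zero

arr_2d = np.array([[1, 2, 3],
                   [4, 5, 6]])
first_row_third_col = arr_2d[0, 2]

print(fourth, last, total, first_row_third_col)
```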
Do we see how we're accessing it from inside these brackets just like we would a list, but now it's a coordinate? Do you see that one? Good. Okay. Now we can also grab something from the second row, because this is now one, and the second column, because this is now one too. So if we go back to our array, second row, second column should be this guy here, the five, which it is. We print that out, we get the five. That's what that coordinate represents: second row, second column. All right. Very good. I want to take a look at a 3D array now, which is going to be a little interesting in that we're just going to add an extra coordinate to the mix. With the 3D array, we basically need to know which matrix we're talking about. Remember, a 3D array is going to look like this: rows and columns, rows and columns, rows and columns. We basically have an array of matrices as our 3D array. So the first coordinate is going to say which matrix we are at. Are we at this one? This one? This one? Roberto, I'm just having a little trouble with the con. If we already know the value we're looking for, then why do we need to use the coordinates? Yeah. So it's because we need to get familiar with how to access different elements of our data. For instance, maybe we need to access only the last row or only the last column in a collection of data, and we're going to practice slicing for that in a minute. We're going to need to be able to access entire collections of data within a matrix, so it's going to be important to be able to grab that collection of data from a large data set. Yeah.
I think I see what you're saying. It kind of seems redundant to do that right now when we can clearly see exactly what value it is, but in a large data set it wouldn't be that obvious, and we need a programmatic way to select those contents. Okay. All right. So in the 3D example, what I wanted us to see is that the first coordinate says whether we are talking about this matrix or that matrix. So we're actually going to have three coordinates. The first coordinate relates to which matrix we're talking about. A zero there would say, okay, we're talking about the first matrix. Then it's the same as usual: the next two coordinates are the row and column within that matrix. So with 3D, that first coordinate is which matrix it is. Is it the first, the second, the third, the fourth? Because remember, in a 3D array, every element is a matrix, a 2D array. Once we know which matrix it is, we use the next two coordinates to figure out what row and column of that matrix we're talking about. So let's see an example. Look at how this 3D index has three entries. If we break this down, the first coordinate says we want to access something within the second matrix, and within that second matrix I want the element that's at row 0, column 0. In the setup it's like we have a matrix here and a matrix here. Actually, we only have two in our example. So this index means we want to look at that second matrix and grab the element that's at coordinate 0, 0.
So let's go back to our array and see what that should be. What is the second matrix? It's this guy. This is the second matrix within our array. And then we want to access row 0, column 0. That should be this seven, right? And that's exactly what we get. If we print out this coordinate, we get seven. Matrix refers to row? Kind of, yeah. It's because in a 3D array every element is a 2D matrix. Okay. All right. How do we feel about this three-dimensional indexing? Does that make sense? This is the second matrix, and then this is row 0, column 0 of that second matrix. Good. Okay. All right. Finally, I want to mention that we can do negative indexing for all of this, so that still applies. If we did minus three, that would be third from the last. If we go back to our 1D array, counting back minus one, minus two, minus three lands on the four. So we can still use negative indexing as we would with any list. That does not change. We can even use it in the 2D array. Here it's in terms of the column: this says that in the 2D array we want to grab the element that's in the second row but in the last column. That's what the minus one means in this case. Let's go verify that's a six. Second row, last column would be this guy. Yep, that's a six, because this is our second row and this is our last column. So that's a convenient thing. We don't need to know exactly how many rows or columns there are. If we just want to grab the last one, we use minus one. So this is second row, last column.
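As a rough sketch of the 3D and negative-index lookups just described: the arrays below are assumptions reconstructed from the values mentioned in the session (a seven at the top-left of the second matrix, a six at the end of the second row of the 2D array), not the notebook's actual cells.

```python
import numpy as np

# Assumed arrays, consistent with the values read out in the session.
arr_1d = np.array([1, 2, 3, 4, 5, 6])
arr_2d = np.array([[1, 2, 3],
                   [4, 5, 6]])
arr_3d = np.array([[[1, 2, 3],
                    [4, 5, 6]],
                   [[7, 8, 9],
                    [10, 11, 12]]])

# 2D: second row, second column.
middle = arr_2d[1, 1]

# 3D: [matrix, row, column]. Second matrix, row 0, column 0.
top_left_of_second = arr_3d[1, 0, 0]

# Negative indices count from the end, just as with lists.
third_from_last = arr_1d[-3]
second_row_last_col = arr_2d[1, -1]

print(middle, top_left_of_second, third_from_last, second_row_last_col)
```

The shape of `arr_3d` is `(2, 2, 3)`: two matrices, each with two rows and three columns, which is why the first coordinate can only be 0 or 1 here.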
And then if you look at the 3D example, we have, in our second matrix, the second row, last column again. So the second matrix would be this guy, the second row is here, and then the last column would be the 12. So that coordinate gets us to the 12. All right. Any questions on accessing elements? Any other questions about this stuff? By the way, there's some extra practice for you guys. I encourage you, as kind of a homework for next time, to try out the extra practice that's at the end of this notebook. It just has you create some arrays and do some indexing. So go ahead and try that out on your own and let me know what you think.

Can everyone see this? Okay. It should be the 4.01 notebook. Thank you guys. All right. Hopefully you have this notebook too and are able to follow along. As I said, we're getting into our fundamental data structures for working with data science and eventually machine learning, and pandas is going to be really crucial for us because it contains two really critical structures, the series and the data frame. So I wanted to take time at the beginning to review what those two data structures are and their use cases. Then we'll spend the rest of today going through a bunch of notebooks to get some familiarity with the series and the data frame. We'll be doing a bunch of examples of different manipulations and different ways to work with data inside of a series and inside of a data frame. So we'll be covering the ins and outs of that, as well as some other features of pandas that we'll use along the way every now and again.
So all of the beginning of today will be pandas. If we have enough time, we'll get into lesson five, which will be around visualization, so we'll start doing things like plotting and using libraries like Matplotlib and seaborn. But mainly I want to focus today on pandas as a really fundamental Python library for doing anything with data. All right. So where we left off last week was that I told you pandas was going to be really important for us, and I mentioned that it has two primary data structures that we will use all the time, the series and the data frame. Just to review, the series is essentially a fancy NumPy array. A NumPy array underlies the series and contains the actual data, so think of it as a one-dimensional array. What's special about a series, though, is that it also has an index that can be customized. It doesn't have to be just a standard positional index like what we've seen with arrays, which have a natural order to them: a zeroth, first, second, third, fourth, fifth element. We can actually have a custom index that could be a string, a date, or really any object we want it to be. So that's what makes a series kind of special: it has the array of data but with a custom index, which is useful in different situations, as we'll see. Now a data frame just takes that into a second dimension. I would say the best example of a series is a column in a table or in an Excel spreadsheet. It's a single column that has a bunch of data, and it can have a row label like you see in Excel, so every entry can be labeled as coming from row A, row B, row C, row D, etc.
That's a series. But a data frame is really the whole table. A data frame is basically a collection of series, so it has rows and columns to it. We think of it as a two-dimensional array. The data frame is going to be incredibly useful for us because nearly all the structured data we work with, especially in machine learning, will be in the form of a data frame. We'll put it into a data frame and manipulate it as a data frame because it will have rows and columns, and those columns will represent different attributes of the data. Think of an Excel spreadsheet as the example: you have different columns that represent different things. If you're thinking about data that represents different houses, every row is an example of a house and every column is a different feature of that house, like the number of bedrooms, the number of bathrooms, the price, those kinds of things. So if I had to draw a picture of these two guys, again the series is a single array. I think of it like a column, with the entries inside of the series, and it does support different data types, including different objects. I see your question, Tim. Yeah, a series can hold different objects. Can it hold a function? In theory it could. I haven't really seen that, but I don't see why it couldn't hold a function. Generally it can hold any object, and everything in Python is an object. So I haven't really seen a use case for it, but I'm sure that it could. Anyway, the series is going to be this one-dimensional array, and then it's going to have an index like a, b, c, d, etc. that's fully customizable. It doesn't have to be the standard positional index.
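To make the houses picture concrete, here is a small sketch of a data frame where every row is a house and every column is a feature. The numbers, the column names, and the row labels are all made up for illustration; nothing here comes from the course notebook.

```python
import pandas as pd

# Hypothetical house data: each row is one house, each column one feature.
houses = pd.DataFrame(
    {
        "bedrooms": [3, 2, 4],
        "bathrooms": [2, 1, 3],
        "price": [350000, 210000, 480000],
    },
    index=["a", "b", "c"],  # custom row labels, like the a, b, c in the drawing
)

# Pulling out a single column gives back a series with the same index.
bedrooms = houses["bedrooms"]

print(houses)
print(type(bedrooms))
```

This is the sense in which a data frame is a collection of series: each column on its own is a one-dimensional series sharing the table's index.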
So that's the one-dimensional case, the series. If you extend that into two dimensions, you have a data frame, which is an entire table. It's going to be many series kind of stacked like this, and you have rows and columns. And just like in Excel, you can have an index here like a, b, c, d, etc. So a series is one-dimensional, and a data frame is two-dimensional, with rows and columns. All right. That being said, let's start with the series and then work our way up to extending that into two dimensions. In general, as I said, a series has data, which is that one-dimensional array, and it has an index. Again, that index can be fully customized. By default it's just the standard positional index. Just like we would see in a list or in a NumPy array, it would be 0, 1, 2, 3, as you see in this little picture right here. This would be the standard positional index of the series, and this here is the underlying data of the series. What's really nice about pandas is that it builds on top of NumPy, so this underlying data is stored by pandas as a NumPy array. Because of that, series kind of inherit the nice properties of NumPy arrays, in the sense that they can do the same arithmetic, they can be manipulated in a lot of the same ways, and they have a lot of the same attributes, like a shape and a transpose. So that's really nice about series and data frames: the underlying data is all built on top of NumPy. Now, you can create a series out of a list. You can create it out of a NumPy array. You can make a series out of a tuple.
You can make a series out of a lot of different data. It's just that one-dimensional array with that special kind of index that we can customize. So we're going to see different examples of building a series and manipulating a series. The other thing I should mention is that there's no restriction on the kind of data that the series can hold. It can hold objects, strings, integers, floats, really all of those, and it can even be mixed, just like in a list, where we can have strings and floats together. All right, so let me show you an example. The first place we'll start is by importing pandas, and this is just like we did with NumPy. Remember, all of our NumPy examples had import numpy as np. All of our examples here with pandas are going to have import pandas as pd. This is something we're going to get super used to as we go along: when we see pd, that should instantly tell us we're using pandas. The industry standard alias for pandas is pd, so if you're reading some code and you come across pd, that is just short for pandas, and everyone would understand you if you were using pd there. So we import pandas as pd. Look how easy it is to make a series. We just start with a regular list of 1, 2, 3, 4, 5, just those five numbers, and we create a series by calling pd.Series and passing in that data as an argument. This constructs a series, and we store it as a series object. What's interesting is that when you first create the series, if you don't specify an index, pandas uses the default positional index. It just assumes that the data is going to be indexed by position.
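A minimal sketch of the construction just described, assuming the same five numbers. The variable names are illustrative and may not match the notebook's.

```python
import numpy as np
import pandas as pd

data = [1, 2, 3, 4, 5]

# No index given, so pandas falls back to the default positional index 0..4.
series = pd.Series(data)

# The same values can come from a NumPy array or a tuple instead of a list.
from_array = pd.Series(np.array(data))
from_tuple = pd.Series(tuple(data))

print(series)
print(list(series.index))
```

Printing the series shows two columns: the index on the left and the data on the right.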
So the one would be at index zero, the two would be at index one, the three would be at index two, and on and on. It's just the standard indexing, like we learned when we accessed elements by their index and when we sliced. That's all we would need to do to access the items of the series. So that's if you don't specify an index. But I want you to see this example here where we use the same data but customize the index to be special labels for every item. So for instance, one would be indexed with a, two would be indexed with b, three would be indexed with c, and so on. The way we do that is, when we create the series, we explicitly pass in an index array, which signals that all this data is going to be indexed by this specific index. Why that's relevant is that when you print out this series, you can access things by label. Look at how down here we access things according to this b index. Obviously we could pass in a c or a d or an e, and that's how we would access things. But in the series that just has the default index, notice that we can use just the positional index. We could access the first thing, we could access the second thing, and this would access the third thing, which is at index two. And by the way, you can combine these. You might be wondering, well, we have some data and we have an index, so why don't we just map those two together? And we can. If we use a dictionary, we can combine those two and say, okay, a should map to one, b should map to two, c should map to three, d should map to four, and e should map to five. And we can create a series from this dictionary. So you can actually build a series from a dictionary.
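Here is a short sketch of the three ways of building the series mentioned so far: default index, explicit index labels, and a dictionary. The variable names are assumptions made for this example.

```python
import pandas as pd

# Default positional index: 0, 1, 2, 3, 4.
series = pd.Series([1, 2, 3, 4, 5])

# Same data, but with explicit labels passed through the index argument.
series_with_index = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# Same thing from a dictionary: keys become the index, values become the data.
series_from_dict = pd.Series({"a": 1, "b": 2, "c": 3, "d": 4, "e": 5})

by_label = series_with_index["b"]  # look up by the custom label
third = series[2]                  # label 2 of the default index, the third element

print(by_label, third)
print(series_from_dict)
```

The labelled series and the dictionary-built series end up identical, which is why the dictionary form is a convenient way to keep each label next to its value.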
What this will do is treat all the keys as the index values and treat all the things that they map to, the values, as the data. All of these 1, 2, 3, 4, 5 will be the data, and a, b, c, d, e will be the index. So when you use a dictionary, keys become the index and values become the data. Does that make sense? Any questions on creating a series? We can use a dictionary. We can just use a regular list. You can also use a NumPy array to create a series. It doesn't have to be a list. So all the code here is just creating a series. Nothing that interesting going on so far. And I want you to see what we get if we print these out, so I'm going to add one little cell below this and print out the series so you can see it. Let's print the series. Notice how the series prints out with kind of two columns. This is indicating that this is the index, 0, 1, 2, 3, 4, the positional index of this regular series, and then here's the data, the 1, 2, 3, 4, 5. So you see both of those. Let's print out the series with the custom index. You can see now that the index is a, b, c, d, e. That should make sense. This is kind of like an Excel column, isn't it? If you're familiar with those spreadsheets, or even Google Sheets, every row has a label like this A, B, C. You're dealing with cells that are in row A, B, C, D, and on and on. Let's take a look at the series from the dictionary, and we get the same thing: a, b, c, d, e and 1, 2, 3, 4, 5. What is the difference between a dictionary and a series? Well, a series is a different object.
A series is a pandas object, and therefore it has certain functions available to it, like finding an average, plotting, or finding a median. It has a lot of built-in functionality as a series that you don't get as a plain dictionary. Does that make sense? We're going to see this in pandas: you get a lot of functionality if you're a series. You get a bunch of statistical functions that can be easily run, you can do a bunch of summarization on that data, and you can do plotting against that data relatively quickly as a series versus as a plain dictionary. It would be object. Try it out. So let's say we put a string inside of here. Let's make that three a string, so we have integers and strings. Let's now print this out. Do you see how the dtype is now object? Once pandas sees a string mixed in with the integers, it can no longer use an integer type, so it stores the elements as generic Python objects. So if you see a data type of object, that likely means the series contains strings, because a string is technically an object. So we can convert that back to an integer, rerun that, and it goes back to an integer type. Okay. All right. I think it's a valid question, what's the difference between a dictionary and a series, because it's getting at what we get by having a series that we wouldn't have if we just had the dictionary. And it turns out that we get a lot of different functionality out of the series that we wouldn't otherwise have. So we're going to see some of those functions. The first couple of functions I want you to take a look at are right here, the head and tail functions. What these do is give you a view, or essentially a copy, of the first or last n rows. So with head, whatever value you put in here, you'll get those first few entries.
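Going back to the mixed-type experiment from a moment ago, this sketch shows what the object dtype looks like. The data is assumed to be the same five values with the three turned into a string, and the variable names are made up.

```python
import pandas as pd

labels = ["a", "b", "c", "d", "e"]

# All integers: pandas picks an integer dtype.
ints = pd.Series([1, 2, 3, 4, 5], index=labels)

# One string mixed in: pandas falls back to the generic object dtype.
mixed = pd.Series([1, 2, "3", 4, 5], index=labels)

print(ints.dtype)
print(mixed.dtype)

# The individual elements keep their own Python types inside the object series.
print(type(mixed["a"]), type(mixed["c"]))
```

So an object dtype is a hint that something non-numeric, usually a string, is in the data. Numeric functions such as the mean will generally not behave as expected until the values are converted back to numbers.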
Head goes from the top, from the beginning of the series, and gives the first however many rows. If you don't put a value in there, the default is five, so you get the first five entries. Or you could put a value in there like 10 and you'll see the first 10, or if you want to see the first three, you put in a three. So head gives you that copy of the first few elements, and tail does the same from the other end. Think of it as a shorthand for a slice. Tail gives you a slice of those last five guys, so this is similar to taking the series and slicing out the last five entries. And head is similar to taking the series and slicing out the first five. So head and tail are very handy functions that we will use quite a bit with data frames, particularly to get a quick view of the first five rows with head and the last five rows with tail. If we run this, we print out the first n rows, and we get the first five entries, right? We could instead do the first two, and that would just give us the top two. Now we could print out the last two, let's say. So print the last n rows, and this does the last two, right? All right. So head and tail are pretty straightforward. They give you either the first two from the beginning or the last two from the end. Series also have a shape, which makes sense because the underlying data is a NumPy array, which also has a shape. Let me show you what the shape is. Let's print the dimensions. Print that. And let me comment that out.
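A quick sketch of head, tail, and shape on the five-element series, with the equivalent positional slices alongside for comparison. The slices use `iloc`, which is an addition for this example rather than something shown in the session.

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5])

first_two = series.head(2)  # first two entries
last_two = series.tail(2)   # last two entries
default = series.head()     # no argument: up to the first five entries

# head and tail behave like shorthand for positional slices.
same_as_head = series.iloc[:2]
same_as_tail = series.iloc[-2:]

print(first_two)
print(last_two)
print(series.shape)  # a one-element tuple, because a series is one-dimensional
```

On a series this small the default `head()` returns everything. The function earns its keep on data with thousands of rows.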
As you can see, this series only has five elements and it's one-dimensional. Even though it has an index, it is one-dimensional. It doesn't have rows and columns. It basically only has five rows of data in a single column. It doesn't have multiple columns. So we just say it has a shape of five, which prints as (5,). A series is always 1D. All right. Now let me show you a really awesome function called describe. This is a cool function that gives you a descriptive statistical summary of your data. This is the advantage of having your data in a series: you can easily run a describe function, and see how it's series.describe. What this is going to do is give us a quick descriptive statistical summary of our data. It's going to print out the mean, max, min, median, and various percentiles of the data. So let's see that. If we print out the stats, we can see everything that gets produced. This is our descriptive summary. We get a count of how many values are in the series. We know there are only five because we created it with five, so that makes sense. The average is three, which also makes sense for the numbers one through five. The standard deviation is 1.58. The minimum is one. The 25th percentile is two. The median, which is the 50th percentile, is three. The 75th percentile is four. And the maximum is five. That all makes sense, because our data is just 1, 2, 3, 4, 5. But that's extremely useful, right, to get that so quickly.
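The summary read out above can be reproduced with a short sketch like this one, assuming the same 1 through 5 data. The result of describe is itself a series, so individual statistics can be pulled out by label.

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5])

# describe returns a series of summary statistics indexed by name.
stats = series.describe()
print(stats)

# Individual entries are looked up by their label.
mean = stats["mean"]
std = stats["std"]      # sample standard deviation, about 1.58 here
median = stats["50%"]   # the 50th percentile is the median
```

The labels are `count`, `mean`, `std`, `min`, `25%`, `50%`, `75%`, and `max`, which matches the order the values were read out in the session.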
You just run one function and you get this whole summary of the various stats of that data, which is pretty useful. And it's actually going to be incredibly useful when we have data frames. Describe is something we'll use quite a bit on an entire collection of series in the data frame. Any questions on describe? Does it make sense what it's doing? It's taking our series data and computing a bunch of statistics on it, so we get things like the max, the min, the average, the median. All of those get produced on that series of data. Pretty convenient. Can we get the median alone? Yeah. Remember how we had the np.median function? We can do that. If you just wanted the median, there are two ways to do it. You could do series.median, or you can do np.median and pass in the series. You can do it either way if you just want to compute the median. We can't forget that the underlying data is a NumPy array, so this function is compatible with the series. All right. Okay. If we want to get all of the unique values of the series, we can run this unique function, which will give us all the unique values that exist within the series. If we print this out, it should be the values one through five, because we don't have any duplicates. They're all unique. Clear that out. And it does, right? 1, 2, 3, 4, 5. This gives us an array of all the unique values. That's another advantage: if your data is in a series, you can quickly get all the unique values. If it's in a dictionary, that might be harder to do. It might be harder to sort through it and try to deduplicate it.
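Here is a small sketch of the two ways of getting the median and of the unique function. The second series, with repeated values, is an addition for illustration so that unique has something to remove.

```python
import numpy as np
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5])

# Two equivalent ways to get just the median.
median_method = series.median()
median_numpy = np.median(series)

# unique returns an array of the distinct values, in order of first appearance.
unique_values = series.unique()

# Hypothetical series with duplicates, to show the deduplication.
with_duplicates = pd.Series([1, 2, 2, 3, 3, 3])
deduplicated = with_duplicates.unique()

print(median_method, median_numpy)
print(unique_values, deduplicated)
```

Either median call is fine. The method form reads a little more naturally once the data is already a series.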
We can also get the number of unique values. If we just stick an n right in front of unique, nunique gives us a count. So this should be five. It tells us how many unique values there are. We print this out and see that there are five unique values in total, just the numbers 1, 2, 3, 4, 5. Pretty straightforward. Any questions on those functions? We're going to use these more and more. I would say head and tail we're going to use quite a bit, especially with data frames. And what's nice is that these functions on a series carry over to data frames. The head function, the tail function, shape, describe, and nunique will all carry over to data frames pretty easily, and unique you run on one column at a time. Data frames are that two-dimensional data structure that we will extend our knowledge into after we cover series. Any questions, though, on these functions? Good. Okay. Fantastic. So we can keep going then. Just like with NumPy arrays, we can do arithmetic with series. We can add series, we can subtract series. The catch is that it can be tricky to do arithmetic with series when they do not have the same index. Let's see how that plays out. These series have different indices. Remember, this series was built with the default index, 0, 1, 2, 3, 4, and this one was built with the character indices, a, b, c, d, e. So it might be a question of what actually happens if we add those series together. If we were just looking at the data arrays, those make sense to add together, because they're arrays of the same shape, so adding their elements together should be a very natural thing. The question is what happens with their indices.
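A minimal sketch of how pandas treats the index during arithmetic. It uses small string-labelled series invented for this example rather than the session's two series: pandas lines values up by label, adds where the labels match, and fills in NaN where a label exists on only one side.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=["a", "b", "c"])
same_labels = pd.Series([10, 20, 30], index=["a", "b", "c"])
shifted_labels = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Identical indices: plain element-by-element addition.
matched = left + same_labels

# Partially overlapping indices: the result covers every label from both sides,
# with NaN wherever a label is missing from one of the series.
partial = left + shifted_labels

print(matched)
print(partial)
```

By the same rule, two series that share no labels at all produce a result that is NaN in every position, with an index made up of both sets of labels.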
So let's see what we get. Let's print that out. It turns out that pandas doesn't know what to do with that, because they don't have the same indices. So this is just a warning. Normally you can add together series that have the same indices with no problem, and the resulting data is just going to have that same index as the first two. But these have different indices, so we have a warning here: these series have different indices, and we can't really add them together. Now, we don't get an error, but we basically get an undefined result, right? The values come back as NaN, because pandas doesn't know how to add things together. And look at what the new index is. It tried to combine the two: we get 0, 1, 2, 3, 4 and then a, b, c, d, e. It's just a mess. It doesn't really make sense to do. So if you're adding series together, or doing arithmetic in general, multiplying, subtracting, all those NumPy arithmetic operations we learned last week, you can do them as long as the series share the same index. Yes, the order matters. Yeah, for sure. The index order matters because that dictates the position of elements. The relative position of elements is determined by the index. Okay. Now I want to show you something really cool as well, which is the ability to apply a function to every element of the series. If we have a series, we can manipulate every element of it at once by using a special function called apply. Apply does exactly what it sounds like: it applies a function to every single element. So the argument to the apply function is itself a function, and here we have a special kind of function that we have not seen before, called a lambda function. This notation here is actually declaring a function.
Lambda declares a function without having to use def. It's a shorthand in Python to define a function in one line of code. So this is defining a function, very similar to def, and it's saying we have a function that has one argument, and we take that argument and square it. Whatever argument is input to this function, square it. What that means is that on this series we're going to take every element, because those are the arguments that apply passes in, and square them. So this has the effect of squaring each element, because ** 2, remember, is the second power. That's what apply means: apply some function to every single element. And here we're defining a function that says take the argument and square it. Now, do we have to do it that way? No. The alternative is to come up here and define a function called square, where we take any x and return x ** 2. That's a function that squares its argument. So instead of passing in this lambda, we could alternatively pass in our square function, which says we should be taking the elements of the series and squaring them. This is the same as above. Lambda is just a shorthand for doing def and returning this expression. This is your argument to the def, and this is what you're doing to that argument: you're squaring it. Does that make sense, that this lambda is exactly equivalent to this function? Lambda is just a shorthand for it. Yes, Roberto.
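The two equivalent forms described above can be sketched like this, assuming the same five-element series. The name `square` follows the session, and the other variable names are made up.

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5])

# Traditional definition: one argument in, its square out.
def square(x):
    return x ** 2

# apply takes a function and calls it on every element of the series.
squared_with_lambda = series.apply(lambda x: x ** 2)
squared_with_def = series.apply(square)

print(squared_with_lambda)
print(squared_with_def)
```

Note that `square` is passed without parentheses. Apply needs the function itself, not the result of calling it.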
Yeah, it's shorthand. So with lambda notation you declare lambda as a keyword, and then you write down your argument. You can actually have more than one argument, so you could do x, y, z if you have three arguments, let's say. With a series we only have one argument, which is each element. The intention is to apply this function to every element, so there's only ever going to be one argument. So we have one argument, then we have a colon, and then we have our math to say what we are going to do with the argument and return. So the colon is like shorthand for return, and lambda is like shorthand for def. So yeah, it's always def and return. >> Okay. And the other thing I don't want to get away from is this notion of apply. This is a really powerful idea. It's a nice shorthand to say I want to take every single element of this series, and there could be millions of them, and apply some function to every single one, and this is the way to do it. So apply is really powerful. We're actually going to use apply quite a bit as we move along and work with data frames. Apply will be really useful to apply a function to every column, or to specific columns that we want to transform. Any questions about apply, or about the input to apply? Remember, the input to apply is a function. Whether you define the function traditionally with def or you use the lambda shorthand, either way those are functions. Okay, let's make sure this works. Let's print out the squared series. Let's do both. Let's do the lambda version first, so let's uncomment that. Oh, I forgot to put this. Okay. Sorry, let me comment this out.
So we don't try to print that out again. Rerun that. Okay. So see how it ended up squaring everything. We took apply, gave it the lambda function, and it ended up squaring every element all at once. So everything gets squared. And again, we could do it this way, or alternatively we could run it this way and it should work the same. There it goes. So we get the same result whether we use the traditional def function definition or we use a lambda. Okay, cool. So, no questions on apply. We will use apply quite a bit whenever we want to transform all elements of a series. Remember, with apply this function is going to be applied to every element of the series, so apply will transform everything in that series according to this function. Now, the more targeted way to replace values is the series map. This is more targeted than apply. It's a replacement of values within the series. Essentially what we are doing here is taking our series and replacing this value with this data here, replacing this two that's in our series with this data here, and replacing this one with this data here. So you can choose which data values you're replacing in the map. You don't get to choose that with apply. Apply is going to apply that transformation to everything in the series; all of it goes through this function. Whereas map is going to do a specific replacement. So map can be useful if you only want to transform certain data values. If you want to find where there's a zero or a one or a two and replace those with certain values, you can do this. It works kind of like a find and replace. You can think about it like that. Find and replace.
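To make the apply step just demonstrated concrete, here is a minimal sketch. It assumes the sample series holds the values 1 through 5 with a default index, as the later sorting example suggests; the variable names are illustrative rather than the ones from the notebook.

```python
import pandas as pd

# Assumed sample data: the values 1..5 with the default 0..4 index.
series = pd.Series([1, 2, 3, 4, 5])


def square(x):
    """Return the argument raised to the second power."""
    return x ** 2


# Shorthand version: the lambda is an unnamed one-line function.
squared_lambda = series.apply(lambda x: x ** 2)

# Traditional version: pass the named function itself (no parentheses).
squared_def = series.apply(square)

print(squared_lambda)
print(squared_lambda.equals(squared_def))  # both approaches give the same series
```

Either form works because apply only needs something callable that accepts one element and returns the transformed value.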
So let's see what the result of this is if we print the mapped series. You can see that one, two and three get replaced, but we don't have any replacement for four and five. So those get mapped to NaN, because we don't have a specific replacement for them. Their original data was integers, we're now producing strings, and map doesn't know what to put for those two, so they get overwritten with NaN. So the thing you have to be careful about with map is to make sure you cover all the cases that you have within your series. We've handled one, two, three, and we know that the other data members are a four and a five, so we should probably pick some replacements for those. If we have a 4 we probably want to map that to "four", and if we have a 5 we want to map that to "five". Again, this is a more specific find and replace. So now all of that data has been replaced, and we have a replacement for every value. Okay, so map will map these data members to this value, and that will happen for every copy. If we had a bunch of 1s, they all would be replaced with "one". Every instance would be replaced, not just the first one. All right. Questions about map? Do we see how it's different from apply? Apply is going to take a function and apply it to everything all at once. We're not picking and choosing what to replace something by, which is what we're doing with map. All right. Very good. A couple more examples. Beyond applying transformations or replacing certain values with something, we can also do things like sorting. It should make sense to us that we can sort. By the way, the default sorting is least to greatest. So it's an ascending sort.
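The map behavior described above can be sketched as follows. The replacement strings ("one", "two", and so on) are assumptions for illustration; the session only says that integers are being replaced with strings.

```python
import pandas as pd

# Assumed sample data: the values 1..5 with the default index.
series = pd.Series([1, 2, 3, 4, 5])

# A dictionary that only covers some of the values in the series.
partial_mapping = {1: "one", 2: "two", 3: "three"}
partially_mapped = series.map(partial_mapping)
print(partially_mapped)  # the entries for 4 and 5 have no replacement and become NaN

# Covering every value in the series avoids the NaN entries.
full_mapping = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five"}
fully_mapped = series.map(full_mapping)
print(fully_mapped)
```

If only a few values should change and the rest should stay as they are, the pandas replace method is another option worth looking at, since it leaves unmatched values untouched.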
If we want to turn that off and do a greatest to least sort, we have to go in here and set ascending to False. Let me jot that down: ascending equals False will do greatest to least. So this will sort our data. We have 1, 2, 3, 4, 5 currently from the beginning to the end of our series. If we do the default sort, it's already sorted, but if we do greatest to least it should now come out as 5, 4, 3, 2, 1. So let's go ahead and print that out to be sure that it got sorted properly. Print the sorted series, and you can see now we have 5, 4, 3, 2, 1. You can also see the indices change to 4, 3, 2, 1, 0, because each data value retains its index. We're sorting the whole series, so the index gets reordered along the way. That's something to be aware of: when you're sorting the series, you're sorting both the data and the index that comes along with it. If we were to keep ascending as True, which is the default behavior, it would just stay how we had it, already sorted, 1, 2, 3, 4, 5. So in order to do greatest to least we have to set ascending to False. Does that make sense? Any questions on the sorting? That should be a natural thing we can do. You can sort a column in any spreadsheet, so we'd better be able to do that here. It makes sense as a basic operation to have the ability to sort a series. Okay. Very good. Let me move on to a couple of really important things that deal with missing data. We have a simple function here to check whether there are any missing values, which is series.isnull().
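As a sketch of the sorting step described above, and assuming the call in the notebook is the pandas sort_values method (the transcript does not spell out the name), the descending sort looks like this:

```python
import pandas as pd

# Assumed sample data: the values 1..5 with the default 0..4 index.
series = pd.Series([1, 2, 3, 4, 5])

# ascending=True is the default; ascending=False sorts greatest to least.
sorted_series = series.sort_values(ascending=False)
print(sorted_series)

# Each value keeps its original index label, so the index is reordered too.
print(sorted_series.index.tolist())
```

Note that sort_values returns a new series; the original series keeps its order unless the result is assigned back.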
So this will check and see if there's anything missing, meaning that we have a blank value or a NaN value in our data. We're missing something for that index for some reason. Maybe we loaded this data from a spreadsheet and we're missing a value. We want to be able to see if we actually are missing anything. Let's see what this returns. You can see what this does: it goes through and checks every entry and gives us a True or False for each one. It returns True for missing entries. So if there was a True here, that means that slot in the series has a missing value. It has a NaN. It was blank, essentially; it doesn't have a valid value there, so it's null. So this gives us a check for every entry, and maybe more importantly than that, we might want to know the total. Let me show you an additional example: we can see the number of missing values if we just tack on a sum at the end. If we take this and total it up as an aggregation, we get how many missing values there are. This should be zero. There is no missing data in our sample series; it's just 1, 2, 3, 4, 5. There are no missing numbers here, but we can confirm that this should be zero. You see here we get zero as the number of missing values. So both of these are valuable. The first is valuable to see which slots are missing, and the second tells us how many are missing. Can you insert a null? Yeah, we could overwrite a value to be null by doing something like series[0] = None. That would overwrite the first entry to be a null value. Does that make sense? This would overwrite the first entry to be null. Yeah, None. The Python equivalent is None. Yeah.
By the way, if we do that, look how many are missing now. After we do that, we get one, right? We overwrote that first item, so now when we do this, we actually get one. That should make some sense to us. Yeah, it's because count is usually counting the number of a specific data value: count how many zeros there are, count how many ones there are, count how many twos there are. In this case, we're getting a total. That's what sum really represents: how many total nulls do we have? So we're thinking of count a little differently in this context, as the count of how many of a certain data value we have. In this case we're checking isnull, which gives us a boolean, like a zero or a one, and we're totaling that up. All those Falses are like zeros, so if we total all those zeros we get zero. When we fill in a null value here, we get one missing value as the total, because now there's actually a True sitting here, which we could see if we printed series.isnull() again. There's now a True sitting here, which Python basically treats as a one. So we total that up and we get one. All right. So you may be asking, well, what do we do if we have a null value? Luckily, there is a nice function to replace null values with a default value, and that is the function fillna. If we're missing something, we can go ahead and fill it with a specified value. In this situation we have a null here. Let's fill it back in with a one, because that's what it used to be before I took it out and set it to None. Let's go ahead and fill that NA with a one. And then let's print the filled series.
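A minimal sketch of the missing-value workflow just described. One assumption to flag: instead of overwriting the first entry of an integer series with None (which changes the dtype and can trigger a warning in recent pandas versions), this sketch builds the series with the gap already in place.

```python
import numpy as np
import pandas as pd

# Assumed sample data: the first entry is missing, the rest are 2..5.
series = pd.Series([np.nan, 2, 3, 4, 5])

# True marks a missing entry, False a valid one.
missing_mask = series.isnull()
print(missing_mask)

# True counts as 1 and False as 0, so the sum is the number of missing values.
missing_count = series.isnull().sum()
print(missing_count)

# Replace every missing entry with a default value (here, 1).
filled_series = series.fillna(1)
print(filled_series)
```

Like sorting, fillna returns a new series, so the filled result has to be kept in a variable (or assigned back) to be used later.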
So now this should have that null value filled in with the one. And there we go. We have one there, and we have no more null values. So fillna will fill in any missing values with a default. Okay. Pretty convenient. If we identify anything missing, we can fill it in with a default using this fillna. Any questions on this? All right. What I wanted to show you next is the ability to filter a series. We can put in conditions to filter out different rows or different entries in the series. I want to show you how to do that very easily: we're just going to use our comparison operators to generate some filters for us. So I want you to see the syntax of that. We have this sample series that we built from this fake data. We have the index a, b, c, d, e, and we have those labels mapped to this data: 10, 20, 30, 40, 50. Okay, so some fake data there, and we build our series. So we have a new series sitting here. I want you to see the syntax of how we can generate a filter. Generally we can build filters by using the comparison operators together with the brackets for value selection. Remember, we typically use brackets to put an index in there. That's what we're used to doing, right? We put in a zero, or we put in an a, or we do a slice, and we can grab elements that way. So think of doing that, but with a comparison filter. For instance, this says we want to pick only elements of the series that are bigger than 30. We pretend like we were going to pass in an index here. We have series and a bracket, because what would typically go here would be something like this, right? That would typically select something from the series.
It would select the first item, or we could slice it to be something like that, right? That's how we select things. But in this case, we're actually going to do a filter, which is a condition to say I only want to keep items that meet this criterion. That's the filter. Okay. So inside the brackets is our filter condition, and we only keep elements that meet the filter. Let me show you what this should be. If we're filtering to say only give me items from the series that are bigger than 30, we should only get this part of the series remaining, right? Everything else should be filtered out, because 10, 20, and 30 are not bigger than 30. They're all less than or equal to 30. So these should be left out, and we should only see 40 and 50 as the result of that filter. So let's make sure we do. Let's print out this selected greater than 30. Here it is. Oh, it already did it for us. It already showed us that the greater-than-30 selection results in only those two parts of the series remaining. So that was an effective filter, giving us only the items that are bigger than 30. That's how we filter, and if we want to do other kinds of filters we just have to use other comparison operators. For instance, let's pick all the items of our series that are exactly equal to 20. How many of those do we have? We really only have one. Only this one exactly equals 20, right? So when we do this filter, it should ignore everything else and only give us this item out of the series. Our series is going to be filtered down quite a bit, to only this element. And we can see that.
So when we come down here, the part that gets printed out is only that part of the series: the index b and the data 20. So that is that filter. Next, not equal to 40: which items of our series are not equal to 40? That's what this filter is really asking. We go to our data and look at it, and we have a 40 here. Which parts of the series are not equal to 40? Well, it's basically everything else, right? 10 is not equal to 40, 20 certainly isn't, 30 isn't, and 50 isn't either. So everything but that 40 should pass the filter and be returned back to us. If we take a look at what is not equal to 40, we get everything else in the series: 10, 20, 30, 50. So that was an effective filter, exactly what we were hoping for. Okay, so greater than 30, equal to 20, not equal to 40. Those are all valid filters. And I want you to see that syntax, where the condition is inside of the bracket. It's basically replacing an index. It says I want to pick something based on a condition, not based on an index. Any questions on that? Okay. So I'm going to show you some other examples of conditions. For instance, we can actually combine conditions using logical shorthands. This is multiple conditions. We're combining greater than 20 with less than 50. So this is greater than 20, and this symbol here that we haven't seen yet is the shorthand for and. It's the boolean and symbol. So this ampersand symbol is shorthand for and. This means we're looking to keep data that is bigger than 20 and less than 50. So multiple conditions combined. Now, we could do or. There's a shorthand for or that we could use in here, and that is the pipe symbol. The pipe symbol is shorthand for the logical or, so we could have a pipe symbol here. Oops. I don't know what happened.
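The three single-condition filters walked through above can be sketched like this, using the sample data from the session (index a through e, values 10 through 50). The variable names are illustrative.

```python
import pandas as pd

# Sample data from the session: labels a..e mapped to 10..50.
series = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])

# A comparison on a series produces a boolean series (a mask), one entry per element.
mask = series > 30
print(mask)

# Putting the condition inside the brackets keeps only the entries where it is True.
greater_than_30 = series[series > 30]
equal_to_20 = series[series == 20]
not_equal_to_40 = series[series != 40]

print(greater_than_30)
print(equal_to_20)
print(not_equal_to_40)
```

Seeing the mask on its own can help: the brackets are not doing anything magical, they simply keep the positions where the mask holds True.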
Yeah, we could have a pipe symbol here, which would signal that we are looking at this condition or this condition. So things that are less than 50 or bigger than 20, which would really be everything. It's kind of a useless condition, but we can have it. So does that make sense, that those are just shorthands for and and or? Because we can filter on conditions, there's nothing stopping us from filtering on multiple conditions that are logically combined. So greater than 20 and less than 50: what should that be? It should really just be 30 and 40. Those are the only values that are greater than 20 and less than 50, so we should just get those two out of it. If we go down and look at it, we get those two as the result of the filter, which makes sense. So that's a valid way of filtering. What's really cool is you can also filter based on a specific list. You could say I only want to keep values that are inside of a specific list. This is where we use the isin function. It gives a boolean True or False depending on whether each member of the data is inside of this exact list. What this means is we should only be keeping values that are 20, 40 or 60. Only those three numbers should be kept. If we go back to our data, we're only going to keep two of these entries. There's no 60, so 60 wouldn't match anything. We're really only going to keep the 20 and the 40. Okay. So isin is a special kind of filter that only retains values that are actually in this collection. If we filter by this list, only the 20 and the 40 remain. You can also do string filters. For instance, if our data was all strings, we could filter based on the string starting with a b. That's actually a very specific condition, right? We want to keep only the strings that begin with b.
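Here is a hedged sketch of the combined conditions, the list-membership filter, and the string filter discussed above. The numeric series matches the session's sample data; the fruit names other than banana are assumptions, chosen only so that the strings start with a, b, c, d and e.

```python
import pandas as pd

series = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])

# Each condition needs its own parentheses, because & and | bind more
# tightly than the comparison operators.
between_20_and_50 = series[(series > 20) & (series < 50)]   # logical and
either_condition = series[(series > 20) | (series < 50)]    # logical or

# Keep only values that appear in the given list; 60 matches nothing.
in_list = series[series.isin([20, 40, 60])]

# Hypothetical string data for the startswith filter.
string_series = pd.Series(["apple", "banana", "cherry", "date", "elderberry"])
starts_with_b = string_series[string_series.str.startswith("b")]

# The match is case sensitive, so an uppercase "B" selects nothing here.
starts_with_upper_b = string_series[string_series.str.startswith("B")]

print(between_20_and_50)
print(in_list)
print(starts_with_b)
```

The parentheses are the detail that most often trips people up: leaving them out usually produces an error rather than a wrong answer.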
So in this case, banana should be the only entry that is returned. Everything else starts with a, c, d, or e. So that's possible. Okay, let me pause there. Any questions about these? On this one, the startswith? Yeah. So what this means is we have a series that has strings. If we were to filter it, what we're doing is saying, okay, here's our overall bracket. That does not change, right? We have our string series and we have our bracket to say we want to select something. The brackets always mean let's select something, so first of all, that's why we have a bracket. Then this part accesses the strings. Whenever you want to access string functions, you use series.str, which is the string accessor of the series. That gives us access to string functions, and one of those functions is finding out if a string starts with a certain character. So that's what the syntax means: give me only the things that start with a b as a character. Which strings start with a b? In this case, this is the only one. And you can see that, based on the startswith, the only thing that gets returned is the banana. Yes, it is case sensitive. If we did capital B, this would be different. This would be nothing, because nothing starts with a capital B. So yeah, lowercase b is not the same as uppercase B. So we'll continue. We were practicing selecting items from the series, and one of the crucial ways of doing that, which is also going to be very crucial for data frames, is loc and iloc, which operate on different indices. What's really important to know is that loc allows us to pass in the labeled index. That's the index that we have, which is either the default one or the one we provided when we created the series.
In this case we created a series that had an index with a, b, c, d and e. So loc will allow us to select multiple of those indices, almost like a slice. Loc allows us to use our index: we use our series index to select data. Here we're selecting a, we're selecting c, and we're selecting e. We pass those in as a list inside of the loc brackets, and we can pass in as many as we want. We can also pass in just a single value; it doesn't need to be inside of a list. But loc is pretty typical for selecting multiple elements using the index. So that's the key thing to remember about loc: it uses the index of the series to select items, and you typically pass in a list of the indices that you want to pick. We're picking the element at a, the element at c, and the element at e. Going back to our data, that should be 10, 30, and 50. And if we go back and look, that's those three, right? a, c, and e; 10, 30, and 50. Okay, so loc uses our indices. Now, how is that different from iloc? Iloc is position. It always uses position regardless of index. So even if you have a really nice index that you've provided, like a date, or in this case a, b, c, d, e, we can ignore it and just use the position within the series, which is what iloc does. It just uses the numerical position of the elements. For instance, if we want to pick the very first item, we know it's at index a, but in terms of its position it's really just position zero within the series. Position zero is the very first item, even though it has index a. In this case we're picking the positions one all the way up to four. We're doing a slice, and we can do that: we can slice based on position. So that's how iloc is different from loc.
Loc uses the index, whatever index we have defined, which can be special, right? It can be dates, it can be strings, it can be objects. That index can be whatever we define it to be. Iloc will always use numerical positions, always. So it would be invalid to put a label like this into iloc. That would give us an error. We can only use a label with loc, because that is the index of the series. The positions are just the data positions within the series array. There's a natural ordering of our data within the series that always exists, and we can use that with iloc. Both select data, but they use different indices: loc uses the defined index of the series, and iloc uses the numerical position of the elements. Okay, any questions on those? By the way, this is one all the way up to four. What is at position one would be this one, 20, even though it's at index b. And what is position four? Well, this is zero, one, two, three, four, so it's 50. So the slice goes all the way up to 50, but not actually including 50. That's the slice. So it should be 20, 30, 40. What is four? It's a position. Counting from zero, this is the zeroth element, this is the first element, this is the second, the third, and this is the fourth, right? Numerical position. That's what iloc uses. Based on this data, what do you think is at position one? If you were looking at this, 10 is not at position one. 10 is at position zero. What's at position one? Right, it's 20. Yep. So let me explain it one more time. Series have an index we can define, right? In this case we have five elements, and so we have 10, 20, 30, 40, 50.
And do we agree that the original index we set up for this is a, b, c, d, e? Does this make sense for our series? This is what our series looks like. Do we agree on that? So without worrying about the positions, or loc or iloc, for a moment, does this make sense for our index? Based on this data, that's our index: a, b, c, d, e. Okay, so we do agree on that. Now, if you were to ignore this index for a minute and just think about this data's positioning from left to right, what position would we say the 10 is at? If we were to just think of it as a list, let's say, what position is that? Right, it's position zero. And what position is 20? It's at position one, right? And so what should 30 be? Two. Perfect. 40 should be three and 50 should be four. So what I was saying with those examples is that iloc uses these positions, while loc uses the index. Does that make sense? Think of it this way: iloc uses those positions of the data, their relative positions, to select things. So when we say one to four, that's this slice of the data, not including the four, I should say. So it's really this slice. Yeah, iloc is default indexing and loc is user indexing. Yes, I think that's a good way of thinking about it. Yep, for sure. Go to the syntax? Yeah, let me go over to there. Does it make more sense when you see it with the picture? So you see the code: loc using a, c, and e is picking those three items using the defined index. And 1:4 is a slice. It's this slice that I've circled here. And remember that the four really means it goes up to four but not including four. So it's the slice of positions one, two, three. Yeah, no worries. Any other questions on this? All right. Very good. Very good.
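To tie the label-based and position-based selection described above together (the loc and iloc indexers in pandas), here is a small sketch on the same sample series. Variable names are illustrative.

```python
import pandas as pd

series = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])

# loc selects by index label: a list of labels, or a single label.
by_label = series.loc[["a", "c", "e"]]
single_label = series.loc["a"]

# iloc selects by numerical position, ignoring the labels entirely.
first_item = series.iloc[0]
by_position = series.iloc[1:4]  # positions 1, 2, 3; position 4 is excluded

# Passing a label to iloc is invalid, because iloc only understands positions.
try:
    series.iloc["a"]
    label_rejected = False
except TypeError:
    label_rejected = True

print(by_label)
print(by_position)
print(label_rejected)
```

One difference worth knowing when slicing: a position slice with iloc stops before its end position, whereas a label slice with loc, such as series.loc["b":"d"], includes the end label.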
Okay. So that was the end of this notebook, but there is some extra practice at the end, and I encourage you to try it out on your own. When you get a chance, use this 4.01 notebook to get some extra practice accessing and manipulating this data. You have some example arrays here. You'll build some series and go through and select some data using the different methods we just showed, with loc, picking different indices, and things like that. So this will be some extra practice for you. What I want to do now is go over and talk about the data frame. We've talked about series so far, and we've seen some different functions on a series. Most of these are going to carry over exactly into the data frame, in the two-dimensional setting. Everything we've been doing has mostly dealt with one-dimensional data, but we are going to deal with two-dimensional data. Let's go over to 4.02 now and talk about the data frame. Okay. If you have a moment, pull up the 4.02 notebook. We're going to go over to that and talk about the data frame now, which is going to extend everything we just learned into two-dimensional data, with rows and columns. If you have this notebook open, go over to 4.02. We're going to do that one next. Okay, so let's remind ourselves what a data frame is. A data frame is going to be our two-dimensional extension of the series, where we have a tabular data structure, meaning it's going to look more like a table. We're going to have rows and columns to deal with now. And what's going to be really interesting is we'll actually have multiple indices to worry about, because we have multiple dimensions.
So imagine if we had these two series of apples and oranges. We have a series called apples. It has this data, 3, 2, 0, 1, and it has just a natural positional index of 0, 1, 2, 3. So nothing that interesting. We have another series over here called oranges that has data that looks like 0, 3, 7, 2. It also has the same index of 0, 1, 2, 3. That's fine. That's just a default index. Now what we can do is concatenate these into a single data structure called a data frame, where these two series form columns in a table structure. So now we have more of a table, which is a data frame that contains one row index. This is called the row index here. It has row zero, row one, row two, row three. So we can think about the data at a row level. We can say, okay, this is one row, this is one row. It's very much like a spreadsheet, right? We have that, and then we have data at the column level. We have what is known as the column index here. The column index just allows us to select columns based on their names. So we can always go into this table and grab the apples column. We can grab the oranges column. We could grab the apples and oranges columns together and pull back multiple columns. We have that flexibility with a data frame, just like you would in a spreadsheet. You can pick data at a column level as well. So you always have that flexibility in a data frame to select rows, to select columns, or to select rows and columns. That's something we're going to learn how to do: select data based on where it is, what row it's in, what column it's in, or select sections of a data frame by rows and columns. What will be the merge title? So that's up to us as a variable. The title of the data frame is not that important.
The point is more that when we put these two columns together, we get a single data frame with two columns, and we have a row index and a column index. Oops, this is the row index. But the name is whatever we want to call it. We can label it whatever we want as a variable. We can say this is our fruits data frame and just call it that: a fruits data frame which has apples and oranges as columns, and it has four rows of data and two columns. Basically, the shape of this is a four row, two column matrix, right? Think about this as a NumPy array. That's what it is: four rows and two columns. Okay. So I told you it has a row index and a column index. Those will be really useful when we want to select data from the data frame. But the other thing to point out is that there's no restriction on the data types in the data frame. The data types of one column can be completely different from the data types of the next column. This column could be all strings and this column could be all floats. That's totally fine. There's no restriction on what the columns can be. They can be mixed data types. Oh, that's just random data. It doesn't represent anything meaningful. We just wanted some values to be in there. Yeah, they can be different within each row. Let's say this column was a string and this column was a float. Then within this row, this value is going to be a float, but this one is going to be a string. So yeah, if you have mixed data types across columns, each row is going to hold different types. >> Yeah, we're going to have lots of those. Yeah, they're coming up.
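The apples and oranges picture described above can be sketched as follows. Using pd.concat with axis=1 is one way to place two series side by side as columns; whether the slides build the table this way is an assumption, and the extra store column is purely hypothetical, added only to show mixed column types.

```python
import pandas as pd

# The two series from the example, each with the default 0..3 index.
apples = pd.Series([3, 2, 0, 1], name="apples")
oranges = pd.Series([0, 3, 7, 2], name="oranges")

# Place the series side by side; each series name becomes a column name.
fruits = pd.concat([apples, oranges], axis=1)
print(fruits)
print(fruits.shape)  # (number of rows, number of columns)

# Select one column (returns a series) or several columns (returns a data frame).
apples_column = fruits["apples"]
both_columns = fruits[["apples", "oranges"]]

# Hypothetical extra column of strings: columns may hold different data types.
mixed = fruits.assign(store=["north", "south", "east", "west"])
print(mixed.dtypes)
```

The variable name fruits plays the role of the data frame's "title" from the discussion: it is just a Python variable and can be called anything.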
This is just to get us introduced to the data frame. We're definitely going to have meaningful ones coming up. The most meaningful example I can give you is that most spreadsheets can be loaded into a data frame structure. If you take any CSV or Excel sheet, it can be loaded into a data frame, and then we can do all of our analysis on that data frame. That's typically what we're going to do with the data inside a comma-separated values file, a CSV file: load it into a data frame. We're going to practice doing that quite a bit. All right. Before we get to those examples, let's talk about how we can create some example data frames in code. Here we import pandas as pd, and one of the most common ways of creating a data frame is from a dictionary. Why from a dictionary? Because the keys in the dictionary become our column names, and the data within those columns comes from the lists. I'll show us that data frame in a moment, but this is pretty common: the keys become the column index, and the values, which are the lists, become the data within the columns. So the name has these three strings, the age has these three ages, and the salary has these three integers. Let's go ahead and see that data frame. It looks like this: name, age, and salary. It has three rows, which is not surprising because every column has three entries. And here the name column has Alice, Bob, and Charlie. 
The age column has the ages that were in that list, and the salary column has those three values. So using a dictionary is one very common way of building a data frame. The other way is to use mixed-type lists, where you define every row. Creating a data frame from lists is possible when we have a list for each row: the first row, the second row, the third row. Then we manually supply what the column names should be. So this is all the row-level data and these are the column names. We build a data frame from that using pd.DataFrame: we pass in our row data and we say here's what our columns should be labeled, here are the column indices. That builds the same exact data frame as we had before, just not from a dictionary but from a list of lists. I agree, the dictionary is probably the cleanest way to build a data frame. But this does build the same exact data frame. Same thing with numpy arrays. This is a 2D numpy array of our rows: here's the first row, here's the second row, here's the third row. And we can give it our column indices as well. So again, the numpy array data forms the rows of the data frame, and we pass that in to build it. This should be the same data frame that we had before, just built from a numpy array. All right. Let me give you a more practical example that will be realistic to a lot of the use cases we will see over the course of the program when we're loading in data. 
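The three construction routes just described can be sketched as follows. The names Alice, Bob, and Charlie come from the session; the ages, salaries, and exact column spellings are placeholder assumptions, since the notebook's values are not shown in the transcript.

```python
import numpy as np
import pandas as pd

columns = ["Name", "Age", "Salary"]

# 1) From a dictionary: keys become column names, lists become column data.
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],               # placeholder ages
    "Salary": [50000, 60000, 70000],   # placeholder salaries
}
df_from_dict = pd.DataFrame(data)

# 2) From a list of lists: one inner list per row, column names passed separately.
rows = [
    ["Alice", 25, 50000],
    ["Bob", 30, 60000],
    ["Charlie", 35, 70000],
]
df_from_rows = pd.DataFrame(rows, columns=columns)

# 3) From a 2D NumPy array of rows. dtype=object keeps the numbers as numbers;
#    without it NumPy would coerce every value in a mixed array to a string.
array = np.array(rows, dtype=object)
df_from_array = pd.DataFrame(array, columns=columns)

print(df_from_dict)
print(df_from_rows)
print(df_from_array)
print(df_from_array.dtypes)  # dtypes can differ from the dict version
```

All three hold the same values. One caveat worth checking on your own setup: the array route may store columns with a more generic data type than the dictionary route, so the frames can match in content while differing in dtypes.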
And this is one that's used all over the industry. People use it all the time. I use this one nearly every day: reading data from a file. If you want to read a CSV file, you use the pd.read_csv function, and all you have to do is pass in the file path. Now, this path looks a little wonky because I am inside Colab and I had to mount this file from my drive. This data is in my Google Drive on Colab. But generally, if you're in VS Code, this would be a path to the data somewhere on your computer. So this is just a file path to the CSV file. And by the way, you should have this data. Let me show you where it is. If you go into your LMS, let me log in. Do you have access to the data sets? They should be in your reference materials, under this data sets link. So go to the data sets in the reference materials. That's where I'm getting this house price CSV from. If you download that and extract it onto your machine, that's where you can get the file. By the way, if you have it in your Google Drive, you have to run this code first to be able to access files that are stored on your drive. This is only if you're running in Colab: run that in a cell, and it makes your data accessible. Let me write that down: this makes your Google Drive data accessible in Colab. Once you run that, you can access data that's within your drive. If you go to this folder icon, you can browse your drive. There is, but only for the folder you have open. 
It's just the file explorer on the left, Roberto, but it only shows the folder you have open. So what I would recommend if you're working in VS Code is to put that data set in the same folder as your notebook so it can discover the file pretty easily. Does that make sense? There are two files this notebook needs access to: the house prices CSV and the Iris Excel file. Put those in the same location as this notebook and it should work. Let me actually put a comment here. You should be doing pd.read_csv and then the file path to the CSV, wherever that is. I think the easiest thing to do is put it at the same location as where the notebook is saved, because then it should just be pd.read_csv with house price.csv. Yeah, that loads the whole file. It loads all the data in memory. That's right, Tim. On "I don't have the import drive command": yeah, this is only for Colab. You need it to access data on your Google Drive. The two files are the house prices CSV and the Iris XLSX Excel sheet. Do you have those two? So you can build a data frame from here. Now, when you load all this data and print it out, it's actually quite a big file. It has about 5,000 rows, so it's not going to show you all of that data, but we're going to have some functions that can easily get us summaries of it, which will be nice. We'll have that coming up shortly. Sorry, about 4,600 rows. So it's a decent amount of data there. And the iris data also has 460 rows and 18 columns, or 4,600 rows, I should say. 
Sorry. All right. Were you able to run this? Were you able to read the CSV using pandas? Not able to? Make sure that the path you have is correct. Oh, that's fine. The output's getting truncated. That's expected. It's not going to show you all 4,600 rows. Are you running in Colab? You are, okay. You have to give permissions. There should be a popup when you run this code, to give permission from your Google account for this notebook to access your files. Make sure that popup actually shows up and that you click through and validate it. Worst case, by the way, if your drive isn't mounting, there's a backup. Do you see this files panel on the left? This is for everyone running Colab. If you're unable to get the drive to mount, you could just upload the file into the workspace. If you go to files, you can hit this upload button to upload data. Click on this, then click on this, and you can upload the file manually into the notebook workspace. So you could do that too if you have the file on your computer: just upload it. That works without having to mount your whole drive. So I just uploaded the house price CSV, and now I have it available in the session. What should it look like? It should look like this. You don't have to use Google Drive. You can do it this way too. Do you see the way I just showed, where you upload it into the notebook session? You can do that. 
You can keep the file on your local machine and then just upload it to your notebook using this files upload. Roberto, do you see this? All of this data displaying, that's how you know it worked. If you rerun the cell, it should display the same thing. If you scroll back up, people posted it, but it's in your LMS, in the data sets here. See this data sets link? It's this one. Download the data sets. We're using the house price CSV and the Iris XLSX. Okay, so you should be able to see this. Now let me just take a step back and say this is loading the data into a data frame. The contents of the CSV are now in a data frame, so we can do data frame operations on that data, which is going to be really useful. We can manipulate it, do summaries, do group-bys, do filtering, all kinds of things, now that we've loaded the data into the data frame. We are going to use read_csv a lot during the program to load data from a spreadsheet-style file into a data frame and then manipulate that data frame. And there's a read_excel. pandas actually supports a lot of different read functions: read_json, read_parquet, Arrow-based formats such as Feather, all kinds of file types. So not just CSV and Excel, even though those are very common. We can read all kinds of different files in pandas and load them into a data frame, as long as the format is supported. All right. So we practiced loading in that data, but now what we want to do is practice accessing data from that data frame. 
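A small sketch of the loading step. The course's house price file is not available here, so this example writes a tiny stand-in CSV with made-up rows to a temporary folder and reads it back; the file name house_price_sample.csv and the row values are assumptions. The column names are limited to ones mentioned in the session. The Colab drive mount is shown as comments because it only runs inside Colab.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Colab only: mount Google Drive before reading files stored there.
# from google.colab import drive
# drive.mount("/content/drive")

# Stand-in for the course file: three made-up rows.
csv_text = (
    "date,price,bedrooms,bathrooms\n"
    "2020-01-01,100000.0,2.0,1.0\n"
    "2020-01-02,250000.0,3.0,2.5\n"
    "2020-01-03,400000.0,4.0,3.0\n"
)
folder = Path(tempfile.mkdtemp())
csv_path = folder / "house_price_sample.csv"
csv_path.write_text(csv_text, encoding="utf-8")

# read_csv takes the file path and returns a data frame.
df = pd.read_csv(csv_path)

print(df.shape)
print(df)
```

If the CSV sits in the same folder as the notebook, the path can simply be the file name as a string.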
Now that we've seen we can load data in, how do we actually access data and then start to manipulate it and summarize it? All those things we were doing with series, we want to extend to the data frame. So let's get some practice there. Let's scroll down and build an example data frame with some fake columns. We have column name, column one, column two, and another column. We're going to practice accessing data from this, and eventually we'll come back to the data frame we just loaded from the house prices or the iris file and practice working with that. But for the moment, let's work with this fake data frame. We call pd.DataFrame, which builds it out of this dictionary, and here's our column data mapped to these different column names. If we want to access a column, it's super easy: we pass the column name inside the brackets. So we write df, which is the name of our data frame, and then the column name in brackets, and that accesses all of the data within that column. By the way, what do you think is the data type of an entire column? We just studied it. What is an entire column inside a data frame? It's a particular type of object. You're right, it's an object, but what type of object? It's a pandas object, not a string. We were just manipulating it. Series. Very good. So a column is a series. When we access this column, it's as if we're accessing an entire series. If we print out that column, we get this. It looks like a series, doesn't it? It looks the same because that's what it is. We're grabbing that entire column, which is 5, 15, and 8. 
So to access a single column we just use the brackets, pass in that column index name, and pull back all the data from that column, which is really a pandas series. By the way, if we try to access an invalid name, we will get an error. Let's say we ask for some column that doesn't exist. This will give me an error, and you can see what kind: a KeyError. Basically, pandas doesn't know how to access that column because it doesn't exist. So we would not want to do this. We would want to use the proper column name, and that should work just fine. All right. Now, how about when we pull back multiple columns? We can use a list and select multiple columns at once. Now, what is a collection of series? We are studying it right now. Perfect. DF. You got it right: data frame. When we pull back multiple columns, this should be multiple series, which is a data frame. So this is going to be kind of a miniature data frame. It's only going to be these two columns' worth of data, selected out of the whole data frame. Let's look at that. When we access multiple columns, we get these two, column one and column two, as a data frame, and we see the data for the three rows of the data frame. 
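The column access patterns above can be sketched on a stand-in for the example data frame. The session confirms that the first column holds 5, 15, 8 and that the first row holds 5, 10, 100, 25; the remaining values and the exact column spellings used here are assumptions.

```python
import pandas as pd

# Stand-in for the example data frame (column spellings are hypothetical).
df = pd.DataFrame({
    "column_name": [5, 15, 8],
    "column_1": [10, 20, 30],
    "column_2": [100, 200, 300],
    "another_column": [25, 35, 45],
})

# A single column name in brackets returns a Series.
single = df["column_name"]
print(type(single).__name__)  # Series
print(single.tolist())        # [5, 15, 8]

# A list of column names returns a smaller DataFrame.
subset = df[["column_1", "column_2"]]
print(type(subset).__name__)  # DataFrame
print(subset)

# A name that is not in the column index raises KeyError.
try:
    df["some_column"]
except KeyError as err:
    print("KeyError:", err)
```

Note the double brackets in the second selection: the outer pair is the access operation and the inner pair is the Python list of names.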
So we can access as many columns as we want by passing a list into the data frame brackets. Now, we can't forget about loc and iloc. iloc, for instance, allows us to select particular rows based on position. iloc of zero is saying let's grab the data that is at row zero. Again, this uses the positional index, zero, regardless of what the actual index is. Data frames can have a user-defined index: the rows can have a custom index, just as we know a series can. iloc doesn't care about our custom index. It uses the positional index to grab that entire row. So df.iloc of zero is grabbing the first row, which has values of 5, 10, 100, and 25. If you look at it, that makes sense: that's this first row here. We can also do filters, just like we did with series. How do we do that? It's exactly how we would have done it with a series. We use a bracket to signal that we want to access something, and inside we give a filter. This filter keeps the rows of the data frame that meet the condition. So any row where this column name has a value greater than 10, we keep. We go to that column and ask which values are bigger than 10. It looks like only this one is bigger than 10, so this row here should be the only row that gets kept, because it's the one where this column has a value greater than 10. Do we see how that filter works? Let's go back to the syntax. 
We're checking where a particular column is greater than 10, and we keep only those rows where the condition is met. So let's go and double check this. In the filtered rows, we are keeping only that one row, because that's the only place where this value is bigger than 10. That's very powerful: we can do this kind of filter operation on the whole data frame. Think of the filter operation we did on the series; we're extending it into two dimensions. We keep only the rows where the condition is valid. Any questions on that? Yeah. So why do we need df twice? Overall, we access data by using df and then the bracket. We saw that with series too: it's a series and then a bracket. We saw that with lists: it's a list and then a bracket, and we pass in an index. So the first df with the bracket signals that we're accessing something. Now, the reason we have df twice is that inside the bracket there is a filter condition, and the condition says I want to look at this column. We access that column with df, bracket, the column name, and then check where that column is bigger than 10. That's why we see df twice: once on the outside to signal that I want to access something from this data frame. What do I want to access? Only the rows where the condition is true. Does that make sense, Roberto? Yes. We're going to study that coming up. We're going to study that exact thing. 
How to get a summary. We're going to look at that. It's coming up soon. That's correct. If we had multiple conditions we were checking, we would be using df multiple times. Yes, that's true. Show once again how we can load data from a local folder? Yeah, sure. Do you see this folder icon on the left? Click that folder and it should show something like this, and then click this upload button right here. That's how you load data into your notebook session. By the way, do you see the file in there now, after you upload it? You can get its path by clicking the three dots to the right of it. Click on copy path, and then it's the same code: read_csv with your path inside the string, like this, which is probably the same as this for you. It's not from your computer? No. If you're in Colab, it's not going to be from your computer. It's going to be the path of the uploaded file, or your drive path, wherever it is in your drive. Just somewhere in the cloud. And you can always copy that path. Okay, where was I? Oh, we were doing the filtering. Let me ask you: does the filtering make sense to us? It does. Okay, perfect. I have one more thing I want to talk to you about on the filtering, and then we'll take a slightly longer break. We've been going for a couple of hours now. 
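A short sketch of the filter, split into two steps to show why df appears twice. The data frame is a stand-in: the first column (5, 15, 8) matches the session, while the other values and the column spellings are assumptions. The two-condition filter at the end is an illustration of the "multiple conditions" remark, not code from the notebook.

```python
import pandas as pd

df = pd.DataFrame({
    "column_name": [5, 15, 8],
    "column_1": [10, 20, 30],
    "column_2": [100, 200, 300],
    "another_column": [25, 35, 45],
})

# Inner df[...]: select the column and compare it, giving one boolean per row.
mask = df["column_name"] > 10
print(mask.tolist())  # [False, True, False]

# Outer df[...]: keep only the rows where the mask is True.
filtered = df[mask]   # same as df[df["column_name"] > 10]
print(filtered)

# Several conditions: wrap each in parentheses and combine with & (and) or | (or).
both = df[(df["column_name"] > 5) & (df["column_2"] >= 300)]
print(both)
```

Python's plain and / or keywords do not work on whole columns here, which is why the element-wise operators & and | are used.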
So once we filter like this, we're basically picking all the rows that we want out of the original data frame, because that's what this does: it selects rows that meet the condition. But there's nothing stopping us from accessing a particular column after we do that filter. I'm going to scroll down to the bottom of this cell and go to this example, and I'll come back and talk about at, iat, and loc once we come back from our break. Do you see this is the same exact filter again? Let me copy this and make a comment here: this part gives us the df rows that meet the condition, and then we access a particular column of those rows. Do we agree on that? So this is saying I've already filtered my data frame down to this set of rows, but it's still all columns. If I were to draw this, we've filtered it down to this selection of rows. And what this part does is say let's zoom in on this column. We are allowed to do that in the syntax: once we filter, we can select a column from the filtered result. We can chain those together, which is what that code is doing. Yeah, I'll summarize it in a moment. All right. Let me summarize the things that we've done in this cell. Let me move that. All we did was create a data frame with some example data. We didn't even use the data we loaded yet. We will come back to it, but we're just using this example data from this dictionary. And we were just practicing some different things, like accessing a single column. 
All you have to do is put the brackets and then the column string for whatever column you're selecting. That retrieves the series of data, in this case the column 5, 15, 8. You can select multiple columns if you pass those names inside a list. So column one and column two: this gives us a data frame result, because now we have multiple series being pulled back, and multiple series equals a data frame. That result gives us only these two columns out of the four columns that exist. We can use iloc to grab a row based on its positional index. This is the first row; we could do the second row, the third row, and so on. Then we did a filter. This is just a basic filter, where this is the condition, and it says let's grab all of the rows out of this data frame where this condition is being met. Yeah, a filter is different from a slice, because it is conditional. A slice is position-based, not conditional. This only selects data that meets the condition. And then what we said is that after applying the condition, we can select a column right after that, which chains it together. Yes, it's like bidirectional. Yep. So, while you try that, let's continue with this set of examples. We had just gotten to at. Now, at is like the equivalent of loc for accessing a single entry in the data frame. So this is the equivalent of df.loc for a single value. 
So at uses the index: the row index and the column index. The row index in this data frame is just the default, the positional one. Does it have to be? No, it could be dates, it could be strings, it could be anything. In this case, we didn't specify any special row index when we created the data frame from the dictionary. We do have column indices, which are these names, so we can use those. So when we use at, we can say I want the value at the first row, which is row zero, or the second row, which is row one, and so on, and those labels all match the positions because the default index is the true row index of this data frame. We also have the column index, so we can access things based on that. With column name, we're grabbing the element in the column name column, in the first row. So this should fetch us this 5, because we want to be within this column but in the first row. If you go down to single cell by label, this should give us 5, which is what it does. So sometimes you can use at. Again, you only ever use it if you want to access a single item, which is kind of rare. Usually you're going to be grabbing multiple rows or multiple columns, but there is the at accessor for passing in a row index and a column index to access a single item. Yes, that's true, Romero. But I would call it the row index and column index. In this data frame, the row index is zero because it has the default index, which is 0, 1, 2, 3. This is the row index, right? 
This is the row index and then this is the column index. All right. So with at we have the equivalent of loc, but for accessing a single item. We also have iat, which is the equivalent of iloc for accessing a single element. This one uses the positional index for both row and column. We do not use the name for the column. We only use its position, going from left to right. Are we in the first column, the second column, the third? This would signal that we are in the second column, because that is the column at index one. If you use positions with iat or iloc, this is column position zero, this is column position one, then two, three, regardless of the actual column index. Every column has a position within the data frame going left to right, and iat uses this column position. It also uses the row position, regardless of what the row index is. They just happen to be the same here because the data frame uses the default row index. So iat with 0 comma 1 would be the first row, second column. That should be this one. We look into our data, which is this data frame right here. First row, second column, it should be 15. So we want to make sure that we get 15. If we run that, single cell by position, I don't know why it's saying that. Finally we have df.loc. If you remember, and we've kind of already talked about this, loc uses the indices: the row index and the column index. In this case we're selecting the first row, in the column named column name. 
So we should be getting the first row, but only whatever is in the column name entry. If we go back to the data, we should be in the first row but only getting this 5, which is the value in that column. So loc uses the indices, just like at does. But we would use loc when we want to select multiple values, just like we did with series. You can even slice the rows. For instance, we can grab the first two rows with a slice, and that would pull back multiple values. It's going to grab rows zero and one and the data belonging to column name. So you can use loc to grab multiple entries, and you can have a slice in there, which is pretty interesting. Here's a slice of rows zero and one. All right. Any questions about these access patterns? We're going to get to some really important summarization functions coming up. I think the most common way we will access data will be picking columns like this or, honestly, doing filters like this. Those will probably be the two most popular ways we select data. JD, for the rows, yes, because most people use the default row indices. But the columns usually have a label, a name, like this example up here. If you think about an Excel spreadsheet, columns usually have a header. Rows usually have a default index like 0, 1, 2, which lines up with their position. Okay, very good. All right, let me get to some exciting summary functions that we will cover next. 
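The access patterns covered in this cell can be sketched side by side. The data frame is a stand-in: the first column (5, 15, 8) and first row (5, 10, 100, 25) follow the session, while the other values, the column order, and the column spellings are assumptions, so individual outputs may differ from the instructor's notebook. With the layout assumed here, position (0, 1) holds 10 and the value 15 sits at position (1, 0).

```python
import pandas as pd

df = pd.DataFrame({
    "column_name": [5, 15, 8],
    "column_1": [10, 20, 30],
    "column_2": [100, 200, 300],
    "another_column": [25, 35, 45],
})

# iloc: an entire row by position, returned as a Series.
first_row = df.iloc[0]
print(first_row.tolist())  # [5, 10, 100, 25]

# at: a single value by row label and column label.
by_label = df.at[0, "column_name"]
print(by_label)  # 5

# iat: a single value by row position and column position.
by_position = df.iat[0, 1]    # first row, second column
second_row_first_col = df.iat[1, 0]
print(by_position, second_row_first_col)  # 10 15

# loc: labels again, but it also accepts slices. Label slices include both ends.
one_value = df.loc[0, "column_name"]
two_rows = df.loc[0:1, "column_name"]
print(one_value)          # 5
print(two_rows.tolist())  # [5, 15]

# Chaining: filter the rows first, then pick one column from the result.
chained = df[df["column_name"] > 10]["column_name"]
print(chained.tolist())  # [15]
```

The inclusive end in df.loc[0:1] differs from ordinary Python slicing and from iloc, where the end position is excluded.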
So, I want to go through some basic data frame functions now that we've seen how to access data. Let's look at some other convenient functions. Some of them are going to be really familiar from series. They're going to be exactly the same, like head, tail, sort, and isnull, except now in multiple dimensions. Some of them are going to be pretty unique, but others are basically multi-dimensional extensions of the functions we've already seen for series. So let me walk you through some examples. The first couple we've already seen, which are head and tail. Just like in series, we have head and tail functions, which show us the first few or last few rows of the data frame. Remember, if you don't put anything in the head or tail argument, if you leave it blank, it will by default return the first five or the last five rows. Now why is that helpful? It's actually really helpful. Often when we load in a data frame, the first thing we do is run a head or a tail, so we can sanity check what we just loaded without having to print out the entire data frame. Printing out the entire data frame is usually inefficient, and it's going to get truncated anyway. So a very standard thing to do is to print out just the head or just the tail to get that view of the first five or last five rows. We're going to see examples of that, and we're going to use head and tail quite a bit as we move along. As we load in data, that will usually be the first thing we do. Yes, it's going to include the header. 
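A quick sketch of head and tail on a made-up 20-row data frame. The column names row_id and value are hypothetical and stand in for any loaded data set.

```python
import pandas as pd

# Made-up data: 20 rows, so the default five-row views are easy to see.
df = pd.DataFrame({
    "row_id": list(range(20)),
    "value": [i * 1.5 for i in range(20)],
})

first_five = df.head()   # no argument: first five rows
last_five = df.tail()    # no argument: last five rows
first_three = df.head(3) # explicit row count

print(first_five)
print(last_five)
print(first_three)
```

Each result is itself a data frame, so the column header is printed along with the selected rows.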
Let me show you an example. If we go back to the data we loaded earlier, the house prices, let's print out the head. So, df.head. It shows you the header, but it just shows you the first five rows. See how it's just the first five rows, but we can see all of the contents of the data pretty nicely and succinctly. So it's a very useful thing to do when you load in data, to sanity check what we just loaded and what it looks like. And we could do tail. That will give us the last five rows in the data set. So we can see there are 4,600 rows, and this gives us the last five of them. We can sanity check that too, though usually we look at the first five. Okay, so head and tail. Now, Roberto, you were asking about a summary earlier. Here are a couple of really nice summary functions, two that we'll use quite a bit. One is called info, which gives us an information summary of the DataFrame, including all of the columns that exist, all of the data types, and whether any data is missing from those columns. So the df.info function, which you see used right here, is an incredibly useful function to give us a summary of what exists. Let me show you what that looks like on the example I just did. On that house price data set we looked at df.head, so now let's do df.info and print out the information summary. If you take a look at this summary, we have a total number of columns and a total number of rows. So 4,600 entries, that's 4,600 rows. We have a listing of all of those columns and how many non-null values they have.
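A rough sketch of what calling info looks like. The house price file from the session is not available here, so this uses a tiny stand-in DataFrame with made-up values, including one deliberately missing bedrooms entry.

```python
import io
import pandas as pd

# Stand-in data (made up); the real data set in the session has 4,600 rows.
df = pd.DataFrame({
    "date": ["2014-05-02", "2014-05-02", "2014-05-03"],
    "price": [313000.0, 2384000.0, 342000.0],
    "bedrooms": [3.0, None, 3.0],  # one missing value on purpose
})

# Prints the row count, each column's non-null count, and each column's dtype.
df.info()

# info() prints rather than returns, so pass a buffer to capture the text.
buf = io.StringIO()
df.info(buf=buf)
summary = buf.getvalue()
```

In this stand-in, bedrooms reports fewer non-null values than the number of entries, which is the signal to look for when scanning the summary for missing data.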
So you can quickly eyeball which columns are missing data, because we're expecting to have 4,600 entries in every column. Those are the rows, and this tells us how many were not null. So 4,600, that's a good number. That means we have the full amount of data and nothing is missing. That's pretty good. Not only that, but we get the data type. Anytime I see object, I'm thinking of that as a string. So the date is just a string of the date. The price is a float. Bedrooms is a float. Bathrooms is a float. squ
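The same check that is done by eye on the info summary can also be done programmatically. This is a sketch on made-up stand-in data, not the session's data set. count, isnull and dtypes are standard pandas, while the variable names are only illustrative.

```python
import pandas as pd

# Stand-in data (made up), with one missing price on purpose.
df = pd.DataFrame({
    "date": ["2014-05-02", "2014-05-02", "2014-05-03"],
    "price": [313000.0, None, 342000.0],
})

total_rows = len(df)          # the number of entries
non_null = df.count()         # non-null values per column
missing = df.isnull().sum()   # missing values per column

# Columns whose non-null count falls short of the row count.
incomplete = missing[missing > 0].index.tolist()

# Data type per column, as shown in the last column of the info summary.
dtypes = df.dtypes

print(total_rows)
print(non_null)
print(incomplete)
print(dtypes)
```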

Original Description

🔥Data Scientist Masters Program (Discount Code - YTBE15) - https://www.simplilearn.com/in/data-science-course?utm_campaign=PeoFKvDQPuw&utm_medium=Lives&utm_source=Youtube 🔥Partnership is with E&ICT of IIT Kanpur - Professional Certificate Course in Data Analytics and Generative AI (India Only) - https://www.simplilearn.com/iitk-professional-certificate-course-data-analytics?utm_campaign=PeoFKvDQPuw&utm_medium=Lives&utm_source=Youtube 🔥IITG - Professional Certificate Program in Data Analytics and Generative AI (India Only) - https://www.simplilearn.com/iitg-generative-ai-data-analytics-program?utm_campaign=PeoFKvDQPuw&utm_medium=Lives&utm_source=Youtube This video on Applied Data Science with Python Full Course 2026 by Simplilearn, we provide a complete guide to learning applied data science using Python with real-world use cases. This course focuses on applying data science concepts to solve business problems. You will learn key topics like data cleaning, data analysis, visualization, and machine learning. The video covers libraries such as NumPy, Pandas, Matplotlib, Seaborn, and Scikit-learn. You will also explore concepts like data preprocessing, feature engineering, and model evaluation. The course includes hands-on projects and real-world datasets to build practical skills. It is ideal for students, analysts, and professionals looking to apply data science in real scenarios. You will understand how data science is used in business intelligence and decision-making. This course also highlights career opportunities in data science and analytics roles. If you want practical experience in data science, this course is perfect. Watch this video to learn the complete applied data science roadmap with Python in 2026. Related Videos: ✅ 1. https://www.youtube.com/watch?v=mnkiYN6qikw ✅ 2. https://www.youtube.com/live/LGCZ-Fhm48c ✅ 3. https://www.youtube.com/watch?v=S8hG_NXDRz8 ✅ 4. https://www.youtube.com/watch?v=XTwiahmkc_0 ✅ 5. https://www.youtube.com/watch?v=Xhne0Zx