Your pandas questions answered! (webcast)
Key Takeaways
This webcast answers 45 viewer questions about pandas, the leading Python library for data analysis, exploration, and manipulation.
Full Transcript
All right, hello folks, we are live. Welcome to the pandas event, "Your pandas questions answered". Before I get into any logistics or an introduction, I just want to make sure you can see me and hear me. If you can, go ahead and type in the chat where you are in the world right now. Let's see: Germany, San Francisco, Miami, two from Germany, Finland, San Fran, New Delhi, Atlanta, Tel Aviv, Columbus, Spain, another Germany, Connecticut, Portugal, NYC, Minnesota. This is great, very cool. Feel free to continue to post. Thankfully no one, at least no one that is responding, is having problems. If you do run into technical problems at any time during the webcast, first try posting in chat to see if others are having the same problem. Most likely your connection is slow or something like that, but usually the solution with Crowdcast is just to refresh your browser or try another browser. People have told me that Firefox works really well; I'm using Chrome. It's really up to you, but that's usually going to be the solution if you have any technical issues.

Okay, so a couple of logistical items, then I'll introduce myself, and then we will get started. I posted a link in chat, which is of course buried now, so let me post it again right now. There it is. That is a GitHub gist, and it just has five lines of code, but they're lines to load example data sets. If you want to follow along with the code today, which is completely optional (you can just watch if you like), I'm going to be using some example data sets, and you might as well load them now so that when I start using one you're not stuck on not having the data set and having to figure out how to get it. So go to that URL, run the five lines of code in your Python editor of choice, and you'll be basically ready
to follow along with any code I use during the webcast. Okay, so that's one thing. The second thing is, if you have not already participated in the polls, right under the video there's a Polls tab. Go ahead and click on that and there are a couple of polls. I'd like to know a little bit more about the folks that are here. If you have some more time and want to, you can scroll down further and look through the questions, pick out some that are of interest to you, and go ahead and upvote them. I will be answering the questions roughly in order of popularity. I might get through all of them, I might not, I don't know. So you might as well pick out the questions that you are particularly interested in and have some influence on how I spend the time today. Okay, so those are just some logistics.

A little introduction about me: my name is Kevin Markham, and for those of you who don't know, I'm the founder of Data School. I teach data science in Python, predominantly online, and my goal is to make data science more accessible so that you can either launch or accelerate your data science career. That's my overall goal, and I hope that some of the resources I put together, like today's webcast, are helpful in making that happen. My background is in computer engineering, and I've spent a lot of time in the classroom teaching data science. Now I'm mostly doing it online, though I was in the classroom just yesterday in DC with some data science students at General Assembly, giving a guest lecture. I love Python, I love data science, I love teaching. Personally, I'm 36 years old, I live in Washington, DC, in the United States, have lived here about 11 years, and got married a couple of months ago. So those are some random things about me. That's it on the intro.

Just to talk a little bit about how it's going to work today: I'm going to
be answering questions, as you know. I'm going to go about an hour, maybe longer if I feel like it and there are still questions left, or until my voice gives out, which has been known to happen. What else? Just looking over at a few notes on things I want to mention. Okay, in terms of the chat: the chat you're posting in is good for commenting on something I'm talking about right now. So if I'm answering a question and you have a follow-up, go ahead and post it in chat. If you have a new question, something that I'm not talking about at that moment, use the big orange button underneath the video that says "Ask a question". Click that button, type in your question, and hit enter. That way I can get to the question separately from what I'm talking about right then. It's actually very hard to talk and read chat messages at the same time, but I will do my best to pay attention to the chat as well. It may be challenging.

What else? This is being recorded, and all the questions will be time-coded, so you can come back to this recording later and just click a button that says "View answer". It will jump to the point in the video when I answered it, which will save you some time if you just want to watch that one thing. I'm not going to spend a huge amount of time on any one question, because there are a lot of questions, so this isn't like a 20-minute tutorial on something. I reserve the right to skip really complicated questions or questions I don't know the answer to, because I don't know everything in pandas. I know a lot about pandas and I teach pandas, but that doesn't mean I know everything. So I may skip things, and I might not have a good answer for every question, but I will do my best.

Okay, so with that, let's just take a quick look at the poll results and see who's here. I just clicked on
polls, and it looks like, in terms of Python experience, it is almost all beginner and intermediate, only one advanced, and almost nobody who reports having no experience, just a few folks. For pandas experience level, no one says they're advanced; a lot of beginner and intermediate. But all levels are welcome here. And for "Have you watched my video series on pandas?", the most popular answer is "I've watched all 30 videos", which I cannot believe. So thank you for spending all that time with me. That's like seven hours of video, unless you watch me sped up, which is fine too. By the way, for the video series, there's a green button just below the video that says "Watch the pandas video series". You can click that and check it out after the webcast.

Okay, we're nine minutes in and I want to get to questions, so let us jump in. Oh, and I need to share my screen, so that will just take me a moment. All right, and let me minimize my face so you can actually read what's on screen. That's the gist that I was talking about. I'm actually going to be in this IPython notebook, which is not something I've posted, but that's the environment I'm going to be using here, and you can use what you like at home. One more second, and let me jump right in with the first question.

Question from Eric Chen, actually one of my students: "Can you show the best way to use groupby and pivot function on a pandas data frame?" Okay, the best way. All right, well, I'll interpret that how I like. Let's see, what data set? I've loaded my data sets here, and I'm going to use the drinks data set for this one. I'm just showing the head of this data set so you have a bit of a picture of it. So how might we use a groupby? Well, a groupby is for answering a question like: for each category, summarize some numeric column for that category. Okay, so one thing you might do with drinks is a group by.
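As a rough sketch of the idea just described (for each category, summarize a numeric column), here is a groupby on a tiny made-up stand-in for the drinks data. The column names `beer_servings` and `continent` are assumed to match the dataset used in the webcast, and the values are invented for illustration.

```python
import pandas as pd

# Made-up stand-in for the drinks dataset (values are illustrative only).
drinks = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "beer_servings": [100, 200, 50, 150],
    "continent": ["Europe", "Europe", "Asia", "Asia"],
})

# For each continent, compute the mean of one numeric column.
# The result is a Series indexed by continent.
mean_beer = drinks.groupby("continent")["beer_servings"].mean()
print(mean_beer)
```

Swapping `.mean()` for another aggregation such as `.max()` answers a different "for each continent" question with the same pattern.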
The only column to group by that would make sense in this context is continent. I won't say too much about the data sets, other than that each row is a country with its alcohol consumption per year, and this column contains the continent. So I group by continent, select, for instance, the beer_servings series, and check the mean. What I've done here with a groupby is to say: for each continent, what is the mean of the beer_servings series? It outputs a series that has the name of the continent and its mean beer servings. So the answer for when you use this is: when you have a question that you could phrase as "for each continent" or "for each gender", I want to know something about a group; I want to summarize a group in some way. What's the max beer servings for each continent? You just have an aggregation function here at the end, and that's the usage of groupby.

Follow-up question: how to get multiple columns at once? I think this will provide your answer, and then we can talk about it: if you leave out the column, it will do that operation on all of the columns. So what is the max beer servings, spirit servings, wine servings, total litres, and country for each continent? For the country column, it just uses the last one alphabetically. The second part of that question is about the pivot function. I might talk about that a little bit later, but that's probably more the subject of a full tutorial. Okay, pardon me one second.

All right, next question: "Can you explain the differences and when to use append versus concat, merge versus join, pivot versus pivot_table versus crosstab, and stack versus unstack?" Okay, Obby here has done something very clever, which is to put four
questions in the same question, each of which could take a while. But it's a good question, so I'm going to give you short answers, because teaching all of these could take the whole hour.

Append versus concat: concat is a top-level function, meaning you would say pd.concat, and append is a DataFrame method, so you would say dataframe_name.append. concat is more general and has more capabilities; append is more narrow. So I actually never use append, I only use concat. It's for concatenating: basically you've got two data frames, either with the same columns and you want to stack them on top of one another, or with the same rows and you want to put them next to one another. That's what you use concat for. So the short answer is I never use append, I always use concat.

Second, merge versus join: kind of the same. merge is a top-level function, so you use pd.merge. join, I think, is just a DataFrame method, so you call it on the data frame itself.
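A minimal sketch of the two concat cases described above, using small invented data frames (the names `top`, `bottom`, and `extra` are hypothetical). Note that in recent pandas releases (2.0 and later) `DataFrame.append` has been removed, so `pd.concat` is the way to do this there.

```python
import pandas as pd

top = pd.DataFrame({"name": ["A", "B"], "score": [1, 2]})
bottom = pd.DataFrame({"name": ["C", "D"], "score": [3, 4]})

# Same columns: stack the rows on top of one another (axis=0 is the default).
# ignore_index=True renumbers the rows 0..3 instead of repeating 0, 1, 0, 1.
stacked = pd.concat([top, bottom], ignore_index=True)

# Same rows: put the columns next to one another.
extra = pd.DataFrame({"grade": ["x", "y"]})
side_by_side = pd.concat([top, extra], axis=1)

print(stacked)
print(side_by_side)
```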
So that's join, and just like before, merge is a more general function, with, I think, more power than join. I don't think you ever need to use join, and I would recommend you always use merge. I never use join. I don't think there's a compelling use case for join; maybe there is, but I just recommend using merge.

Number three, pivot versus pivot_table versus crosstab: I would say that pivot_table is the most general, and if you're going to learn one of those, I would invest the most time in pivot_table. It has the most options. I think you can accomplish all the same things with pivot_table that you could with pivot and crosstab, and even more, so I would stick with pivot_table.

Number four, stack versus unstack: the way I use unstack is when I have a multi-index and I want to go to a regular index. Let me get a good example for this so that you can see something. So if I do drinks.groupby('continent').beer_servings.
describe(), okay, that gives you a hierarchical index. It's a series, but it has a hierarchical index. Now, if you want to turn this into a single-level index, you use unstack, and it kind of moves the data around. I don't know how to better describe it: it changes the shape of the data so it has a single-level index. stack is the opposite, and it goes back from a single-level index to this multi-index, or hierarchical index. So that's one example of, at a conceptual level, what stack and unstack do. Beyond that, I don't know what I want to say about that. They're just reverses, opposites of one another. Okay.

All right, next question, from Raphael: "Besides Series and DataFrame, I'd like to see some examples using panel data: how to move from DataFrame to panel data and vice versa." Okay, you guys and girls are starting with the more complicated questions. So, panel data: let me come up with a quick example so that we can have a visual, and then I'll comment on panel data. I'm going to create a data frame, so I'm going to pass it a dictionary and add a couple of columns. What I'm going to do is create the data frame, create a hierarchical index, and then show you how to turn it into a panel and back into a data frame. So let's say we have some names. I want to save on typing. I wouldn't bother following along with me on this one if you're following along, because it's going to be more than a little bit of typing. So we've got some names, which are maybe people's names. These are days: one, two, one, two, and you'll see how this all makes sense in a second. Oh, actually I need quotes around that, because these are column names. We'll say weight, so these are two people measuring their weights and
heights. So we'll say 100, 103, and 131, 129. I think this will work. And maybe height: how about this person grew an inch in one day and this person did not. Okay, sorry for all that typing. Here's our data frame. It's two people, A and B, recording, for day one or day two, what their height was that day and what their weight was that day. So I'll set a multi-index with df.set_index. I think in the videos I've only shown setting a single index, but you can actually set a multi-index. It would make the most sense to do name and day, and now we have a hierarchical index. What you're looking at here is the natural grouping in the data. Using just A or B as the index, you've got these duplicate days; if you use day as the index, you've got duplicate names. What you really want, the logical grouping in the data, is a person and then the day: for this person, here's the height and weight on day one, here's their height and weight on day two, and the same thing for B. So this makes more conceptual sense. This is a hierarchical index, and you can turn it into a panel with this method, to_panel. And you get something that I don't really understand. I gather that panels are from econometrics, and I don't work in econometrics, so I don't really understand the major axis, minor axis kind of thing. But it's three dimensions, so that's why it's called three-dimensional data, and it's got an items axis, a major axis, and a minor axis. To get from a panel back to a data frame, you use this to_frame method. It's really just representing the same thing as this hierarchically indexed data frame, but in a different way, and perhaps you slice it differently, or there are different operations. I think there are a lot
more operations for data frames with hierarchical indices, so I'm not actually sure if you gain anything by converting to panels, and it's possible that in the future panels may just disappear. So I don't necessarily recommend investing a lot of time learning panels unless it makes sense for you. For this kind of structured data, I would just use a hierarchical index. Okay.

All right, next question, from Wolfgang, actually also one of my students: "What is the most efficient way to identify outliers in a data frame?" Okay, outliers. I think the best answer I can give is that first you need to decide what counts as an outlier. You could use some sort of statistical definition, but that may not capture what an outlier means to you. So I think the first thing is deciding what an outlier looks like. One method would be to write a bunch of hard-coded rules or filters to filter the data frame, looking for those particular cases. I know that sounds very manual, but it's a manual approach, not a machine learning approach. Another way you might do it is with a visualization. Let me see what data set I want to use. How about the drinks data frame: still this data frame of countries with their alcohol consumption amounts. Let's just use plot, and let's use a box plot. Whoops, there's one thing I missed, which is that if you want plots to appear in the notebook, you should use %matplotlib inline. Let's run that again and we'll see our box plot. All of these plus signs are outliers according to the statistical definition. Now, this doesn't allow you to see what those observations are. It allows you to see that for spirit servings there are some outliers, but I don't know if that's really useful to you. So there's no catch-all way to do outlier detection in pandas. I would tend to use
a machine learning approach, or I would tend to just explore the data, hard-code some rules for what counts as an outlier, and maybe write some filter conditions for that. Okay.

All right, next question: "How do the pandas data structures (Series, DataFrame, and Panel) compare to R data frames? Is there any difference in ease of manipulation or efficiency between the two?" Okay, let's think here. Let's start with data frames. pandas data frames and R data frames are roughly similar. The main thing I remember thinking about when I learned pandas after I learned R is that in an R data frame, row names are, I think, looked down upon: you're discouraged from storing useful data in the row names. The idea in R is that you want to store all the useful data as columns. Now, pandas data frames have an index, and that index kind of looks like row names, but in pandas the index is required. There's no way not to have an index, and no one will ever discourage you from storing useful data in the index. In fact, putting something useful and ideally unique in the index is a good way to use it, because it allows you to reference those rows using the row names. So that's probably the biggest difference I noticed. Another difference is that in R there's a difference between a missing value and not a number, and in pandas, because of NumPy, there is no difference between a missing value and not a number. They're the same thing. So those are the data frame differences. There are probably more, but those are the ones that come to mind. A pandas Series is kind of like a vector in R, and I don't think there's any big difference there. Well, I guess Series have an index and vectors don't. I don't write much R anymore, so I haven't thought about this in a while.
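To make the two pandas-side points above concrete, here is a small sketch: a useful, unique column stored in the index so rows can be referenced by label, and a missing value represented as NaN. The data frame is invented for illustration, and the column names are assumptions rather than the webcast's actual data.

```python
import numpy as np
import pandas as pd

# Invented values; np.nan marks a missing entry.
drinks = pd.DataFrame({
    "country": ["Albania", "Algeria", "Andorra"],
    "beer_servings": [89.0, 25.0, np.nan],
})

# Store something useful and unique in the index ...
by_country = drinks.set_index("country")

# ... so that a row can be referenced by its label.
algeria_beer = by_country.loc["Algeria", "beer_servings"]

# The missing value shows up as NaN and is detected with isna().
n_missing = by_country["beer_servings"].isna().sum()

print(algeria_beer, n_missing)
```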
Panel data: I don't know if there's something like panel data in R. If someone in the chat knows, feel free to chime in, but I'm not aware of something like panel data in R. It probably does exist, though. Okay, great question.

All right, next question is from tox ads, which is how I'll say your name or moniker: "Hi, for filtering rows, I saw you sometimes use the syntax df[row filter criteria] and sometimes use the syntax df.loc[row filter criteria, :]. Is there any difference in performance, or some other reason why you should use the different approaches?" Okay, I like this one. Let me show you what he's talking about first, in case you're not familiar. Let's use movies. So here's something I did in the video series: I only want to look for movies with a duration greater than or equal to 200. That is one way to do it, and that is the entire data frame. All I did is pass this condition into this bracket notation, and it filters the data frame. Now the alternative that is identical is this, and I'll just copy-paste for ease. We'll do movies.loc, and I will explicitly put in the comma and the colon.
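The two equivalent filters described above can be sketched as follows, on a tiny invented stand-in for the movies data. The column names `genre` and `duration` are taken from the discussion; `title` and all values are made up.

```python
import pandas as pd

# Invented stand-in for the movies dataset.
movies = pd.DataFrame({
    "title": ["Short Film", "Long Epic", "Longer Epic"],
    "genre": ["Comedy", "Drama", "Adventure"],
    "duration": [95, 200, 238],
})

# Bracket notation: pass the boolean condition directly.
bracket_result = movies[movies["duration"] >= 200]

# .loc with an explicit row condition, a comma, and a colon for "all columns".
loc_result = movies.loc[movies["duration"] >= 200, :]

print(bracket_result.equals(loc_result))
```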
So, loc: it's used for selecting rows and columns. This is a condition that says what rows I want, and this says what columns I want, and it turns out that a colon means all. So this is: these rows, all columns. I could instead say I only want the genre column (I may have to put it in brackets, I can't remember). So I could just get that column, or I could get all columns. The question was why I do one versus the other. Well, I teach the first one because it's what most people use, so you need to understand that code, recognize it, and know what's going on. I teach the second one because loc is super powerful and you need to know how to use loc. I guess in an ideal world everyone would write their code like this, but you know programmers, or you might be one. This is a tiny bit harder to read, it's got more extraneous stuff, and the first one just looks cleaner. So this one's more explicit, the first one's cleaner, and you can really use either. The answer is no, I don't use it for performance reasons, and I'm pretty sure they have the same performance, but you could check.

The one exception, and here's an alternative: if I just want to select the genre column. I know you can't tell, but both of these output the same result. Here I'm getting the data frame and then selecting genre; I'm doing the same thing here. The thing is, I would rarely do the first in this case. I would always do the second. Now why is that? That has to do with the SettingWithCopyWarning that you'll see on occasion if you try to assign something here. That's because this first line is actually two operations, whereas this second line is one operation under the hood. So there is some efficiency to the second operation, but the first operation can
confuse pandas if you are trying to do an assignment. So the bottom line is, if you're going to filter by a condition and select out a column, I highly recommend using loc. If you want to understand this in more depth, I've got a video about the SettingWithCopyWarning. Okay.

Oh, in chat, Francisco asked what's the difference between R and pandas in terms of the language. I'm not sure I understand the question. The structures are similar but not identical, and they have different syntax, so it's essentially learning a different language. That's just the bottom line.

Okay, next question: "What about the jobs for pandas in Bangalore?" Well, first I'll say I have no idea about jobs in Bangalore, or in DC for that matter. I'm not looking for one, so I haven't thought much about it. But here's how I would rephrase that question, if you'll allow me, reesh: how do I get a job using pandas? And I think the secret here (maybe it's not a secret, it's an open secret) is that companies don't hire you because you know a language. Companies hire you because they think you can solve their problems. So if you want to get a job with pandas, or any skill for that matter, or a data scientist job, what you want to demonstrate is not necessarily fluency with a language or a library. Those things are useful, but what you really want to demonstrate fluency with is problem solving. You want to convince the company that you know how to solve problems. So if you're looking for a job that uses pandas, I would build a public portfolio on GitHub, maybe also a blog, where you start with a question and a data set, answer that question using pandas, and write about it. That way they know you know how to use pandas, they know you know how to solve problems, and they know you know how to
communicate. All of those things are just as important as the technical skills. So that's what I'll say about getting a job using pandas.

Okay, on to the next question, from Jeff: "How do you clean/preprocess/tokenize text in a data frame without appending it to one long list, so that it is a series of lists?" I appreciate the question. I stared at it for a while and I'm not sure I understand. I don't understand what you are currently creating and what you want to create, which are the key aspects for me in helping you troubleshoot. So I don't know a specific answer to give you here. I will say that if I'm going to clean text, if it's strings, I'll use pandas string methods. There are tons of them, and I highly recommend learning pandas string methods. Tokenization I actually don't ever do in pandas that I can think of, maybe some trivial tokenization, but I would almost always do that in scikit-learn using, say, CountVectorizer. You can always do something in scikit-learn and then take the NumPy output and put it back in pandas, so that might be a solution. But ultimately it's hard for me to say without spending some time understanding this problem in more depth and what you're trying to do, so my apologies.

Okay, next question, from Gus: "What is your favorite way to style a data frame on the web when exporting to HTML?" Well, data frames do have a method, I think it's called to_html, that will output HTML. I've honestly never used it myself, so I don't have any ideas on styling data frames. However, my overall recommendation is just that I put things in Jupyter notebooks. They are great for reproducible data analysis and data science. So if it works for what you're trying to do and the audience you're trying to communicate to, I would
recommend putting things in Jupyter notebooks, and that way they can see the code you used to get your output data frame. Maybe that won't work for your particular problem, but that's what I would tend to do. I'm sorry I don't have a better answer there.

All right, let's see. Oh, okay, Jeff followed up with: "Well, I can iterate over this data frame and append it to one long list, but what if I want a list of lists that is cleaned?" I still don't quite understand. I'd have to see an example. If you write me a small piece of sample code and show me what you have and then show me what you want, I can help, but other than that I'm just not super clear on exactly what you're looking for.

Let's see, there have been some other questions that have come up in chat, and I'll just encourage you to click the orange button that says "Ask a question" and put them there. anat asks what's the difference between pandas and NumPy. Adrian asked what the apply functions from R are in pandas. Someone asks whether pandas and R use the same commands. If you want me to answer them, go ahead and put each one in as a question: click the orange button and type it there, because that way I can time-code it and keep track, and people can upvote it if they're interested. That's really what I would recommend.

Okay, back to the list, top question. All of these I'm answering right now have one vote, so you can get your question right to the top by posting it and hoping someone upvotes. Next question: "What is the best way to get value counts, value counts normalized, along with row, column, and grand totals in a single data frame?" Okay, Obby. All right, let's write some code here. Value counts. Maybe I'll use drinks, how about that. All right: drinks.continent.
value_counts(). Okay, here is value counts. If you want value counts normalized, do that as well: I think there is normalize=True. So those are the normalized value counts. The first is essentially a tally of how many of those entries exist, and this is the normalized version: what percentage are Africa, what percentage are Europe. Row, column, and grand totals: if you're looking for a row and column sum, you could do something like this: drinks.sum(axis=0) and drinks.sum(axis=1). So those are totals. This is a sum over axis zero, so it's essentially what I would call the column totals, but some people might call them row totals. I'm not sure what the proper terminology is. These are the inverse: this is moving across from left to right, taking the sums of those numbers. So that's how to get those pieces of data separately. How do you get them all in a single data frame? I think the answer is you don't, because here's how pandas and Excel are very different. In Excel you can just put stuff all over a sheet and it doesn't have to be connected. You could have a chart in the upper left corner, one data set on the right, another data set below it, and they can all interact. pandas is not like that. A data frame has an index, and that index is the identifier of the row, and all of the rows have to conceptually mean the same thing. So when you ask for all of these things to be in the same data frame, my answer is you wouldn't put them in the same data frame. Now, you could use concat to put these two things in the same data frame, and maybe I'll do that right now. I feel like I haven't been writing as much code as all of you are expecting, so I will try to write a bit more code. So let's just save these as a and b, and we'll do pd.concat, and we'll
pass it a list of objects to concatenate, and we'll say axis=1 to put them next to one another. And here are those two things next to one another, but I wouldn't know how to get these other things all in the same data frame. Just to hammer home the difference between the axes in terms of concatenation: axis=0 would stack them, and axis=1 puts them left and right, next to one another. Okay.

All right, great, next question. Someone followed my advice and people are voting on questions. Love it. isaro asked: "What's the best way to convert non-numerical data from multiple columns in a data frame to numerical form for machine learning purposes?" Okay, so I've got two things I'm immediately thinking of, and let's see how I can demonstrate them. Here's our drinks data set. Let's look at drinks.dtypes, and all my numeric columns are already numeric. But let's pretend that the beer servings column is actually a string, and let's store it as, here. Okay, now we've got this column that looks like a number, but it's actually a string. So let's take a look at the dtypes, and it's actually object type, which is because it's a string. What we actually need is for the beer column to be an integer, so I would just say drinks.
beer.astype(int), or astype(float), your choice, and you will get the numeric version. Now, I don't know if that's exactly what you were asking for. Oh, OK, you asked about multiple columns in a DataFrame. If I needed to do this on a bunch of columns, I would probably do it one by one so that I could make sure I'm getting the results I want. You can use the apply method with the top-level function to_numeric, but that can do some unexpected things. It can error out for various reasons. You can tell it to ignore errors, but there are some downsides to that. There is some discussion in pandas, I've seen it on GitHub, about how to make this kind of stuff a little easier. There are different functions for doing this, but none of them is the one function to rule them all. In other words, there's not one recommended way that makes all the other functions useless. It's a patchwork of different approaches you could use, and I have a feeling that at some point in the future they are going to settle on a better way. At the moment, you can research ways to do it all at once, but I would tend to do it one column at a time.

Here's my other answer. If you're talking about categorical data, and you want to take categories and put them to use in a machine learning model, I recommend using dummy variables. I've got a whole video about that: check out the videos and look for "dummy", and you'll see how to make those conversions. That's more than I want to get into during the webcast. Great question, though.

All right, Tokes Ads... whoops, something else just jumped to the top, sorry. OK, JCV: I use the VLOOKUP function in Excel all the time. Is there a way to do this in pandas? The answer is yes. Let me think. I made a note to myself on what data set I wanted to use to answer this, so let me find that. Oh, OK, I've got it.
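Before the VLOOKUP answer, here is a minimal sketch of the conversion approaches from the previous answer. It is not the code typed during the webcast: the small DataFrame below is a hypothetical stand-in for the drinks data, and the column names beer and wine are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the drinks data: numbers stored as strings
drinks = pd.DataFrame({
    "country": ["Albania", "Algeria", "Andorra"],
    "beer": ["89", "25", "245"],
    "wine": ["54", "14", "n/a"],   # one value that is not a number
    "continent": ["Europe", "Africa", "Europe"],
})

# One column at a time: astype raises an error if any value cannot be parsed
drinks["beer"] = drinks["beer"].astype(int)

# Several columns at once: apply the top-level pd.to_numeric to each column.
# errors="coerce" turns unparseable values into NaN instead of raising.
cols = ["wine"]
drinks[cols] = drinks[cols].apply(pd.to_numeric, errors="coerce")

# Categorical text: one indicator (dummy) column per category
dummies = pd.get_dummies(drinks["continent"], prefix="continent")

print(drinks.dtypes)
print(dummies.head())
```

Converting one column at a time with astype fails loudly on bad values, which matches the advice above about checking each result. The to_numeric route with errors="coerce" is more forgiving, at the cost of silently introducing missing values.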
OK, I think this will help. I wrote down that I want to use the movies data set, and I remember why. Here's the movies data set. So, VLOOKUP: if you don't know what it is, I don't know how to succinctly describe it. I'll show you how to do it in pandas, and then even if you have no idea what VLOOKUP is, you'll know how to do conceptually the same thing. So I will skip trying to explain VLOOKUP without having Excel open.

Let's do a value_counts of content rating: content_rating.value_counts(). Here are the different content ratings. Here's the kind of problem that VLOOKUP would solve in Excel. I want to define a mapping, and I'll make the mapping with a dictionary and just call it mapping. If a movie has a content rating of R, I want to map that to no, as in no kids will be at this movie. If it has a content rating of PG-13, I'll say yes, kids will be at this movie. PG, we will say, is also yes. What I want to be able to do with the VLOOKUP is: every time there's an R, I want to add a column value that says no, and every time there's a PG-13 or a PG, I want a yes. That's conceptually what VLOOKUP is doing.

So how would we do that? It's actually pretty simple. If I remember how I wanted to do this, I think we are just going to use map and pass it mapping. Yeah, OK, so that actually did it. Let me add this as a new column so you can... well, whatever. The point is, all I did with this Series map method is tell it: here are the source values, if you will, and here's what to substitute for each. R maps to no, PG-13 maps to yes, PG maps to yes. Thus we should get no, no, no, yes, no, and indeed we get no, no, no, yes, no. And I get some NaNs
because I didn't define all of the different possible mappings. I didn't define G, etc. You know, G would be yes. So I didn't define all those mappings, but if you had a complete chart, this map method would result in a Series of nos and yeses.

All right, let us move on to the next one. I'm a total newbie to pandas; how can I best prepare for the webcast? Well, thanks, Ray, for the question. Obviously the webcast is already happening, so I can't give you a good answer there, but I can tell you that after the webcast, if you're a total newbie to pandas, I strongly recommend my pandas video series. I put a huge amount of effort into teaching pandas from the ground up, and if you watch the series you should get a strong foundation in pandas.

Tokes Ads: I would like to get something similar to a cross-tabulation, but with my own master rows and master columns. My data... OK, I won't read all this out. It's really hard for me to picture this without seeing it. I could maybe do this if I had an exact example along with "here's what I want", but I don't think I'm going to be able to come up with an answer on the fly. I would say maybe a pivot table, but I'm not sure without really seeing exactly what you're looking for. Sorry to disappoint; some questions are just too complex to answer on the fly.

Next question: what is FutureWarning in pandas, and why is there a function to avoid it? Boy, there are a lot of questions left. I may speed up and see what I can get through. What is a FutureWarning? A FutureWarning is when pandas tells you that something you're using is not going to be supported in a future version of pandas: it's been deprecated, or it's going to change how it works in a future version. As such, you should probably rewrite your code to not use it, because eventually your code is going to break. So that's the
simple answer: it's a warning, not an error, but in a future version it will become an error, or it will not work the way it used to. Why is there a function to avoid it, Newman asks. I don't know the particular function, but I know that some people like to suppress warnings. Why? Because they don't like to be reminded that their code will one day break, maybe. I don't recommend turning off warnings, because it's better to just rewrite your code.

All right, there's a follow-up question from the last one about the mapping: what if the mapping was in another DataFrame? Let's see if I can translate this question. What if it was like this, I think he's saying. I'll just call it mapping2; I'm not being creative here. Let's see what will happen if I pass this into the DataFrame constructor. That will not give me what I want. I would need to change the shape. I would need to say something like ratings: R, PG-13, to start with, and then kids: no, yes. OK, so I may have gotten this right this time. So you're asking, what if I had mapping2? How would I use that to do my mapping? Well, let's see what happens if I cast mapping2 to a Series: pd.Series(mapping2). Will that do anything? No, not exactly what I wanted. I think the bottom line is, number one, I might try to translate mapping2 into a regular Python dictionary with this column as the keys and this column as the values. Alternatively, I might do a set_index on ratings and then do a merge. Would that work? I think I would do a merge between this DataFrame and movies, and I would tell it what to join on. Well, I could try it right now: pd.
merge, and I don't remember all the parameters. movies, mapping2, left_on equals content_rating, right_on equals ratings... no, on the index, actually. Oh, I didn't set the index in place, so actually I will say right_on; I can't remember if this will work. OK, let's see if it worked. Oh, great: kids, no. Where are my non-R movies? Oh, OK, yes, it did work, or at least it partially worked. I don't know if there are any NaNs left, but the point is that using a merge I could map my Rs to no and my PG-13s to yes. I wouldn't actually set the index, and I didn't do it in place anyway, which is why it didn't change. So here's what I'm actually merging on. That's probably what I would do.

OK, back to the list of questions. Wow, 27 questions left. I'm going to keep going. I don't know when my voice is going to run out, but keep voting, because it will affect which ones I get to. I'm pretty sure I will not get to 27 more questions.

From Mr B: I want to remove the last 30 rows of my DataFrame. I tried playing around with iloc and drop, but I couldn't get it to work. All right, let's use a different DataFrame. Let's use ufo. OK, ufo.
shape shows about 18,000 rows, and Mr B wants to remove the last 30. What I would do is use iloc, and iloc is about integer position. So what rows do I want? I want all rows up to 30 from the end, I think that should work, and all columns. Did I do that backwards? Let's see. Sorry, my computer is slowing down here. Yeah, I think that worked. Let me just confirm with shape. All I did is say: integer positions, everything up to 30 from the end. This is like list slicing. Say x is a list of 100 numbers built from range(100). If I want to get all except the last 30, I would just slice up to -30, and let me print that so you can see it. That's everything except the last 30. I'm doing the same thing with iloc: I want all the rows except for the last 30, and then I want all columns. And it removes those 30 rows.

From Gok Tug: I want to create a column for year-month from the datetime column. For example, if the datetime is 2016-07-06 15:59:56.19, I want to map 201607 into a column. Great question. Let's use the ufo data. The Time column is a datetime because I defined that during the import. Because of that, I can do things like ufo.Time.dt.year, and I get the year out. I'm going to coerce that to a string, and I'll store it as a new column called year. Then I'm going to get the month: ufo.Time.dt.month. dt is how you access the datetime options. That looks good, except I want some padding here, and I know there is a padding function. I think I need to first change it to a str, and then I can do dot... is it pad? Does that work? All right, let's figure this out: pandas API, str,
padding is what I want. Oh, it is pad. OK: pad, fillchar equals zero. Oops, did something wrong. Sorry, my computer's being slow. All right: width equals 2, fillchar equals zero. Oh, OK, this is a string method. Now I've got it. Here's what I did, and I know it's a long line of code: I pulled out the Time Series, then I pulled out the month attribute, I coerced it to a string, and then I used the string method pad with width=2. And for the padding, I don't want to pad with a space, I want to pad with a zero.

Montage asks: for datetime, can we use strftime? I don't know; I'm not sure what that is. I try to do as much as possible in pandas, because pandas is so powerful that you don't end up having to use apply functions and write for loops and if statements. You can get it all done in one line of code, so that's how I tend to do it, but there's always a different way to do things.

So I store that as the ufo month column. How do I finally put this all together and actually answer the question? ufo.year.str dot... what is the combine method? Is it concat? I don't always remember all of these pandas functions. No, it's not concat; it is cat, str.cat. OK: str.cat(ufo.month, sep='-'). I think that'll work, and look at that. All right, problem solved. I took the year and said: concatenate that with the month and separate them with a dash. You could store this as a new column. OK, and ufo.
head, and there you go: here's the time, and here's the new column. There's probably a slicker way to do this, but I don't know it, so feel free to comment on the question. Al said you might be able to accomplish this with .to_period. I'm not familiar with that, but it's certainly possible.

OK, I like this question: what's the best way to create a new column, or multiple columns, based on pre-existing columns in a nested for loop? I'm not going to read the rest of this. If you're asking how to do this with a for loop, the answer is usually: don't do this with a for loop. You rarely need to use a for loop in pandas. Let me think about what I can use to give an example. All right, let's use... sorry, where are we here... OK. Here's the kind of thing that at2K, I think, is asking about. He's trying to make a column that's like: if Sex is male and Age is less than 30, then the column should say "young male" or something. That's conceptually what he's doing. He's saying: if this is something and that is something, then put whatever, otherwise put something else. So let's break this down. The bottom line is you don't need a for loop. You're going to do this in one line of code, and it's actually not that hard. Let's say one of the two conditions is train.Sex == 'male', and you'll see that outputs a Series of Trues and Falses. Let me make this a little smaller so there's more space down here. OK, so one of the conditions is Sex equals male, and the other condition is train.
Age is less than 30. What I've done is say this condition and this condition; that's what the ampersand means. You want to combine those two conditions such that the end result is True only if both of those conditions are met. Let's focus on the first three results: True, False, False. The first is True because this is a male younger than 30. The second is not a male, so that doesn't work. The third is not a male, and even though they are under 30, the result is False because of the ampersand. So I've defined two conditions and said: if both are true, output True, otherwise output False.

So how do I turn this into a new column? How about we use map, and we map True to... oh, actually, I need a dictionary here. Map True to yes and False to no, and we'll store that as a new column of train called young male. Let's look at the head again, and look at that: we've got a new young male column that is yes if it's a male below 30 and no otherwise. Here's a male who was 35, so it said no. So you don't have to write an if statement or an else, and you don't have to write a loop. You just need to write conditions that output Trues and Falses, and you can map those Trues and Falses to a variety of different outputs. You could map True to one and False to zero, and that would also work. Whoops, I typed an exclamation point. True becomes one, False becomes zero; that would also work. So there's a lot you can do with just conditions.

All right, Graff Graphos, I'm sorry, I'm sure I mispronounced your name: I've joined this to see if it is worth changing from R to Python. Oh boy, that is a complicated question. It depends on your task, but I personally like Python better, and that's because of what I do in Python. I work with text a lot, and I don't like working with text in R. I do a lot of machine learning, and I prefer machine learning in Python. As far as data munging, data cleaning, and data analysis, I'm fine with both: pandas, and if I'm
doing it in R, I would use dplyr. pandas and dplyr are different, and they use somewhat different syntax. There definitely is a lot I like about dplyr. But should you change to Python? I have a couple of beliefs. One is that you should learn one language really well before you move on to the next. Number two is that there's a high cognitive cost to learning a new language and all of the different packages you need to know to do what you used to do in the old language. That's an argument for investing more in the language you already know. However, if you work somewhere that uses a lot of Python, that's something to take into account. If there are conferences in your area of interest that seem to focus on Python, and they don't have an R conference in that area, you might want to think about learning Python for that reason. There's really not a right answer. Both Python and R are great languages for data science, but if you have a very particular task, there might be some compelling reason to use one or the other. Both are popular, at least for data science. Python is more popular than R for general-purpose applications, so if you're building a web app off of it, it might be better to learn Python. But there's not a strong answer one way or the other that I can give you.

I cannot believe I haven't even answered half the questions, but maybe the ones below are easier. Manes asks how to find the top repeating number from an array, with its frequency. That is an easier one, thank goodness, so I can answer it quickly. I'm going to use train, and I'm going to use the value_counts method. So if I do train.Pclass.
value_counts(), I get the number of rows for each value of Pclass: there were 491 threes, 216 ones, and 184 twos. And that's just a column from the DataFrame. Similarly, if it was Sex, it would tell me that in the data set there were 577 males and 314 females.

Next one, from Oyin: how can I fillna based on a condition? Say I want to fillna for all missing cities in the UFO data set, but only if the color is red. OK, so this is a combination of some of the things we've seen before. We've got the ufo data set, and sometimes the city is null. ufo.City.isnull().sum() shows there are 25 values of City that are null. So how do I fill those, but only if the color is red? You would do something like this: ufo.loc, and what rows do I want to find? I want the rows where ufo.City.isnull(), and where ufo['Colors Reported'] equals red. Let me look at that; sorry, my mouse is acting up here, so I'm having a little trouble navigating. Let's look at the Colors Reported value counts. Yeah, it's all uppercase. OK, so Colors Reported equals 'RED'. Just on its own, this shows me those rows. So I've got ufo.
loc, and what rows do I want? I want the rows where the city is null and the colors reported is red. What columns do I want? I want all of them. Now, how do I fillna based on a condition? Well, if I want to fill the city in that case, I would just put the City column right here, and then I would assign a new value. That's what I would do. That's it; I think I got that right. So .loc is great when you want to select out certain cities. I want to say: for these rows, specified by a condition, and this column, I want to set a new value.

All right, one second while I ease my throat here. Next one: any practical tips for parallel processing of DataFrames? How do you achieve something like an asynchronous apply? Parallel processing is not my area of expertise, I'm sorry. Can you do parallel processing on a DataFrame? Certainly with Spark you could; you would use a Spark DataFrame. But can you do it with a pandas DataFrame? I'm not sure. If there's anyone who knows, I would love it if you would post an answer, but I'm sorry, I just don't know.

Adrian asks: what are the functions in Python pandas similar to the apply family of functions in R? In R you have apply and sapply and tapply and lapply and rapply. I've actually forgotten what all of those do. However, there is an apply method; there's actually a DataFrame apply method, so you can do ufo.apply, or more commonly ufo.City.
apply, and you can pass an arbitrary function, like... I think that'll work. Nope. What did I do? "Object of type float has no len." I'm doing something wrong, but it's not completely obvious to me what. All right, anyway, I won't try to demonstrate in code. The bottom line is that everything you can do with those apply functions in R, I'm pretty sure you can do in pandas. I don't know offhand how to directly translate all of them, but I do know there is a DataFrame apply method like this and a Series apply method like this, and you can pass them arbitrary functions. I've got a video on using the apply method, I think it's video 30 in the series, so I would check that out. But I don't know how to directly translate the code without relearning some R that I have forgotten.

Marvin: why isn't it called koala instead of pandas? I assume pandas is short for panel data, I think, and that's why it's called pandas. It's not because of the animal, but it's kind of fun as an animal.

Om: what are some examples of 3D data that Panels are used for? As I mentioned previously, I don't think of panel data as 3D data as much as I think of it as a DataFrame with a hierarchical index. I don't really know how to answer it other than that. Would Cartesian coordinates be considered 3D? You could put that in a Panel, but I don't know if it would accomplish your objective. You should always start with your objective and end with the solution. Is a Panel the right solution? Probably not. Is a hierarchical index the right solution? I don't think so, but it depends on your goal. That's the best I can say there.

Tokes Ads: are you planning to continue the video series about pandas in any way, or is it just going to stay at the 30 videos provided? Don't get me
wrong, the series is great, but I think not all pandas topics are covered yet. Of course not everything is covered. Covering everything would probably require hundreds of videos, and I don't think I could make hundreds of pandas videos; I would get tired of it eventually. Am I going to continue it? Probably. I enjoyed making it a lot. Of the 30 videos I did create, the ones at the beginning took me about four hours per video to make, and by the end, because the topics were getting more complex, it probably took me 8 hours a video. So if I want to continue with the series, that's me committing to using one of my work days each week to give away a free video. I probably will do that, but I have to think carefully about the tradeoff, because I've got a lot of other projects in mind, and working one day a week on pandas means only four days a week for any other projects. So yes is the likely answer. I don't have a timeline. It will be intermediate topics, I assume, because I've covered most of the basic topics, but I can't promise when it might be.

Scott: preferred visualization method for exploring data? Native pandas, matplotlib, ggplot, etc.? I use pandas as much as possible because my data is already in a DataFrame and I can do it in one line of code most of the time. matplotlib is what's working under the hood, and I tend not to write matplotlib code because I find it a little bit painful. It's just not how I think, so I find it challenging. I mostly make exploratory graphs anyway, so doing it in pandas seems to work fine most of the time. ggplot I have not played around with, but now that it's under active development once again, I am planning to try it out, because I really like ggplot in R. So I will probably learn ggplot in Python at some point and see how I like it there, but I can only imagine I will
like it, because I like how it works in R. Seaborn is another option, and I like using Seaborn. It produces pretty visualizations, and it makes certain things that are painful to do in pandas or matplotlib easy. I've used it many times. The one downside is that it's yet another interface to learn, yet another library to learn. So the short answer is I do as much as I can in pandas.

Al asks: when and how should you use filter, the pandas DataFrame filter? I have never heard of filter, but I will pull it up on my second screen and scan it. On first glance, I have no idea when to use it. I'd have to look into it; I've never used it personally. Sorry about that, Al.

Al asks another one; you know, if you vote, you'll get your questions answered. %%time versus %%timeit. So timeit runs something multiple times, and I don't think it outputs the result. Let's just try that. If I do %%timeit... whoops, let's see, it's a cell magic, but what did I do wrong here? Oh, yes, sorry, one percent sign. I get confused about the one versus two percentage signs. But timeit runs the code multiple times and tells you how long it took; time runs it once and actually outputs the result. So I usually use time because I actually care about the result. You use timeit if you don't care about the result and only care about seeing how fast it ran. That's the difference. I don't know the parameters for timeit.

Abu: difference between a PySpark DataFrame and a pandas DataFrame; need to know when to use each. A pandas DataFrame is good for in-memory computing. Spark is good for distributed computing. You shouldn't use Spark if you don't need distributed computing, but if you need distributed computing, I don't think you
can do it in pandas, to my knowledge. Maybe you can. PySpark, or Spark in general accessed through PySpark, has a DataFrame structure, and that DataFrame structure has methods similar to the pandas DataFrame, but they're not identical. pandas is going to be easier to use. Eventually Spark may catch up, but you're better off sticking with pandas unless you need distributed computing.

Don asks: tell me why I can't plot; there's no graph in my result. The problem Don had is that he's trying to plot something and it didn't appear. If you recall, that happened to me earlier in the webcast before I ran %matplotlib inline. In the Jupyter notebook, if you want plots to appear in the notebook, you have to run %matplotlib inline, and you only need to run it once. Now, if you're not in IPython or the Jupyter notebook, I think what you need to do is import matplotlib.pyplot as plt and call plt.show(). I believe that's what you need to do if plots are not appearing in your Python environment and you're not using IPython or the notebook. But I could be wrong; I'm in the notebook so often that I sometimes forget.

Kieran: how to check if two DataFrames are the same? equals is not working. Also, how to find the delta between two DataFrames? Hmm. I actually don't know if there is a way. Well, let's do a little test. Let's see how we can do this. ufo2... no, I don't want to do that one. drinks2 equals pd.read_csv, um,
the drinks by country file. OK, so let's do that, and then let's compare: drinks == drinks2. Ah, OK, it outputs a DataFrame of booleans. So here's what I would do: I would add .sum(), and we'll change it to not-equal. This is what I would do, I think: I'm checking in how many cases the cells in drinks are not equal to drinks2, and there's no difference in this case. Now, if I changed drinks.iloc[0, 0] to be something else, now there's one difference. So that's the method I would use, but it relies on them having the same index at the very least, and I think they'd have to have the same number of columns and everything.

OK, if you can believe it, I'm still going and there are still questions. I'll keep trying. A lot of folks have stuck with me, so I'm going to keep going, because my voice has not run out.

From Erica: how would I go about removing subgroups from one DataFrame column? What we're trying to do is filter a DataFrame by multiple categories. Let me think which is a good one. How about we go with drinks2. So, filtering by multiple categories: if I only wanted to keep Asia and Africa, I would say drinks2.continent dot... what's the method I'm looking for? isin, and I'd pass it a list of Asia and Africa. That will show me only the rows where continent is Asia or Africa. If I want to get all the ones except Asia or Africa, I would just add this tilde, which means not: it reverses Trues to Falses and Falses to Trues, and I get everything except Asia and Africa. So that's what I would do. In cases like these, again, I will emphasize that you don't need to write for loops.

I'll just glance at the chat for a second. Let's see. Actually, I'll come back
to the chat at the end if I still have time and a voice.

From Joe: is there a way to get describe to include mode and median values? Good, an easy one, at least for me, but if you're new, of course it might not be easy. If you say drinks.beer_servings.describe(), it tells you the median: that's what this 50th percentile means. These are the percentiles. This is your five-number summary: your min, your max, your median, your 25th percentile, and your 75th percentile. So there's your median. If you have a categorical column, like continent, describe tells you the mode, and it's right here, labeled as top.

Balam Muran: how will I specify "all else" during a map command? For example, if I want to map male as one and everything else as zero in a DataFrame column, how can I do that? Let's use head again. Actually, I'll use drinks2, because I didn't mess it up. Let's say I want to create a new column where, if it's Asia, I want a one, otherwise I want a zero. drinks2.
Asia, sorry, not Asia: drinks2.continent == 'Asia'. Well, actually, there's a really simple way to do this. That gives me Trues and Falses, and I can just say astype(int), and it gives me ones and zeros. So everywhere it's Asia you get a one, otherwise it's a zero. If you had a more complicated scenario, you could do an or. There might be a simpler way to do this, actually there is, but I'm not going to do it right now: or continent equals Africa, and then wrap this whole thing in parentheses, and now we get those. There are lots of ways to do it from here. Whatever example I was thinking of, I lost it; I was going to add some complexity, but there's no need at this point. That's how I tend to do things: I use astype a lot, I use map a lot, and I use conditions a lot.

Scott asks: I use Mode's SQL and Python combo. Are there performance reasons to do aggregation in SQL, or should I just pull the granular-level data? Given the two options you're giving me, I'm not sure what you mean by pulling granular-level data versus aggregation in SQL. Just speaking to performance, pandas is pretty highly optimized. You could write in pandas the same thing you want to do in SQL. Which is faster? It's hard to say; it's not my area of expertise, but they're both generally well optimized, depending on the particular operation. So it's hard for me to give strong advice here, other than that I try to do as much as possible in pandas because I know it well. But if you know SQL well, you can do some stuff in SQL first, then export the results and load them into pandas at that point.

Miguel: is there an easy or automated way to switch or translate from a script to a Jupyter notebook? Yes, there is. Well, from a script to a Jupyter notebook, I don't know it offhand; I would just copy and paste it into the notebook. From a
notebook to a script, what you want to use is something called nbconvert: jupyter nbconvert. You do it at the command line, and there are some examples here, and it's how I take my notebooks and turn them into scripts. So you would do this at the command line: you would say something like jupyter nbconvert --to python, and I have a template, so I would say --template, and I have this template called clean.tpl, and then I pass it the name of the notebook, my_notebook.ipynb. Something like that will convert a Jupyter notebook to a Python script, and again, you do that at the command line.
Kumar: "Can you please explain how to detect and filter outliers?" I actually talked about that earlier. You just have to define what an outlier means, and either visualize the data to look for them, or write some hard-coded rules, or you could perhaps use machine learning.
Obby asks: "When are you going to start a video series on TensorFlow?" I don't know. I may eventually; it won't be in the near future, but I do get a lot of requests for a deep learning series.
Kumar asks: "What's an indicator matrix?" I think that's pretty far beyond the scope of the webcast, so I'm going to leave that alone. Well, I mean, I think offhand I know what it is; actually, I might be thinking of the identity matrix, which might be different from the indicator matrix. I don't want to get the wrong answer here and give you bad advice, so I'm simply not going to answer that one today.
Carlo, about the Jupyter notebook: "With print(df) the text is getting right-aligned; when I use df, a float is left-aligned. Is there a way to get the floats in the DataFrame aligned on the decimal sign?" Not that I know of, but as always, feel free to make a comment under this question if you know personally.
Anette: "What's the difference between pandas and a NumPy array?" So pandas is built
on top of NumPy. Under the hood, everything in pandas is stored as NumPy arrays. There may be some exceptions to that, but I'm pretty confident in that statement. The thing with NumPy is that everything has to be the same type, but pandas makes it so you can have heterogeneous types, so you can have different data types in every column. pandas gives you a lot more high-level operations; NumPy is more focused on how to do computations really fast. pandas takes advantage of that, but the bottom line is, if you're working with real-world data, I generally recommend using pandas rather than NumPy.
Kieran asks: "What are primitive methods in pandas? Are they more efficient than others?" I actually don't know what a primitive method is. Maybe it's a method that's called by other methods. I would say the following: don't overthink how to get something done in pandas if you know of a way. Here are the signs that you should probably try doing things a different way: number one, you're writing if statements; number two, you're writing for loops; number three, you're writing apply statements, or the apply method. Now, apply is sometimes super useful, but sometimes you can replace it with a built-in pandas function. So that's how I would optimize my code. I don't know about primitive methods, and I don't even want to guess. Bottom line is, I wouldn't focus on which pandas method I should use. If I'm using a built-in pandas method, it's usually been optimized, and you don't want to overthink it.
All right, I think this is the final question that's been posted: "When to use the brackets when writing conditions to filter? Always safe to use them?" Brackets are for when you want to pass it to something else. So if my condition is drinks.
continent == 'Africa', brackets don't belong there. However, if I want to pass that Boolean Series to a DataFrame in order to filter rows, I use bracket notation, and that pulls out the rows in which there was a True. Similarly, if you're going to use loc or iloc or ix or iat (which I don't even know what it does, I just know it exists), with loc and iloc especially, you need to use brackets, and that's just how it works.
Oh, he asks: "I meant combining two conditions." Okay, great question. So he's asking if you have two conditions, like drinks.continent, let's say, or == 'Asia'. If you use multiple conditions in pandas, it's actually required that you use parentheses. Whoops, what did I do? Not loc, that's not what I want. I just want plain old this. Okay, so it is required that you use parentheses if you are using multiple conditions and passing them to bracket notation. That is all the questions there. I will glance through the bottom of the chat for other questions.
Francisco asks: "Kevin, how long did it take you to master R before changing to Python?" I don't know. I don't know if I mastered R; I don't know if I mastered Python. I spent maybe a year in R before I started learning Python, maybe six months in R, but I wrote a lot of R code during that time, so that helped. I guess that's all I would say.
Al asked, for timeit: "What's the difference between number of loops and number of times the cell ran?" I think that's the same thing. I guess I'm not sure.
Araj asked: "Please recommend a learning path for a successful career in data science. I know it's off topic, but I would be glad if you answer this." Okay, I'm going to give you a secret. So if you go to Data School (this is not anywhere public, so to those of you who have stuck around the entire time), I want to give you
something special. So go to dataschool.io, talk python (whoops), talk python. Okay, dataschool.io, talk python. This is something that I gave to the listeners of the Talk Python To Me podcast, and it's a guide on launching your data science career. I spent a lot of time writing it, and I will eventually make it fully public, but since you asked, and since you stuck around till the end, I wanted to give you that. So again, it's dataschool.io, talk python, and I'll just paste it in chat for you. So check that out. I will eventually make this maybe a blog post or something else, but for now it's just set as private, and it's got all my advice in there.
Let me go back to my face, because I am done showing my screen. One second, and close the screen. All right, so to those of you who stuck around all the way to the end, I am very impressed. Thank you so much for joining me. A couple of end notes before you leave. Number one, the recording will be available shortly on this page. All the questions will be time-coded; you can click on one and it will jump right to your question. If you haven't already, I highly recommend watching my pandas video series. It's free on YouTube, there are 30 videos; just click the button underneath the video that says "watch the pandas video series". If you enjoyed this and want to see another webcast, I would encourage you to follow me on Crowdcast. Find something I've written in chat, and I think you can just click on my name and it will take you to a page that's like my profile page, and you can click follow, and then you'll get announcements every time I have a public webcast. If you want to follow me elsewhere, go to my website, it's just dataschool.io, and there are little buttons here for Twitter, LinkedIn, GitHub. I have a lot on GitHub, I'm active on Twitter, I'm going to launch a Facebook page, and I've got obviously
lots of YouTube videos, so if you want to follow me there, feel free. I do teach an online course called Machine Learning with Text in Python. There's a link to it on my website: just scroll below this sidebar and you'll see something that says "Data School courses". Click on that if you want to learn some more from me; I have online courses. And, let's see, if you're not already on the Data School newsletter, I've added most of you, and I'll be adding the rest of you soon. You can unsubscribe, of course, but it's a great way for you to hear about new tutorials. It's a great way to hear about when I record new videos, because quite frankly, when you subscribe on YouTube, YouTube doesn't necessarily email you when I come out with a new video, and you may never see it in your feed. So that's why I like to share everything with my newsletter; that's kind of like my community. Really glad to have the Data School community. So again, I will be adding you to the newsletter shortly if you're not already on it, and you can definitely keep in touch with me that way.
All right, just doing one last scan through the comments to see if there's anything else. Tok sad said: "Will the recording be on YouTube?" Yes, I will upload it to YouTube. Kieran said it would be great to have a webcast on scikit-learn. I may do that. I don't know if it will be this year; I would love to do it. I let my newsletter subscribers vote and they narrowly picked pandas, but so many picked scikit-learn that I kind of want to do one of those as well. And there you go. It was such a pleasure for you guys, you guys and girls, to join me today. Thanks again, it was really fun. I'm going to go eat some lunch and rest my throat. Have a good rest of your day, and I'll see you on YouTube. You can reply to my newsletter and I will get your email, and I'll usually respond. So okay, again, thanks for joining me, and I will see you
later. That's it.
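The filtering answers near the end of the transcript (a condition on its own, astype for ones and zeros, bracket notation to filter rows, and parentheses around multiple conditions) can be sketched roughly as follows. This is only an illustration: the small DataFrame is a made-up stand-in for the drinks dataset, not the real file from the gist, and the isin line is one shorter alternative for an "or" filter, not necessarily the one the speaker had in mind.

```python
import pandas as pd

# Hypothetical stand-in for the "drinks" dataset; the values are made up.
drinks = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "continent": ["Asia", "Africa", "Europe", "Asia"],
    "beer_servings": [10, 20, 300, 50],
})

# A condition on its own produces a boolean Series; no brackets are needed yet.
is_asia = drinks.continent == "Asia"

# astype(int) converts True/False into 1/0.
asia_flag = is_asia.astype(int)

# Passing the boolean Series in brackets keeps only the rows where it is True.
asia_rows = drinks[is_asia]

# With multiple conditions, each one is wrapped in parentheses
# and they are combined with | (or) or & (and).
asia_or_africa = drinks[(drinks.continent == "Asia") | (drinks.continent == "Africa")]

# isin expresses the same "or" filter on a single column more compactly.
same_rows = drinks[drinks.continent.isin(["Asia", "Africa"])]

print(asia_flag.tolist())
print(asia_or_africa)
```

The parentheses matter because | and & bind more tightly than == in Python, so leaving them out changes how the expression is grouped.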
Original Description
During this two-hour webcast, I answered 45 viewer questions about pandas, the leading Python library for data analysis, exploration, and manipulation. View the complete list of questions on Crowdcast: https://www.crowdcast.io/e/pandas?rfsn=402783.36d99
Here is the code for loading the datasets used during the webcast: https://gist.github.com/justmarkham/5c04d245cc70cdbd00f00a2bae5a54da
Follow me on Crowdcast for announcements about new webcasts: https://www.crowdcast.io/justmarkham?rfsn=402783.36d99
Want to learn pandas from the ground up? Watch my pandas video series (30+ videos): https://www.youtube.com/playlist?list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y
== LET'S CONNECT! ==
Newsletter: https://www.dataschool.io/subscribe/
Twitter: https://twitter.com/justmarkham
Facebook: https://www.facebook.com/DataScienceSchool/
YouTube: https://www.youtube.com/user/dataschool?sub_confirmation=1
JOIN the "Data School Insiders" community and receive exclusive rewards: https://www.patreon.com/dataschool