Hands-on with A/B Testing #abtesting #datascience

Analytics Vidhya · Beginner · 📐 ML Fundamentals · 3y ago

Skills: Supervised Learning 80% · ML Maths Basics 70% · ML Pipelines 50%

Key Takeaways

This video covers A/B testing, a crucial component of machine learning deployments, and demonstrates how to design and analyze A/B tests using statistical significance, confidence intervals, and hypothesis testing. Tools and techniques such as z-tests, pooled variance, and variance reduction are also discussed.

Full Transcript

Hello and welcome, everyone, to another session in the DataHour series. We are thrilled to be here with you this evening for a session full of action-packed learning. I am Abhishek Kumar Singh, part of the data science team at Analytics Vidhya. For those who have joined us for the first time, a brief introduction: the DataHour is a series of webinars conducted by Analytics Vidhya and led by top industry experts. It's a fun way to understand the concepts of data science from the leading players in the data tech domain, and as the name suggests, it's one hour dedicated to data. We are hopeful that the sessions are going to be a great source of enrichment and value for our community members.

Now on to today's session, which is about hands-on A/B testing. A/B testing is a crucial but less talked about component of machine learning deployments. It ensures that we release changes incrementally, to get an approximate estimate of the effects of a change before exposing a larger audience to its impact. This DataHour will cover why we need A/B testing, how to perform it, and the basic maths behind it. We'll also discuss pitfalls to avoid and some rules of thumb that you can follow. I hope you are excited to attend this DataHour with us.

Before we kick things off and I hand it over to our beloved speakers, a quick recap of the housekeeping items. We are recording the session, and the recording will be available on our YouTube channel; you can also find the links in the chat section. Please use the Q&A section for any questions you might have during the session, and we will do our best to answer them as the DataHour progresses or towards the end. Lastly, we will share a feedback poll towards the end; you are all requested to kindly fill that up before leaving the session.

Coming on to our speakers: in this session of DataHour we have Mr. Kapil Kumar and Mr. Ravi Kumar with us. Ravi Kumar is a data scientist at CRED, where he currently leads the
NLP vertical. He has worked extensively in NLP, deep learning, edge computing, graph neural networks, and large-scale model deployment. He has over five years of experience across verticals in the fintech domain. He completed his B.Tech in CSE from IIT Guwahati, where he co-founded his startup Build Blocks, which helped real estate businesses with networking and a marketplace. You can find the links for our speakers in the chat section. Our next speaker, Mr. Kapil Kumar, is a data scientist who works on recommendation and personalization at CRED. Prior to this he worked on recommendation, NLP, and speech at Dailyhunt and Josh, and he has over four years of experience in this field. He completed his B.Tech in CSE from NIT Delhi. He participated in GSoC twice, where he worked on ML projects with organizations like INCF and OBF. Over to you, Ravi and Kapil; the virtual stage is all yours.

Good evening, everyone, and thanks for joining us on a Friday night. Let me quickly start with A/B testing. We'll begin with an introduction to the A/B test and some terminology we want to address before we go to the hands-on part.

The first thing is that A/B testing is talked about very little, even in college courses and other places, yet it is widely adopted across the industry: Google, Facebook, anywhere you go you'll find A/B testing going on. So before going into what A/B testing is, I want to address a few points on why we do A/B testing.

Whenever we are going to launch a feature or product, it can be as simple as changing the color of a button, or as big as launching a new recommendation model on YouTube, Instagram, or TikTok. The first thing we want is to gain confidence in the feature we are
releasing, and to know what we can expect from the feature itself. If I am going to launch a feature, I'll be looking at what CTR I'm getting on it. If I'm launching a new recommendation model, I'll look at how many people are coming and looking at it, whether users are engaging with the feature, and whether they are satisfied with the product.

Second, a new feature can be buggy, so people also use A/B testing to roll out the feature gradually. With a limited rollout they come to know, for example, that user engagement is dropping because there is some bug in the feature itself, and they can go and debug it. Companies across the industry use it this way.

The third part, one of the critical things you should understand as a data scientist, is causal impact. Causal impact is not correlation. Say you're seeing a lift in one metric and you similarly see a lift in another metric: that is correlation. But if you want to understand why the lift has come and which feature actually contributed to it, that is causal impact. Addressing this is very important, because if some feature really matters to us, we want to make that feature more prominent to the user.

The fourth part is insight generation. People run A/B tests, not randomly but with some hypothesis, just to get insight into the user. For example, in Google Pay we get cashback rewards. At some point Google started giving offers instead of cashback. They must have run an A/B test to learn whether users are really happier with cashback or with the
offers they are getting for a merchant. This kind of data-driven insight can be generated from an A/B test.

The fifth point is uplift. Say users engage with a particular recommendation model, but the cost of releasing and maintaining that model is pretty high. Before launching the product to 100 percent of the user base, we want to establish the uplift we'll be getting, say a 10 percent uplift on the current CTR, against a cost of, say, a million dollars. Then leadership can decide whether to launch the product to 100 percent or not. This puts a number on the profit the company will be getting in the end.

One of the famous terms people in the industry use is HiPPO, the highest-paid person's opinion. These are generally the managers, the people with the highest-paying jobs. They come with a cognitive bias and think their assumptions will hold true in the current scenario. But if you have the data, why not prove whether it's really going to work out or not? That's why A/B testing is so widely adopted across the industry.

So let's move to what A/B testing is. In layman's terms, an A/B test compares the distributions of two groups. One is the control group, which has not been given any treatment: you have not released the feature to them. The other group has seen the feature, the recommendation, or whatever you're launching: they've got the treatment. We want to compare those two groups and understand what is actually happening. Currently at Google, Facebook, Netflix, or any such company, there are more than 5,000 experiments running on any given day. That's the scale at which experimentation has been adopted at these companies.
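As a rough illustration of the control-versus-treatment comparison described above, the Python sketch below assigns simulated users to two buckets and compares their click-through rates. The experiment name, the 50/50 hash-based split, the user count, and the 30 and 32 percent click probabilities are all assumptions made for illustration; they are not taken from the speakers' notebook.

```python
import hashlib
import random


def assign_bucket(user_id: str, experiment: str = "homepage_reco_v2") -> str:
    """Deterministically map a user to 'control' or 'treatment' (about 50/50)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"


def simulate(n_users: int = 20_000, seed: int = 7) -> dict:
    """Simulate one impression per user and return the observed CTR per bucket."""
    rng = random.Random(seed)
    click_prob = {"control": 0.30, "treatment": 0.32}  # assumed true rates
    shown = {"control": 0, "treatment": 0}
    clicked = {"control": 0, "treatment": 0}
    for i in range(n_users):
        bucket = assign_bucket(f"user-{i}")
        shown[bucket] += 1
        clicked[bucket] += rng.random() < click_prob[bucket]  # True counts as 1
    return {bucket: clicked[bucket] / shown[bucket] for bucket in shown}


rates = simulate()
absolute_lift = rates["treatment"] - rates["control"]
relative_lift = absolute_lift / rates["control"]
print(f"control CTR:   {rates['control']:.4f}")
print(f"treatment CTR: {rates['treatment']:.4f}")
print(f"absolute lift: {absolute_lift:+.4f}, relative lift: {relative_lift:+.2%}")
```

Hashing the user ID keeps each user in the same bucket on every visit, which is one simple way to stop users from hopping between control and treatment. The observed lift here is only a point estimate; whether it is distinguishable from chance is a separate question.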
What they are trying to establish with so many experiments is the statistical significance of each experiment. That is, with a number and not by intuition, they want to say: this is the lift we can get by launching this feature. And A/B experiments are not new; they go back about two centuries, to around 1835 for a drug trial. So the experiment itself is nothing new, but the algorithms and concepts for how we use it have matured by now.

There are multiple types of A/B testing. One is sequential, which helps us launch a new feature or rollout step by step. Then there is bootstrap. Then there is split, where we divide users into two or more groups, give them treatment one, treatment two, or treatment three while keeping one group as control, and measure the lift on that. In this hands-on we are only talking about split testing and will not be covering sequential or bootstrap.

Before going ahead with the hands-on A/B testing, I want to explain a few prerequisite statistics terms, so bear with me. First, mean, variance, and standard deviation. These are pretty basic. The mean is simply the average of all the values we have observed. The variance is the spread of the observed data, that is, how far it is from the mean. The standard deviation is simply the square root of the variance.

I also want to explain the CLT, the central limit theorem. Why do we use it? Say you have some unknown distribution, and the number of samples is in the millions. You want to find, for example, the average time a user spends in the Facebook app. So how will you define those
things if you have been given a humongous amount of data and you don't have the capacity or computational power? What the central limit theorem says is: don't worry, just take a sample of a thousand users, take their mean, and keep repeating that, say 100 times. What you'll observe is that the mean of the sample means will be approximately equal to the population mean. The standard error, that is, how much a sample mean typically differs from the population mean, is the standard deviation divided by the square root of the sample size. The third interesting part is that if the sample size is greater than 30 and you repeat the sampling many times, the distribution of the sample means will be approximately a normal curve.

Here I can show you a simple simulation of the central limit theorem, where we are sampling from a simple distribution. I'm first decreasing the number of simulations and also decreasing the sample size. If you see here, the curve has a very high standard deviation and is not very mature. But if I increase the sample size and increase the number of simulations, you can see that the standard deviation has gone down considerably and we have a better curve. That's what the central limit theorem is doing here.

Let's go to the next part. After CLT, one of the critical things is that whenever we do A/B testing, we need some kind of hypothesis. Whatever feature you are launching, you have to have a hypothesis about what effect it will have in the end. Defining a hypothesis is not easy. The industry has grown and defined multiple things to capture the impact of that
feature, so defining the metrics is very critical for the hypothesis: what kind of metrics will we be looking at for any feature? In general, across the industry people use HEART when launching a feature: happiness, engagement, adoption, retention, and task success. What does that mean? Happiness is a proxy for user satisfaction. Engagement is how users engage with the feature, for example how many times a day they use it. Adoption is about new users who have been exposed to the feature: how likely is a person to adopt it? Retention is, if people have used it this month, will they come back again next month? Task success is something we can define as, for example, the revenue I'm getting by launching this feature. These become indicators: if I launch a feature, they help me understand the impact I'm getting, and we can build a hypothesis around them.

The metric you choose should be closely tied to the treatment. Say you're launching a brand new YouTube recommendation model on the home page. What you will be looking at is the CTR: by how much have clicks through to videos increased, has it lifted by five percent or ten percent? That's the kind of metric we're looking for.

There are multiple pitfalls here, and I will give you an example from search. Say you're building a search algorithm for Bing or Google, and you say you're going to launch a new search engine that will increase CTR. Obviously, if the CTR increases, the ad revenue for the search engine, Bing or Google Ads, will keep increasing. But what you have done is
that you have effectively made a paid search engine: whatever you search for, the real results come on the second or third page. So you've increased the CTR and the ad revenue will increase for a short period of time, but user satisfaction goes for a toss, and your retention will drop quickly, within a week or by the next month. That creates a problem. So building a hypothesis around just one metric is not good.

While creating a hypothesis for a treatment, we define two types of hypothesis: the null hypothesis and the alternative hypothesis. The null hypothesis says there is no difference between control and treatment. Say you have launched some feature in treatment; the null hypothesis is that this group of people and that group of people will behave exactly the same. The alternative hypothesis says there will be some effect, which can be either positive or negative. The objective of an A/B experiment is to reject the null hypothesis: we are trying to show that the two groups are not the same.

There are a few more terms we want to address. The first is statistical significance. We are observing two groups and want to find which one is doing better. But users vary in nature, so there is a chance that you see an uplift purely by luck. We want to rule that out: whatever we are observing should be not luck but an effect the treatment actually produced in the treatment group. We define two terms here: the p-value and the significance level. The p-value tells us, for whatever
result we are looking at, how extreme the observation is when the null hypothesis is true. What does that mean? The null hypothesis says there is no change between the groups. So the p-value is a conditional probability: the probability of seeing a lift at least this large given that the null hypothesis is true. A small p-value indicates that the value we are getting is unlikely to be luck and is more likely the effect of the treatment. Alpha is a kind of hyperparameter, the significance level, which I will cover in a later slide. We reject the null hypothesis whenever the p-value is less than alpha. How we calculate the p-value will be part of the notebook.

With the p-value we decide whether to reject the null hypothesis. But for whatever treatment you are giving to a group, the number alone is not enough to say that you have got a lift and are going to launch the product. You also want to know how much confidence you have in the result you are observing. You want to state a confidence: I am 95 percent confident that the same result will be observed if I repeat the simulation 100 times.

So the first thing is the confidence level. It's pretty simple: it's 1 minus alpha. Alpha, as we know, is the significance level, 0.05, so 1 minus alpha gives 0.95, which is the confidence level. We usually write it as a percentage, 95 percent.

The confidence interval I'll explain with a graph, which will be much easier. Here we have a 95 percent confidence level. The graph is a normal distribution with mu at the center. Mu plus or minus one standard deviation is the pink-red part, then mu
plus or minus two standard deviations is the blue part, and then the green part. When I say 95 percent confidence, it means that if I sample data from this normal distribution, the data will lie between mu minus 2 standard deviations and mu plus 2 standard deviations with roughly 95 percent confidence. So the confidence level is 95 percent, and the range from mu minus 2 standard deviations to mu plus 2 standard deviations is the confidence interval for this curve. That is what we mean by the confidence interval: the mean plus or minus 2 times the standard deviation is a range of values, and we are 95 percent confident that the result we are expecting will fall within that range.

The next part is errors. Whenever we are modeling something or running any kind of experiment, everything is prone to some error. In an A/B experiment we divide these errors into two parts. One is the type 1 error, the false positive, which you can see here, and the other is the type 2 error, the false negative. And here is alpha, the significance level we defined before: it is the type 1 error rate.

A type 1 error is a false positive. It means that the treatment made no real difference, but you still observe an effect. Say your color change had no real impact, yet the treatment group happened to click more. People used the feature by luck or by mistake, not because of your change, and you wrongly credit the treatment. That is a type 1 error, and we want to avoid it. The type 2 error is the false negative,
which means that the treatment does have a real effect but the experiment fails to show it: the difference is not detected. We worry less about that one; it mainly helps us in defining the sample size. The type 1 error is very important to us, and we set it to 0.05.

In the next graph I'll explain what type 1 and type 2 errors actually are. We have two curves, a blue curve and a red curve. The blue curve is the distribution of the first group, the control, and the red curve is the distribution of the group that got the treatment. The type 1 error is the part of the control curve beyond the threshold: it looks like a conversion lift, but it is just luck. The type 2 error is the part of the treatment curve that falls short of the threshold: those cases got the treatment but do not show up as a lift. So we don't ignore type 2, but we need to care about the type 1 error at every moment.

Here I'll show you how the curves move in the notebook. We have the significance level of 0.05 that we defined before, and a simulation with a control mean, which we can change at any time, and a treatment mean, where we are looking at a lift of, let's say, 10 percent. That's 68. Yeah, 68.
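The two-curve picture of type 1 and type 2 errors can be sketched numerically. The snippet below is a minimal illustration, not the speakers' notebook: it assumes both curves are normal sampling distributions, uses a one-sided upper-tail threshold, and the means and standard deviations are made-up values chosen only to show how narrower curves shrink the type 2 error.

```python
from statistics import NormalDist


def error_rates(control_mean, treatment_mean, control_sd, treatment_sd, alpha=0.05):
    """Type 1 and type 2 error rates for a one-sided (upper-tail) test.

    The two normal curves stand for the sampling distributions of the metric
    under control (null hypothesis) and under treatment (alternative).
    """
    control = NormalDist(control_mean, control_sd)
    treatment = NormalDist(treatment_mean, treatment_sd)
    # Threshold: the point on the control curve with alpha of its area to the right.
    critical = control.inv_cdf(1 - alpha)
    type1 = 1 - control.cdf(critical)  # control area beyond the threshold (equals alpha)
    type2 = treatment.cdf(critical)    # treatment area that falls short of the threshold
    return critical, type1, type2, 1 - type2


# Hypothetical settings: same means, wide versus narrow curves.
wide = error_rates(0.30, 0.32, 0.010, 0.010)
narrow = error_rates(0.30, 0.32, 0.005, 0.005)
for label, (critical, type1, type2, power) in (("wide", wide), ("narrow", narrow)):
    print(f"{label}: threshold={critical:.4f} type1={type1:.3f} "
          f"type2={type2:.3f} power={power:.3f}")
```

The type 1 rate stays fixed at alpha by construction, while the type 2 rate depends on how much the two curves overlap. Collecting more samples narrows the curves, which is why the type 2 rate feeds into the sample size calculation.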
If you see the control standard deviation, I'm decreasing it a little and then increasing it, and I'm also increasing the treatment standard deviation. With this setting, whatever we are observing is basically random, so we can't say whether our treatment is doing better or not. Let me do a simple thing and increase the standard deviation here. If you see here, this is the part we call the type 2 error, and this becomes the type 1 error. If the treatment mean lies beyond the significance threshold, we can reject the null hypothesis and accept the alternative hypothesis.

Kapil, do you want to go ahead? I'll stop sharing my screen.

Sure, cool. One more thing to cover here is the type of experiment we are going ahead with. For example, say you want to measure the effect of a drug on a user. That drug can affect the user positively and also negatively, so in that case we want to measure both extremes of the data. Say the effectiveness of your current drug is 0.9, and the new drug that you introduce is your treatment. If it performs very badly, it would lie somewhere here: it's statistically significant, and we can say that the treatment is not working or is having a bad effect. Similarly, if the treatment is working pretty well, the new effectiveness would lie somewhere here. So for a serious experiment where we want to measure both scenarios, either a positive impact or a negative impact, we go for a two-tailed test. What two-tailed
means is that we are keeping an eye on both extremes of the curve. Even if there is a positive result we will accept the alternate hypothesis, and even if there is a negative result we will accept the alternate hypothesis. It's just the interpretation of the alternate hypothesis, whether the result is positive or negative, that changes.

Now consider a different kind of experiment. Say your company has built a feature and you're mandated to release it, or it is pretty certain that you are going ahead with it, and you just want to measure its impact. Then you can go ahead with a one-tailed experiment. For example, say you have redesigned your home page, you're pretty sure you're going to launch it, and there are a few features you're going to launch along with it. You just want to test whether it is performing worse than or on par with the current version. In this case we are only looking at the left tail. For experiments where you are only concerned with one kind of result and the other kind doesn't matter, you choose a one-tailed test.

So how do you decide between a one-tailed and a two-tailed test? Take the scenarios: what will I do if there is a positive impact, what will I do if there is a negative impact, and what will I do if there is no impact? If your answers for negative impact and no impact are the same, you are looking at a one-tailed test. If your answers for negative impact, positive impact, and no impact are all different, you go ahead with a two-tailed test.

The next thing we are looking at is the complete procedure, the experiment flow: from the start of the experiment, through the launch of the experiment, to getting
the analysis of the experiment and making the final decision on top of it.

The first step of this flow is the experiment setup. In the experiment setup, someone comes with a hypothesis. Say you built a recommendation algorithm and you want to test whether it is working better or not, whether it is increasing the CTR or not. That is your hypothesis, something like: this feature will increase my metric. The other things you want to fix at the point of experiment setup are the statistical significance you are chasing, that is, the false positive rate you can accept. By default people usually use 0.05, so decide what value you are satisfied with. Then you have to choose the beta value as well: what false negative rate are you comfortable with? Then comes the lift that you are trying to observe. Choosing the lift is again very crucial, because sometimes the lift is so small that it is not feasible to launch the feature. Using all of these, you come up with a sample size: the minimum number of users you want to put in each variation so that at the end of the experiment your result is statistically significant.

Someone has raised a hand with a question? Okay, cool, I will move on.

The next part of the experiment flow is the experiment launch. In the launch phase there's not much to do, but you have to make sure that the split you defined actually holds, say 50 percent going to the test bucket and a set percentage going to the control bucket. You don't want a user to hop from the control bucket to the test bucket; this should not happen. Similarly, there should not be a network effect. For example, say you share a post, and now more and more people are
coming to the same post and interacting more with it. What that does is break the IID assumptions that we make while launching the experiment.

In the post-experiment analysis we are interested in the p-value, that is, the final p-value we are getting, the lift we are getting, and the confidence interval of that lift. The last part is decision making. At the end you look at whether your lift is significant and decide: do you want to launch this feature, or do you want to re-run the experiment? Sometimes it makes sense to start the experiment again with a much stricter significance level.

Cool, so let's start with a sample problem. Say you're working at YouTube, you have developed a new graph-based recommendation algorithm, and you want to test it efficiently. In this case, what would be a good hypothesis? A good hypothesis is something that is practical and feasible to test. If your hypothesis is "I've made this change and now people will be happier", there is no practical way to measure that. Or a hypothesis could be "if I show this card to a user, then ten years down the line the user will take a loan or buy a car"; that is impractical and not feasible to test. So a hypothesis has to be feasible, practical, and commercially meaningful.

In this problem statement, the null hypothesis would be that introducing the graph-based recommendation algorithm on the YouTube home page won't impact the conversion rate, or CTR. The alternate hypothesis is just the opposite. Look at the statement we are making: introducing the graph-based recommendation algorithm on the YouTube home page won't have any impact. We are not talking about positive impact or
negative impact; we are talking about any impact. That means we are concerned with both positive and negative impact, and that's why in this experiment we are going ahead with a two-tailed test.

The assumption we usually make when forming this hypothesis is that all events are independent and identically distributed. Here is a case where that fails: say there is a feature where a user can claim an offer only once, but you show it multiple times. The first time the user will claim it, but on the next event they won't, because they have already claimed it. The first event had a consequence for all future events. So you have to make sure that, in the way you are measuring the hypothesis, all your events are independent and there is no network effect. As an example of a network effect, say you're on Facebook, you created a Facebook group, and you invited all your friends: okay, let's play. Now your event is triggering a loop where your engagement increases, and that itself makes you come to Facebook more frequently. The causal impact of the feature we're testing won't be measured cleanly, because other effects are coming in.

Cool, now looking at the metrics. Once you have the hypothesis, you want to measure the efficiency of your algorithm, but which metric do you select? In this case, say we go ahead with the conversion rate, which is a proportion metric. Now, although you are launching a new feature and conversion rate is your main criterion, you don't want to hurt the long-term goals of your company or organization, such as user retention, monetization rate, or overall engagement. For example, say there is an algorithm that recommends a lot of clickbait, so
CTR will increase, but retention will decrease over time because people will stop coming to your platform. These are the guardrail metrics; they are like the North Star, and you don't want to hurt them. If they improve, that is better. But if they don't improve while, let's say, your conversion rate improves, you still have to make a decision, and more often than not the decision is that the feature can't go to production.

Cool, so in this setup, say we are interested in an experiment where we launch a feature on YouTube. Before launching the feature, the base conversion rate is, let's say, 30 percent, meaning users click three videos out of every ten videos shown. The effect that you want to measure is, let's say, two percent, and this two percent is absolute: we want to see whether the CTR becomes 32 percent after introducing the treatment.

How do you choose this minimum detectable effect before launching the feature? It depends on whether it is practically significant. For a big organization like YouTube, even increasing the CTR by 0.1 percent or 0.001 percent can result in millions in revenue. But for a small organization, if you measure a one percent increase in CTR and roll out the feature, the cost of rolling it out may not be offset by the revenue you get. So when you decide on the minimum detectable effect, it should be both statistically and practically significant. When I say practical, I mean it should be feasible.

The third thing is the significance level: what false positive rate can we accommodate? Not expect, but accommodate. For example, say there is no change from control to treatment, but five percent of the time it may be possible that
the sample that you collected from the distribution belongs to the right extreme or the left extreme of the distribution. In that case you can falsely conclude that your experiment is working and roll it out, and once you go to production you won't get the same result. Now let's say your treatment is actually working. Just like the control curve, you have a treatment curve, and it may happen that although the mean of your treatment is good enough, beyond the critical level, the sample you took from that population belongs to the left tail of that curve. In that case you will falsely say that the experiment didn't work. What percentage of the time can you accommodate that scenario? That is beta, here 20 percent, and one minus beta is what you call the statistical power; in this case we are selecting it as 0.8. The hypothesis type is a two-sided hypothesis, as we discussed earlier, and this is the formula for calculating the sample size required to test the hypothesis. We won't go deep into how we derive this formula. The brief intuition is that it is derived from the conditions we put on the power and the significance level: we require that the new statistic we get from the treatment has this much power and this much significance, and from that condition we get this formula. The interesting thing to look at in this sample size formula is the delta, basically the lift that you are expecting. Say your lift is 10 percent: if you decrease the lift by 10 times, your sample size increases by 100 times. So the sample size and your lift are inversely
proportional, so it is very hard to detect a very minor improvement, and the relationship is quadratic: to detect an effect 10 times smaller you need 100 times more samples. Cool. The next part is the post-experiment analysis, where we observe three parameters, starting with the observed lift. Sorry, just a second, before that I want to show you a sample size calculator specifically for this use case. We have created a sample size calculator for both continuous and proportion metrics; in this case we are only concerned with the proportion method, basically the CTR metric. If you look here, alpha is set to 0.05; beta, the false negative rate we can accommodate, is 0.20; the baseline conversion that is currently there is 0.3, basically 30 percent; the absolute delta we want to measure is 0.02; and the experiment type is set to two-tailed. We get 8,286 samples needed in this scenario. As I was mentioning with the formula, if I increase the absolute delta the sample size decreases, and if I decrease the delta to something very small the sample size we need increases a lot. For example, for 0.02 we needed about 8,000, but for 0.01 we need many thousands more. So these are the values that were mentioned in the slide: for this example we would need 8,286 samples per variation, and since we are comparing two variations, control versus treatment, we need double that in total. Cool. So let's say we have the sample size, we have run the experiment, we have made sure that no user jumps from one bucket to another, we have made sure that our IID principles are
followed, and we are now at the end of the experiment. Now what? How do I confirm whether my experiment is successful or not? There are three metrics that we are concerned with. The first is the observed lift, the actual lift we are getting between treatment and control. The second is the significance of that lift: you say a five percent improvement is there, but is it significant? For that we calculate the p-value. The third is the confidence interval of the lift. Cool. For the p-value, what we are trying to find is this: say this is my curve, I randomly sample a value from it, and that value is this data point. The p-value says what the probability is of getting a data point as extreme as or more extreme than the current one. Probability is area under the curve, and these are the only events more extreme than this point, so the p-value is the area under the curve from this point to the edge of the curve. Cool. Another thing I want to introduce is the z-score. It is a measure of how many standard deviations an observed data point is away from the mean. For example, say you have a mean of 10 and a standard deviation of 5, and you are looking at the number 25. You want to find how many standard deviations away from the mean that data point is. So you substitute 10 for mu and 25 for your value: 25 minus 10 is 15, and 15 divided by the standard deviation, which was 5, gives 3.
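The z-score example just described (mean 10, standard deviation 5, observation 25) can be sketched in a few lines of Python. This is only an illustration, not the speaker's notebook: the function names `z_score`, `tail_area` and `two_sided_p_value` are made up for this sketch, and it assumes the curve on the slide is a normal distribution.

```python
from statistics import NormalDist

def z_score(x, mean, std):
    """Number of standard deviations that x lies away from the mean."""
    return (x - mean) / std

def tail_area(z):
    """Area under the standard normal curve from |z| out to the edge (one tail)."""
    return 1 - NormalDist().cdf(abs(z))

def two_sided_p_value(z):
    """Probability of an observation at least this extreme on either side."""
    return 2 * tail_area(z)

z = z_score(25, mean=10, std=5)
print(z)                     # 3.0
print(tail_area(z))          # about 0.00135
print(two_sided_p_value(z))  # about 0.0027
```

The one-tail area corresponds to the "from this point to the edge of the curve" picture; doubling it gives the two-sided p-value used later for the two-tailed test.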
So what it shows is that your data point, the number 25, is three standard deviations away from the mean, and the sign of z just tells you on which side: if z is positive we are looking at this side, and if it is negative we are looking at the other side. Now, both the z value and the p-value that we discussed are concerned with a given data point. Say we have a data point here. For this data point, the p-value is the area under the curve from this point to the edge of the curve, and the z-score is how many standard deviations away this point is from the mean. So the z-score and the p-value are just two different ways of representing how extreme your observation is. For example, for a p-value of 0.05 you get a z-score of 1.96 (for a two-sided test), and for 0.01 you get a much higher z-score. The third thing I want to stress is the confidence interval. A lot of the time people just look at the p-value and the lift and decide to go ahead with the experiment. That is not the recommended way, because the observed delta we get in moving from control to treatment again follows a normal distribution, and it again has a confidence interval. If that interval is too wide, then although for a given experiment you observe a lift of, say, 10 percent, the standard deviation of that lift is so high that when you finally launch the feature you may not get the exact treatment effect you were expecting. Cool. So, a confidence interval is a range of values that is likely to
include your population mean with a certain degree of confidence. For example, if you are looking for a very high confidence, say 95 percent, then your population mean (remember, the population mean is the mean you are trying to estimate for the two distributions) can lie between those two bounds with the given confidence level. One point I want to stress: a 95 percent confidence level doesn't mean that you ran an experiment, observed this treatment effect, and can now say there is a 95 percent chance that the actual lift lies between these two numbers. What it means is that if you ran the same experiment 100 times, and each time created a 95 percent confidence interval, then in about 95 of those 100 experiments the actual population mean would lie within the interval. I just want to make this clear. To build the confidence interval, you take the z-score for your confidence level and multiply it by the standard deviation of the estimate; that gives you the margin of error, and from it the lower bound and the upper bound of the confidence interval. Cool. So till now we have defined our hypothesis, defined our parameters alpha and beta, defined the lift we are trying to measure, and based on that calculated our sample size. Now let's say the experiment has finished and we want to find the actual lift. When we are comparing control and treatment, we usually go ahead with the two-sample z-test, which is very similar to the z-score. If you look at the z statistic, it tells me how many standard deviations your data point is away from the mean:
this is the actual observed value, this is the mean, and this is the standard deviation. This formula looks a little complex, but it is the same thing; it's just that now we are concerned with two distributions, and the observed effect X becomes the difference in the means, the difference between the control mean and the treatment mean that we observe. Interestingly, there is a zero here, whereas there we used the mean. The reason we use a zero here is that we started the experiment with the assumption that the null hypothesis is true, and what I want to measure is: given that the null hypothesis is true, how extreme is this observation? Because we assume the null hypothesis is true, we treat the means of control and treatment as equal, and so the difference of the two means is replaced by zero. Coming to the denominator, this is again a standard deviation, but now we have two samples and we usually don't know the population variance, because we only took a sample. To overcome this we use what we call the pooled variance. For the pooled variance, let's first define a variable called the pooled mean: X1 is the number of positive events in control, X2 is the number of positive events in treatment, and N1 and N2 are the numbers of samples assigned to control and treatment, so the pooled mean p = (X1 + X2) / (N1 + N2) is the overall average. Then p(1 - p) is the variance of the binomial (Bernoulli) distribution, and we divide it by N1 and by N2. Why do we divide by N1 and N2? Because when we build a distribution of the sample mean, its spread, which we call the standard error, is defined as the standard deviation
divided by the root of the number of samples. Since we are working with a variance, we square that, so it becomes p(1 - p), the pooled variance, multiplied by (1/N1 + 1/N2), and the standard deviation in this case is the square root of that. Let's call this term the pooled standard deviation. One assumption we make when we do a z-test is that we have enough samples; the minimum size usually recommended is 30. Another way to state that condition is that the number of users in your control multiplied by the proportion you observe in control should be greater than five. It's just a rule of thumb for saying that I have a large enough sample, so that if I bootstrap those samples and build the sampling distribution, I will get a normal distribution. I will now go to the notebook to show how we do it. We have calculated the sample size and we are saying that we need at least 8,286 samples, so let's simulate first. I have created this function: given the base conversion, the delta, and the number of samples, it simulates the data. The base conversion is 0.3, the delta we are expecting is 0.02, and I have kept the number of samples at 10,000 instead of around 8,200, because it is always better to overshoot on sample size than to undershoot. So let's say we ran the experiment and we are getting 10,000-something samples. This is the data we get: whether the user belongs to treatment or not is given by this column, and this one says whether the user clicked. Okay, once we have
simulated this data, we recall the other conditions we set when we started the experiment: first, the alpha we selected, which was 0.05; second, the kind of experiment, which was two-tailed; and third, the number of samples we got after the experiment. Once we do the post-experiment analysis, let's say we find that the p-value is pretty low: we needed it to be below 0.05, and it is. The test statistic is the z-score that we talked about; again, the p-value and the test statistic are just different ways of representing the extremeness of the observation. The third part I want to stress is the delta that we actually got. When we simulated the data, we asked it to add 0.02 (two percentage points) more probability to the treatment set. After the experiment, this is the delta we get, and this is its 95 percent confidence interval. Looking at this 95 percent confidence interval is pretty important, because it tells us how much lift we can get in the worst case and how much in the best case. So looking at the confidence interval is one of the more important aspects of this. Now let's go to the actual function where we do the implementation. Just like in the formula, we calculate the means of both control and treatment, we calculate the pooled variance that I mentioned, and we take its square root, which gives us the pooled standard error. Then, the same thing we showed in the formula: mean of treatment minus mean of control, divided by the standard
error. Once we get this value we compute the p-value; if it is a two-tailed test you have to multiply it by two, because in a two-tailed test we are concerned with both the left and right extremes. Similarly, we can get z critical, which is the critical value for the alpha we are testing at. In this case we are testing at an alpha of 0.05, so z critical for a two-tailed test is always 1.96. Then we get the delta, the observed improvement, and we build the confidence interval around it. The confidence interval uses the formula we showed, z critical multiplied by the pooled standard error: we add it for the upper bound and subtract it for the lower bound. Once we have all this, the value we are most concerned with is the p-value. In this case the p-value we observed is 0.0007, which is much less than 0.05, so we should reject the null hypothesis and accept the alternative. Cool. I think we are short on time, so let me move ahead. So in this case our p-value was much less than the alpha that we set to 0.05, so we reject the null hypothesis and accept the alternate one. Cool. One more technique I want to discuss is variance reduction. What is variance reduction and why do we need it? For a continuous variable, this is the z-score that we are concerned with. The formula above is for proportions, and this one is for continuous variables, let's say the
session time the user spends, or the amount. Let's take an example of such a variable. Cool, this is the z statistic. After the experiment we want this value to be as high as possible, because if it is very high it will cross the significance threshold and we will be very confident about the result we are getting. So our aim after the experiment is to maximize it. If you look at this formula, the statistic is inversely related to the variance, which sits under the square root in the denominator. Does anyone have any question? Okay, cool. So the z statistic is inversely related to the variance; this is the control variance and this is the treatment variance. So one way to increase the z statistic is to find a way to reduce this variance. How can we reduce the variance of the data we are getting from users? We use the idea that there are some major variables and some minor variables affecting the data. For example, the treatment data might be affected by the actual recommendations you are showing to people on YouTube, but it might also be affected by the screen the person is using: with a much wider screen he can see the content more clearly and has a higher chance to keep listening. This variable, the screen size, is the same for the user before and after the experiment, but it adds some variance to the calculation. So in the CUPED variance reduction technique, we take the same set of users prior to the experiment and after the experiment, and we take the correlation of the metric, basically the CTR prior to the experiment and the CTR post
the experiment. Because this variable, the screen size, is a common variable present both pre- and post-experiment, we can find the correlation between these metrics and use it to reduce the effect of this additional variable. All we are doing here is reducing the variance, which helps us increase the z statistic. In a way, it means we would need a somewhat smaller sample size. Okay. For example, this is a quote I found in one of the papers, where people from Google's experiment platform said they are not satisfied with the amount of traffic they are getting, even at a scale of 10 billion, because they want to measure very small effects and they want to finish an experiment in, say, a week or two; they don't want to extend it. Within that period it is much harder to measure a very small improvement, one which might increase revenue by millions, and even Google is not able to test it. That's why we need this variance reduction technique. The way to do it is to reduce the variance of the treatment and the control. How do we do that? We use the common variables that are affecting the performance: we take the correlation of the metric before and after, and then we subtract that impact from the actual value. There is a mathematical reasoning behind this. We are short on time, so I will skip over it, but I'll share a link to the paper it is from. Basically the idea is that if we represent the equation in this way and take the expectation, these two terms cancel out, and the expectation of the final value is the same as that of Y bar. But if instead of the expectation we take the variance, we get the variance in this form, with theta in it, and
this becomes a linear relationship in terms of y and x, and using ordinary least squares you can find the value of theta that minimizes it. In this case theta is the covariance of Y and X divided by the variance of X, and using it we can reduce the variance of the final output. Let me quickly show you how it works, and we will share the notebook link in the slides so that you can go through it yourself. For example, we are simulating a test with a thousand users, and we assign 50 percent of the users to control and 50 percent to treatment. Let's say the control mean is 50 and the lift we are introducing is five. And let's say that during the experiment some random variables are at play: compared to before the experiment, after the experiment there is some random variation that affects both control and treatment. Let me show you one line specifically. To simulate the post-experiment data we take the prior data; if the user is in the treatment bucket we add the treatment lift, if the user is not in the treatment bucket we don't, and we add random noise to both treatment and control. What this gives us is the following. Prior to the experiment our mean was 50 and our standard deviation was 10, so the variance, the square of the standard deviation, is 100. We added a lift of five, so our treatment metric went up to around 54, and because we also introduced random noise with a standard deviation of five, whose variance is 25, the variance after the experiment is higher than the original 100. But if
once we apply this CUPED adjustment, what we observe is that for the same data the variance has reduced. This means that if we do the same post-experiment analysis on this adjusted data, we get a much higher significance. Cool. Now, getting back to some rules of thumb; so far this was mostly about the analysis. These are rules of thumb observed in the industry after years of experimentation. The first is that shifting clicks is easy. When you run an experiment, remember this: say you are on Amazon and you build a feature specifically for promoting televisions. The home screen has limited area, so you will definitely be taking clicks away from some other product. You can show that with this campaign the sale of TVs has increased, but it might hurt the sale of other equipment, say fridges. So shifting clicks from one item to another is easy, but increasing overall engagement is hard. The second is about speed: if you can finish an experiment within two days, you can run 10 experiments back to back in the time a single 20-day experiment would take. So if you have a large user base and you use variance reduction techniques, you can conclude your experiments faster. The third is to avoid complex designs. The whole idea of A/B testing is to iterate quickly: you have a hypothesis, you test it, you go back to the drawing board and create a new hypothesis. So the best way is to create a simple hypothesis and make a minimal viable change that lets you test it. The fourth is that big wins are rare. You may always target increasing the metric from 50 to 55 or
20, but that is usually not possible. This is a famous quote by Al Pacino in Any Given Sunday. Big wins are rare, so focus on small wins; of course big wins are good, but small wins are equally good. Now, common pitfalls and misconceptions. For example, say you are running an experiment and you decided that you need 10,000 users, and you computed that, since a thousand users come to your platform every day, you will end your experiment after 10 days. We all have this urge to regularly look into the data and see how it is doing. Say we calculate the p-value after every day. Let me show you this graph: it shows how the p-value of an experiment varies with the number of days the experiment has been running. Say you launch the experiment and after two days your p-value is here; you say the p-value is low, let's stop the experiment. If you take a decision at this moment, before getting the complete sample size, there is a much higher chance that you will end up with the wrong conclusion, because the p-value has not matured and it will keep changing; every day you might find some deviation in it. So ideally wait till the end of the experiment to look at the data. If you are keen on looking at the data, just don't take any decision. If on day two you observe a low p-value, don't take a decision; if on day five you observe that the result is significant and better, again don't take any decision. This is the delta and this is the p-value corresponding to it. So again: don't peek, and if you do peek, at least don't make any decision based on it. Let the experiment finish, let the numbers come in. The next thing I want to discuss is some
misconceptions about the definitions of the p-value and the confidence interval. The first misconception is that the p-value is the probability that the null hypothesis is true. People more often than not believe that if the p-value is 0.01, then the null hypothesis has only a one percent chance of being true. This is not the case. The p-value assumes that the null (test) hypothesis is true, and a very low p-value says that the result you got would be very extreme if it came from the null hypothesis. It only says that the data you observed would be very extreme if the null hypothesis were true. The second misconception is about the 95 percent confidence interval; I think we already touched on it. It is believed that if you get a 95 percent confidence interval, there is a 95 percent chance that the true effect lies in that range. That is not true. What it means is that if you run 100 experiments and create confidence intervals for all of them, then for about 95 of those experiments the true value will lie in the confidence interval. We'll get to answering the questions shortly. The third misconception is about alpha and beta: the industry standard is that alpha should be 0.05 and beta should be 0.2, but these are not golden rules. They can and should be changed based on your problem statement. For example, say you are testing a new drug, where the negative impact can be pretty high. If you created a new drug and launched it, and it turns out to be a false positive, it will have a health impact. So in this case we can use a lower false positive rate, say an alpha of 0.01. Cool. For further reading we have mentioned some resources here. Please ask if you have any questions. I'm answering the first question in the chat itself. Thank you. Okay, a link to the A/B
simulator notebook, as we mentioned, I will just share now. All right, guys, I hope you have filled in the feedback poll. If not, I request you to please fill it in, as it helps us conduct more such sessions. If you wish to conduct a webinar or are facing difficulty in registering, connect with us at the link given in the chat section. Also, the recording of today's session will be available in one or two days on our YouTube channel; the link is posted in the chat section, and we'll post it again. Till then, is there anything left to ask or cover? Could you go to the second-last slide? Okay. Any questions? Also, Kapil, if you could share the Jupyter notebook so that we can share it with them. I've shared the link in the chat. All right, so go ahead and play with the simulator. Thank you so much, sir, this was a really insightful session on A/B testing. I'm sure the people who joined have a lot to take away from here. So guys, we'll be back with another session of the DataHour in the next 20 to 25 minutes; the link is in the chat section. Till then, bye bye and keep learning.
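The sample size calculation and the post-experiment two-sample z-test walked through in the session can be sketched as below. This is a hedged illustration rather than the speakers' notebook. The sample size formula is one common variant for two proportions, assumed here because it reproduces the 8,286 figure quoted for a 0.30 baseline, 0.02 absolute delta, alpha 0.05 and beta 0.20. The function names and the click counts (3,000 and 3,200 out of 10,000) are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

_norm = NormalDist()  # standard normal distribution

def sample_size_proportions(base_rate, delta, alpha=0.05, beta=0.20):
    """Per-variation sample size for a two-sided test on proportions (assumed variant)."""
    p1, p2 = base_rate, base_rate + delta
    z_alpha = _norm.inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = _norm.inv_cdf(1 - beta)         # 0.84 for power = 0.8
    root = (z_alpha * sqrt(2 * p1 * (1 - p1))
            + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    # Sample size grows with 1 / delta^2: a 10x smaller delta needs ~100x samples.
    return ceil(root ** 2 / delta ** 2)

def two_proportion_z_test(x1, n1, x2, n2, alpha=0.05):
    """Two-tailed z-test with pooled variance; x = clicks, n = users (1 = control)."""
    p_control, p_treatment = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                          # pooled mean
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    delta = p_treatment - p_control                          # observed lift
    z = (delta - 0) / se_pool                                # null hypothesis: no difference
    p_value = 2 * (1 - _norm.cdf(abs(z)))                    # both tails
    z_crit = _norm.inv_cdf(1 - alpha / 2)
    # As in the talk, the interval uses z critical times the pooled standard error.
    ci = (delta - z_crit * se_pool, delta + z_crit * se_pool)
    return {"delta": delta, "z": z, "p_value": p_value, "ci": ci}

print(sample_size_proportions(0.30, 0.02))  # 8286

result = two_proportion_z_test(x1=3000, n1=10_000, x2=3200, n2=10_000)
print(result)
```

With these made-up counts the p-value falls below 0.05 and the whole confidence interval sits above zero, so the null hypothesis would be rejected. Real experiment data would of course give different numbers.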

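The CUPED variance reduction idea from the session (adjust each user's post-experiment metric using the same user's pre-experiment metric, with theta equal to the covariance of Y and X divided by the variance of X) can be sketched as follows. The setup mirrors the simulation described in the talk (1,000 users, 50/50 split, pre-experiment mean 50 and standard deviation 10, lift 5, noise standard deviation 5), but the code is an assumption-laden illustration, not the speakers' notebook. The names `cuped_adjust` and `group` are hypothetical, and the alternating assignment is a simplification of random bucketing.

```python
import random
from statistics import fmean, variance

def cuped_adjust(post, pre):
    """Return CUPED-adjusted post-experiment values and the fitted theta."""
    mean_pre, mean_post = fmean(pre), fmean(post)
    cov = sum((x - mean_pre) * (y - mean_post)
              for x, y in zip(pre, post)) / (len(pre) - 1)
    theta = cov / variance(pre)               # theta = Cov(Y, X) / Var(X)
    # Subtract the part of each value explained by the pre-experiment metric.
    adjusted = [y - theta * (x - mean_pre) for x, y in zip(pre, post)]
    return adjusted, theta

random.seed(0)
n_users, lift = 1000, 5.0
in_treatment = [i % 2 == 1 for i in range(n_users)]       # simple 50/50 split
pre = [random.gauss(50, 10) for _ in range(n_users)]       # metric before the test
post = [x + random.gauss(0, 5) + (lift if t else 0.0)      # noise for everyone,
        for x, t in zip(pre, in_treatment)]                # lift only in treatment

adjusted, theta = cuped_adjust(post, pre)

def group(values, flag):
    """Values belonging to treatment (flag=True) or control (flag=False)."""
    return [v for v, t in zip(values, in_treatment) if t == flag]

raw_var = variance(group(post, False))
adj_var = variance(group(adjusted, False))
raw_delta = fmean(group(post, True)) - fmean(group(post, False))
adj_delta = fmean(group(adjusted, True)) - fmean(group(adjusted, False))

print(f"theta={theta:.3f}")
print(f"control variance: raw={raw_var:.1f}, adjusted={adj_var:.1f}")
print(f"estimated lift:   raw={raw_delta:.2f}, adjusted={adj_delta:.2f}")
```

The adjustment leaves the overall mean unchanged while removing most of the user-to-user variance that was already present before the experiment, which is why the same lift yields a larger z statistic afterwards.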
Original Description

A/B testing is a crucial, but less talked about component of machine learning deployments, which ensures that we release changes incrementally, to get an approximate estimate of the effects of the changes before you expose a larger audience to its impact. This DataHour, will cover why we need A/B testing, how to perform it, and the math behind it. We will also discuss pitfalls to avoid and some rules of thumb that you can follow. 🔗 More action pack session here: https://datahack.analyticsvidhya.com/contest/all/ Stay on top of your industry by interacting with us on our social channels: Follow us on Instagram: https://www.instagram.com/analytics_vidhya/ Like us on Facebook: https://www.facebook.com/AnalyticsVidhya/ Follow us on Twitter: https://twitter.com/AnalyticsVidhya Follow us on LinkedIn:https://www.linkedin.com/company/analytics-vidhya

This video teaches how to design and analyze A/B tests using statistical significance, confidence intervals, and hypothesis testing, with a focus on machine learning deployments. It covers key concepts such as z-tests, pooled variance, and variance reduction, and provides practical examples and code snippets.

Key Takeaways

Define null and alternative hypotheses
Choose a significance level and sample size
Run A/B test and collect data
Calculate p-value and confidence interval
Interpret results and make decisions
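
As a rough illustration of the steps above, the following Python sketch runs a two-sided z-test on conversion counts from two groups. It uses a pooled proportion for the test statistic and an unpooled standard error for the confidence interval. The function name `two_proportion_ztest`, the example counts, and the 5% significance level are assumptions made for this example and are not taken from the session's notebook.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided z-test for the difference in conversion rates (B minus A).

    H0: both groups share the same conversion rate.
    H1: the conversion rates differ.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    diff = p_b - p_a

    # Under H0 the groups share one rate, so pool them for the standard error.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # The confidence interval does not assume H0, so use the unpooled error.
    se_unpooled = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

    return {
        "diff": diff,
        "z": z,
        "p_value": p_value,
        "ci": ci,
        "reject_h0": p_value < alpha,
    }


# Hypothetical counts: 200/2000 conversions in control, 250/2000 in treatment.
result = two_proportion_ztest(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
print(result)
```

If `reject_h0` is true and the whole confidence interval lies above zero, the data are consistent with a real lift. Whether that lift is large enough to act on remains a product decision.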

💡 A/B testing is a crucial component of machine learning deployments, and understanding statistical significance, confidence intervals, and hypothesis testing is essential for designing and analyzing effective A/B tests.
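
On the design side, a sample size can be estimated before the test starts. The sketch below uses the standard normal-approximation formula for comparing two proportions. The function name `sample_size_per_group`, the 10% baseline rate, the 2 percentage point lift, and the 80% power are illustrative assumptions, not figures from the session.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(baseline_rate, min_detectable_lift, alpha=0.05, power=0.8):
    """Approximate users needed per group for a two-sided two-proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_lift  # absolute lift, not relative

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power

    # Sum of the per-user Bernoulli variances of the two groups.
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


# Hypothetical design: detect a lift from 10% to 12% conversion.
n = sample_size_per_group(baseline_rate=0.10, min_detectable_lift=0.02)
print(n)
```

Halving the detectable lift roughly quadruples the required sample, which is why the minimum effect worth detecting should be fixed before the test starts.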
