Using Synthetic Data for Machine Learning & AI in Python

DataCamp · Intermediate ·📐 ML Fundamentals ·2y ago

Skills: ML Pipelines 80% · Supervised Learning 60%

Key Takeaways

This session covers using synthetic data for machine learning and AI in Python: creating synthetic datasets and assessing their quality for privacy-preserving machine learning, with tools like DataCamp Workspace and Mostly AI.

Full Transcript

Hello everyone, and thank you for joining today's live training. My name is Rhys and I'm going to be your moderator for today's session. We've just been waiting so everyone has a chance to join; in the meantime, let us know where you're watching from using the chat or the comments, depending on your platform, and tell us something you'd like to learn from today's webinar. Two things to note: today we are using DataCamp Workspace for the code-along and Mostly AI for our synthetic data generation, so if you don't have accounts for either of those, please get signed up now. When you log in to Mostly AI to generate the synthetic data, you want to use the AI/ML training topic. I will be posting links in the chat and the comments. You'll have a short period to get sorted at the beginning of the session, but if you want to join in, please get that done as soon as possible. Brilliant, I think that's everything from me. I will hand over now to our host for today's session, Richie. Richie, please take it away.

Hi there, data scamps and data champs, this is Richie. I see we've got a lot of people from around the world; good to see a nice global audience. Today we are looking at how to use synthetic data for artificial intelligence. That means we're making up data, and on the face of it that sounds a little bit like cheating. However, it turns out to be an essential technique for maintaining data privacy, and it's widely used in banking, finance, and healthcare. In fact, working with synthetic data is an important technique to know about for anyone working with sensitive data in machine learning or AI. Our guest today is Alexandra Ebert, the Chief Trust Officer at Mostly AI. She's an expert in data privacy and responsible AI, and she works on public policy issues in the emerging field of synthetic data and ethical AI. She's also the chair of the IEEE synthetic data expert group and the host of the Data Democratization podcast. So all in all, a true expert in the field, and with that I shall pass over to you, Alexandra.

Thank you very much, Richie, for this very warm welcome. Just watching the chat here, I have the feeling I did an entire world tour, so it's cool to see people join from so many regions of the world. So
welcome, and it's really great to see that you're interested in learning more about synthetic data as a privacy protection technology for machine learning and artificial intelligence. Today we're going to cover the questions that already popped up in the chat: what actually is synthetic data, is it cheating, is it helpful for privacy? We will figure it out. First we will have a short introduction, and then we will dive into the hands-on part of today's training. As Rhys already said in the beginning, make sure that you have your Workspace account open and that you sign up to the Mostly AI platform to generate synthetic data yourself. You will also find readily synthesized data sets in the workspace, so if you don't want to sign up or something doesn't work out, you're already set to follow along. That's a good thing to prepare while I'm giving you a quick introduction to synthetic data. Rhys, maybe you can share my slides so that everybody can see. You already know the title; you signed up for this tutorial webinar on using synthetic data for machine learning and AI in Python. Oh sorry, I need to go to the second screen so that I can actually click along. We're going to cover three main parts today. First is the quick introduction, so that everybody is on the same page about what synthetic data actually is, because there are different types of synthetic data. Then, in the hands-on part, we will do two different things. First, we're going to answer one of the most pressing questions whenever data scientists and data folks first start to work with synthetic data: is synthetic data actually truthful, is it accurate, is it as useful as real production data? That's the first thing we are going to do in our workspace on DataCamp. The second thing we're going to look at, as a sort of bonus exercise, is smart imputation with synthetic data, which is quite a nice feature that can help you fill the missing values that you oftentimes
have in real-world data sets. So that's more or less the outline. As mentioned, please make sure that you have the workspace open and that you have signed up to the Mostly AI platform, so that you're ready to go once we are at that part of our tutorial. Now I will give you a quick introduction to synthetic data. One thing that's important for me: this tutorial is really for you, so I want you to ask as many questions as possible, whenever you want. Richie and Rhys will help me to moderate the chat, and whenever urgent questions pop up, let me know. There will also be some dedicated sections throughout the workshop where I can answer your questions, so if you have anything that you're curious about, please let us know in the chat. With that said, why are we actually doing this webinar on synthetic data? Why is it so interesting as a technology? Here I want to share two quotes. One comes from the Joint Research Centre of the European Commission, which looked into synthetic data for over two years and then last year published a report where they concluded that synthetic data is becoming the key enabler for AI in both business and policy applications, in Europe and also in many other parts of the world. Gartner, for example, also states that already next year 60% of all AI training data is not going to be real data but synthetic data. With those quotes we can already guess that it's a quite impactful, quite important, and also quite hyped technology at the moment. But to better understand why synthetic data is needed, I always like to give a little bit of context on why organizations are currently interested in replacing their production data with anonymous synthetic data. The problem that many organizations, particularly large organizations, face is that they have significant amounts of data, significantly more data than they had five, 10, or 15 years ago, but due to stricter privacy regulations and some ethical and regulatory risks that come with using this data, most often
organizations are not in the best place to actually use this data or make it available to the data science teams and AI teams to use for machine learning development or for other parts of data-driven innovation. So we have this kind of data fuel crisis, where there's a mountain of valuable real-world data but organizations only use a tiny fraction of it. And this is actually something that's not going to get easier in the future. Many of you will, I assume, have heard of the European General Data Protection Regulation, in short GDPR, or the California Consumer Privacy Act, or one of the, I think, 120-plus different emerging privacy laws that we have around the globe. This is something that is only going to get more complex. In the European Union alone we have what privacy pros fondly call a tsunami of new regulation that is upcoming: the AI Act, the Data Act, the Data Governance Act. And this is actually a similar picture in many other parts of the world. So particularly for large organizations, but of course also for medium enterprises, startups, and SMEs, it's not going to get easier to use privacy-sensitive real-world data, and this is why synthetic data is so interesting. But once you state this problem, this challenge of having treasure troves of real data but not being able to use the data while complying with privacy laws, oftentimes the question pops up: well, why not just anonymize your data? In theory that's a great suggestion, because those of you who know GDPR and other privacy laws in more detail might be aware that there are specific sections in those laws that explicitly exempt anonymous data. But the problem is that legacy anonymization technologies, referring to masking, obfuscation, and the like, simply don't work in the era of big data anymore. As you can see here on the screen, I think it's pretty obvious that this data is not sufficiently anonymized. Of course, with unstructured data, the image here, it's super easy for us humans to see, because we are visual
creatures, but the same holds true for structured data, like the financial transactions that you can see here. By the way, our tutorial today is focusing on structured synthetic data, so we're not going to create images; we're going to create anonymous synthetic structured data. Here we can definitely say that not enough information was deleted; the privacy is not protected. But what happens if I delete more information? Is this now anonymous? I can't see anybody, so I can't ask for a show of hands, but you will be surprised: this is actually still not anonymous, even though the majority of the value in this data was destroyed, because all of these legacy anonymization technologies are quite destructive in nature. Researchers actually figured out that regardless of how much you delete with traditional anonymization technologies, if there is a tiny bit of real data in the supposedly anonymous data set, then you still have this re-identification risk, meaning the privacy of your customers, your employees, or your citizens is not sufficiently protected, and this of course brings you into conflict with GDPR and other privacy laws. To make this a little bit more concrete, let's look at a specific example. One study found that with credit card transactions, it was sufficient to have three credit card transactions per customer. If you think about your own credit card transactions, then I would say we can assume that every customer of a bank will have at least a few dozen credit card transactions; most often you will of course have hundreds of transactions per customer. If just three out of these hundreds of transactions are sufficient to re-identify over 80% of your customers, then you can see this destructive nature and why traditional anonymization technologies are not fit for purpose in the era of big data anymore and don't give
you the data quality that you as a data scientist want to have to develop your machine learning model, and they also don't give you the privacy. The interesting part about this study was that not even the entire information was needed to re-identify these 80% of customers: just the date of the transaction and the merchant, nothing more. So here you can really see the limitations of traditional anonymization, and this is something that not only holds true for financial transaction data. You can see the same for healthcare data, demographic data, traffic data, and any other type of behavioral or time series data, which is so unique and rich that it's really hard, if not impossible, to anonymize with traditional anonymization technologies. To highlight what this means for organizations: as we already discussed, today organizations don't only have a handful of attributes per customer; they have hundreds, if not thousands or tens of thousands, of attributes per individual customer, at least if you're working in an enterprise context. But with these legacy anonymization technologies you have a hard ceiling: you can't retain more than a handful of attributes before entering into this re-identification risk. This brings up the tension between wanting to utilize data and having to protect privacy, and to solve this tension is why we have AI-generated synthetic data for privacy protection. So now to the interesting question: what is synthetic data, why do we need it, and how can it help with privacy protection? Even though I already said we are going to focus on structured synthetic data (think financial transactions, healthcare data, telecommunication data, mobility data, and the like, everything that fits in a table), I like to explain the concept of AI-generated synthetic data with images, because it's just easier to comprehend. The specific images you see here, as you could have guessed, are not real people, and in the era of DALL·E 2 and Midjourney and so on it's not surprising to
see the stunningly accurate photos that were made by machine learning algorithms. But a few years back (this is already a slightly older study from Nvidia) it was really fascinating to see how good you can get with AI-generated synthetic data. What they did in this research project was to train a deep learning algorithm on plenty of real human photos, up to the point where this algorithm really understood what a human face looks like: something like, okay, humans have two eyes, which are roughly positioned in the middle of the face, a mouth, a nose, these hairstyles, these skin shades, and so on and so forth. Then, once everything was learned about the patterns, the structure, and the correlations in this data set, you could use the deep learning synthetic data generator to create new artificial synthetic images from scratch, like the ones you can see on screen. All of those people have never existed before. And it's not a very simple process where you just take the pair of eyes from training sample A in the data set and the mouth of training sample B in the data set, shuffle them together, and proudly say, okay, this is now my new face. It's really generating and drawing these faces from scratch, based on the statistics and the patterns that were learned from the original training data. This is the same approach that you can use for structured data to make sure that you get a privacy-preserving synthetic replica of your real-world production or customer data, like financial transactions. We at Mostly AI have actually developed a platform that, at the press of two or three buttons, allows you to create a highly accurate, highly statistically representative synthetic replica of your original customer data set that doesn't have any privacy-sensitive information in there. Again, the process is the same as what I just described with the Nvidia example. First, let's imagine a large bank already has a huge data set, let's say 20 million customers and their
financial transactions, but of course this is highly privacy-sensitive data that's not free to use under GDPR and other privacy laws. So if they want to anonymize it with synthetic data and unlock it, to make it freely usable for machine learning, shareable on cloud resources, shareable with startups, and so on and so forth, the process would look like this. First, they have the original training data, and then they put it into a synthetic data generator, our platform in this example. Here the deep learning algorithm that's part of the synthetic data generator is capable of automatically learning all the correlations, the patterns, and the structure of the entire data set. To simplify, the algorithm basically understands how an organization's given customer base acts and behaves. Then, in a completely separate step, once the training and the learning is completed, you can generate an arbitrary number of new synthetic customers and their synthetic financial transactions. If you look at those two data sets, the real-world privacy-sensitive data and the anonymous synthetic data, from a statistical point of view they will be nearly indistinguishable, but the synthetic data will be privacy-safe. So why nearly indistinguishable? Of course, if you want to protect privacy, you will never be able to retain 100% of the information; that's simply not possible from a privacy point of view. But consider the contrast with legacy anonymization technologies, where you stick with the original data and try to delete, distort, and shuffle around those parts of the data that you deem to be re-identifying, let's say a social security number or a last name or something like that, and then end up with what I like to call a Swiss cheese of data: from your, let's say, 200 columns, you only have two, three, or four columns per customer retained. With synthetic data you don't touch the original data. You only learn the patterns, the correlations, and the distributions, and then create a new synthetic data set from scratch where you
again have all the 200 columns populated with accurate synthetic data, but there's no one-to-one relation between any synthetic customer and a real-world customer, which is how you protect the privacy. To make all of this privacy protection happen, it's actually super important that you not only have a powerful deep learning synthetic data generator but also powerful privacy mechanisms in place, which make sure that everything this deep learning algorithm learns is generalizable statistical information and nothing that falls into the realm of personal, privacy-sensitive secrets. So let's, for example, think of a data set where you have, as we said, 20 million customers, and there's one Bill Gates in the data set who has a significantly different amount of income, spending, and so on than all the other ones. This person, for example, wouldn't get included. So the extreme, extreme outliers you wouldn't find in your synthetic data set, to protect the privacy. But in contrast to legacy anonymization technologies, you can retain the majority of the data; you can retain basically the entire distribution in your data set minus the extreme outliers, which is something that gives you significantly better accuracy. To visualize this (of course not a highly scientific graphic, but just for your impression): in contrast to legacy anonymization technologies, where you can only retain this handful of attributes before you enter into severe re-identification and privacy risk, you can finally grow with the amount of data that you collect. Although you will not retain 100% of the information, you will actually get near-perfect data, which is as good as your production data and which many organizations already use today to develop and train their machine learning models, because it is a very powerful replacement for the real data. To sum up what the benefits of synthetic data are: first, I think and I hope it's obvious by now,
it's eliminating the privacy risks. Once it's synthetic, there is no way back to the original data; you can't re-identify your customers if the synthetic data is properly synthesized with all the privacy mechanisms in place. Another benefit of synthesizing data is the speed. We work a lot with large organizations, with some of the largest U.S. banks and some of the largest European banks and insurance providers, and they all tell the same story: if they want to access data, it's something that takes them ages. If they're quick, a data science team sometimes gets the data within a matter of weeks, but much more commonly we hear time spans of three months, six months, or even eight months until they get their data, if they, for example, want to share it externally or put it on the cloud or something like that. This is super cumbersome if they rely on legacy anonymization technologies, because they have to go through this case-by-case anonymization process. With synthetic data, as I mentioned already, it's the click of three buttons, it's fully automated, and it's something that can speed up data access from weeks or months to a matter of a few hours or business days. Some organizations even provide an internal synthetic data lake, hub, or marketplace, however they call it, where teams can proactively access synthetic data without having to go through these lengthy data access processes. So it's really also a tool to democratize access to data and speed up data access. Then accuracy, of course, is another benefit, because if you can retain not only three attributes per customer but all the 200 or 500 or 10,000 attributes that you had, then you can of course be much better at personalizing for your customers, understanding what they really want, and developing tools that cater not only to the average Jane and John Doe but to the full diversity of your customer base. So it's also quite an interesting technology not only to help you personalize and foster customer
understanding but also to be fair and more inclusive, because you can finally also see who the minority groups are in your data set, in your customer base, and what types of services or products they might enjoy. Another benefit, and another reason why many organizations turn to synthetic data, is the collaboration aspect. There are so many large organizations out there that want to collaborate with startups or smaller partners, but of course sharing data externally is oftentimes a challenge. With synthetic data, again, this is something that can be significantly accelerated, to make sure that they are much faster not only in validating but also in collaborating with external partners. Then there are also some other things that you can do with synthetic data, not only to replicate the existing data in a privacy-preserving manner but to actually use the power of generative AI to improve the existing real-world data. One of these examples would be smart imputation, where you fill missing values, but there are many, many more, and I can point you to some of these areas later on when we are in our hands-on tutorial. The last slide we're going to cover is the use cases, what synthetic data is used for, and then we are off to our workspace to get coding and get into the hands-on section. Use-case-wise, you can really think of synthetic data as an enabling technology. The entire purpose of synthetic data is to anonymize data in a manner that still retains the utility of your original data but protects your customers' or your employees' privacy. The main use case that we see for synthetic data today, regardless of whether it's healthcare, banking, insurance, retail, or even the public sector, is machine learning, because no other use case requires this sophistication and granularity of data. But it's also used for analytics, for digital product development, for external data sharing (as I mentioned earlier, collaborating with startups, AI vendors, and so on),
even open data sharing, and increasingly also for responsible AI aspects like AI governance, AI fairness, and explainability. So, one takeaway for you: synthetic data is an enabling technology, and it's super interesting for making machine learning possible in a privacy-preserving manner. With that said, I think, or I hope, you now have a quite good introduction to synthetic data, and maybe we want to take a few questions now if there are some urgent questions to answer. Richie, I'm not sure, I couldn't monitor the chat. And then we can actually enter into the workspace part.

Sure, yeah, so we have a few questions from the audience already, and for anyone else who wants to ask a question, please do that now. The first question comes from Karan: could you encrypt data on a bit level to anonymize it instead of creating synthetic data?

So basically the GDPR, for example, is quite clear that some things like pseudonymization or encryption don't count as anonymization. As long as there's a possibility to come back to the original data, it doesn't count as anonymized and out of scope of GDPR; therefore the answer in that case would be no. Of course, maybe there are some other jurisdictions where there's a different answer, but with GDPR, which is kind of the strictest one, it's not a possibility.

I suppose once you start publishing the results, you don't want to publish something encrypted; you're going to want to publish a real number as well.

That would be another thing. And I think there is one interesting aspect: maybe some of you are familiar with the whole set of emerging privacy-enhancing technologies, such as homomorphic encryption, secure multi-party computation, and, in some contexts, federated learning. One benefit of synthetic data is that the output is something that can not only be digested by machines but can also be interpreted, analyzed, and used by humans. I think that's a super important part, not only for the usability but also from a fairness
and responsible AI perspective: it really makes sense to have humans be able to take a look at the data and, for example, check whether it's an adequate representation from a diversity point of view.

All right, fantastic, that's thoroughly answered. The next question comes from Dipanka: how do we decide which model is best for a particular data set and business use case? For example, for a single-file structured data set you could use copulas.

That's a great question. I think this is particularly interesting if you want to build something yourself or make use of the open source tools; for example, the Synthetic Data Vault is one of the most commonly used open source libraries, where you have different models available. With tools like Mostly AI you already have a variety of different models inside the product, and it automatically picks the best model depending on the data structure. For example, time series data needs some other elements than if you only had static data. This is the nice thing about tools like ours: this works completely out of the box. I also quoted the Joint Research Centre of the European Commission earlier; they actually compared open source solutions with our solution, and they found that with open source solutions, at least to date, there's quite some pre-processing that needs to go into the data. I think they figured it was two person-months versus, again, the click of a few buttons, and you have data that's super close to your real data. So you can build it yourself, but I think it would also be interesting to try out tools like Mostly AI.

All right, we'll take maybe one more question.

I would say one more question, and then let's synthesize some data and dive into it.

Yeah, exactly. All right, so what's the difference between synthetic data and scrambled data?

So basically with scrambled data, my understanding would be that you just take your data set, make this shuffling move, and then have some nicely shuffled
around data. With synthetic data, as I mentioned, it's not related to legacy anonymization technologies, where you either shuffle around data or delete some parts of the data but always still have the real data. With synthetic data, you learn the patterns and the structure of the original data in a very granular way and then create new data from scratch. Think back to the images: if you were a super skilled artist and you learned how to draw a realistic-looking human hand, you would be capable of drawing something where every human being will tell you yes, this is a human hand, but it will not have existed before. This is the same thing that you can do with structured data. You really create new data points from scratch, which are statistically representative, but you don't take the original data and just shuffle it around. This also brings another benefit, which is that you're not tied to your original input. We have customers who, for example, upload 2 million, 5 million, or 10 million customers, and they maybe create 10 million customers, or just 100,000, or just 100,000 college students if they want to do a certain analysis or product development on them. So you're not tied to your original input size; you can really scale this up or scale this down, because you don't use the existing data pieces. So there are quite some differences from legacy anonymization and from shuffling and scrambling around data.

Perfect. I would say we go to our workspace and get our data set. I hope that everybody managed to... oh, we are still on the slides; I think I need to share my browser so that you can see something. So here we go. Perfect, let me make this a little bit bigger; I hope you can see it now. This is the workspace that we created for today's webinar, and the first thing we are actually going to do is download our data set. Today we are working with the Adult census data set; I think many of you already know this data set, stemming from the U.S.
census, and what we want to download now is our training sample. As you can see here, we've already split the data set into a holdout sample and a training sample. What we also prepared, in case something doesn't work for you during synthesis or you don't want to go to the platform, are readily synthesized data sets that we're going to use for the first part and the second part of the tutorial. But for now the only thing that you need to do is download the file and go to our platform. If you managed to log in successfully, you should see a screen that's quite similar to this one (let me know if it's not big enough; I can definitely zoom in), and you can then synthesize your data set. As I mentioned, it's just a three-click process. You basically drag your file here. You can of course also use other data sources, but for ease of use we just use the CSV file. You don't need to change anything: it automatically detects what variables and what data types we have in here, and in the background it automatically applies all the right configurations to make sure that you get the best accuracy while having strong anonymization. So we're basically launching the job, and it tells me it's possible; this is on the free version.

Could you increase the text size a little bit, just for people on small screens?

Yes, of course. I think now we can't see that it's in progress, but basically what I just did is upload the file and then press the launch button, which we can't see here anymore because I already launched the job. I'm wondering now... okay, this shouldn't happen, so I need to work with the small view, but you're not missing anything here. Basically what's currently happening is that we see the pending status. Ah, now it's reacting and putting this in here; now I should be able to zoom. This is now going to take, I would say, two or three minutes, something like that, so while we're waiting we can maybe take one or two more questions
or I can also walk you through... maybe, just in the interest of time, I will walk you through what we're going to do with our synthetic data set once we have it. So let's give the platform a little bit of time to synthesize all of your data sets. Basically, once we have the synthetic data set, we're going to download it, and then we want to answer one of the most pressing questions that everybody has when they start out with synthetic data: is the synthetic data actually accurate? Is it accurately reflecting the patterns, the structure, the correlations that I have in my real-world data? Because if it's not truthful, if it's not accurate, if it's not high quality, then of course you wouldn't be able to use it for analysis, for machine learning, and so on. One way that organizations very commonly evaluate synthetic data quality is to perform "train synthetic, test real". This is an approach that has been referred to in multiple papers, and here you can see the setup. I'm going to zoom in a little bit; I hope it's then large enough. Remember, we already have our census data set split. Our holdout data set is not touched by the platform, where we just uploaded the training sample to synthesize, and it is also not going to be touched later in the machine learning process, except to evaluate. So basically we have our holdout sample, which we keep for later, and we have our training sample, the one that we just downloaded and put into our platform to anonymize. Then, as mentioned, in I hope two minutes or something like that, we should get our synthetic data set, and then we want to know how good the synthetic data set is: how accurately is it reflecting everything that exists in the actual training data set? To figure this out, we are going to train two machine learning models, one on our actual training data and one on our synthetic data, and then we want to see how well those models perform by scoring them
on the actual holdout data, so data that was seen neither by this model nor by that model, nor during the synthetic data generation process. So that's the setup of train synthetic, test real, and you can find additional information in this notebook. We already went through this process, so let's see how we are doing. Okay, this looks a little bit slow currently, maybe because we have so many people on the platform. I think I will proceed, because usually this should already be much further along, but sometimes if you have so many people on the platform at once it gets a little bit slow. So I'm going to show you how to download the data once you have it. At this point in time, if your job isn't finished (maybe you're luckier than I am and your job is already finished), you can definitely proceed with the data that we have already pre-loaded in the workspace. I will show you that here as well. Basically, once your job has finished successfully, you just have to press this button and click to download your CSV file; then you need to unzip it, and to make it work in the workspace you also need to rename it. Just to demonstrate: what you can also see on the platform is an overview of the accuracy of the data set. You can see a first kind of QA report where you can assess the quality of your synthetic data. It's just the very first sanity check of how close you are. Here on the left is our original training data; on the right is the synthetic data set that we just created. Here we have the overall correlation matrix, where you can see that, just from a visual point of view, it quite accurately captured the correlations that we have in the data set. You can also look into the univariate distributions, and here (I hope it's big enough) you can hardly see the gray line which represents the real data, because the synthetic data matches it so accurately. You can also do this for the bivariate
distributions. You can also look into some privacy checks, but these are just on top, because the actual privacy protection already happens during and prior to the generation process; these are just some distance measures to be extra sure that the result is anonymous. But of course all of this, if you're a data scientist, is just the very first sanity check. It's much more interesting to do the train synthetic, test real evaluation that I explained earlier, because machine learning is so sensitive to all the deep underlying patterns that were captured. So let's open the data set that we downloaded. I'm just going to demonstrate it here. Let's see... okay, it's currently in progress but not finished. I want us to have enough time for questions and also for the smart imputation, so I'm just briefly showing you what to do. Basically you just unzip this file. I'm doing this now on my second screen; you can't see it, but it doesn't matter, because if you haven't finished your job yet you can definitely proceed with the data set we've pre-loaded in the workspace. So there are basically two things to do now. If you're one of the lucky ones who already have a finished synthetic data set, you only need to unzip it and then rename it, because to make our notebook work we want to make sure that it has the synthetic data set in there. So basically now you have renamed this file, and if you have your synthetic data set ready, you would just go here to upload into folder and upload it into the DataCamp folder that we created for today's tutorial. If you don't have your synthetic data set yet, this is absolutely no problem, because as mentioned we've also pre-loaded it. Here you will find the census synthetic demo data set, and you can just proceed by renaming this one. I'm not going to do that because I just uploaded a synthetic data set, but if you just delete the demo part and have census synthetic.csv then you can
proceed with this workbook. So now we have our real data set there, our synthetic data set there, and our holdout data set here, and this brings us to first importing our data. One thing that you can do with synthetic data is just explore it, and here you can already see from the structure that it has the identical structure to the real-world production data. So in contrast to legacy anonymization, you don't get this Swiss cheese of a data set where you have all these gaps and holes and can only get, I don't know, two, three, five attributes; you get all the attributes that are available in the data set, and all of this information is filled out. You can just generate random samples and already get a feeling for the data, upload it to exploration tools, analyze it, and so on. You can also make use of the nice AI feature that we have here and, for example, let the AI show you a random sample, in that case women of age 30 or younger who have a master's degree or are a professor, or something like that; whatever you want to see from your synthetic samples. This is, by the way, something that's quite interesting not only for data analysts but also for product development teams, who sometimes want to have super realistic data to populate their products with. So this is something that you can do, or you can even plot something. What I prepared here is a plot where we can see the average age depending on the marital status and the gender; we want to sort this from lowest to highest and also label the average age. Let's see if this works. It jumps around a little bit; I think it takes a little while... here it is, wonderful, here we have the plot. Maybe we ask it to color it a little bit less stereotypically, but it's already super nice. You can see the average age of people that never married, that are separated, that are widowed, and so on. Usually people that are widowed are already older, and it's basically
possible for you to explore the data on a quite granular level, look into individual subjects or individual records, or of course also look at the overall statistics and, for example, get things like the average age of widowed people versus married people versus whatever. So that's quite nice, particularly with this AI tool that you have here in Workspace. But this was just a first exploration; what we wanted to do was our machine learning. This is already prepared: we are basically using a small and fast LightGBM classifier. What we want to do here with the adult census data set (again, I'm assuming that most of you know it): we have different attributes in this data set, and one attribute is the binary column income, which is our target, where you can see whether the records have a lower income of below fifty thousand dollars per year or above fifty thousand dollars a year. What we now want to do is let our model predict whether a person is a high earner and earns above 50k. So this is basically the code. I'm just scrolling through here, because today it's not about creating the model but about figuring out whether synthetic data is good or not. Then we're going to do the first part of what I explained earlier: we're going to train this LightGBM model on our real data and test it with the real holdout data. So if you scroll here, and maybe zoom in a little bit so that it gets larger... basically we can now train our model on the real data, which we also used to synthesize data, and then we want to evaluate our trained model on the holdout data. You can definitely follow along. I hope everybody is with me so far and we didn't have any problems; if there were some big problems, I assume Richie would have warned me. So what we're going to do: we train the model on the real data, and we want to evaluate our model, with the code that was visible briefly above, with the holdout data to
score and evaluate the area under... oh sorry, what did I do wrong? I think it's model TRN instead of model train, thank you. So talking and coding is a challenge. Okay, so now we have our accuracy, we have our area under the curve, and we also have the plot, for which you can find the code above. Just a brief explanation of this plot. I would say, from the area under the curve and the accuracy, these are quite nice numbers, but of course the main thing that interests us here is comparing the accuracy and the area under the curve with those of the model that we're going to train on synthetic data in a bit. What we can see here is all the holdout records that were scored, and how they were classified versus what they actually were. The closer to zero, the more certain our model is that this is a low-income person; the closer to one, the more certain the model is that this must be a high-income person. And here you can see what the holdout records actually were. Here, with the high-income persons, the model was super sure that they are high income, and we can see from the orange color that they actually also were high-income individuals, so they're classified quite nicely. Also here on the other end of the spectrum, with the low-income folks, we see that the majority of real holdout records it classified as low income actually were low income, which is also good to see; just a few high-income records were misclassified as low income. And here in the middle we can see more or less a 50-50 split. This is just to visualize the performance of the model a little bit, but the interesting part of course is: can we meet these numbers with our model that was trained on synthetic data? We of course haven't trained it yet, so let's do that. Scrolling up a little bit... okay, so we're going to do the same thing again for our synthetic model: we want to train the model on synthetic data this time
and we want to evaluate the model again on our holdout data set. Execute that, and here we can again see the scores. I'm not sure about you, but I definitely didn't remember the numbers, so let's quickly scroll up a bit. Our accuracy is in that case 88.6 and here we have 89.1, so we can say that's pretty close, that's quite nice, and our area under the curve is 93.7 and here we have 93.8, so we can say that the performance is basically on par, or super close. So that's already the first part of the tutorial, and the main thing I wanted to show you here is a valuable approach to evaluating synthetic data quality. Of course this becomes much more interesting if you test it with one of your real-world data sets, or with a data set that you work with a lot, because then you can see how close you can actually get with synthetic data. So that's just the approach, and you can definitely recreate this with your own data sets. But I would say we continue with our next part by synthesizing the smart imputation data set, because this again will take two or three minutes, hopefully, and in the meantime I'm going to take questions. Just as a transition: the second part I want to show you is how to close gaps in your data set. What we did with the adult census data set that you already used and that we already uploaded to the platform: we semi-randomly introduced some missing values, to reflect something that we quite often see in real-world scenarios, where you have a data set but unfortunately there are missing values in it. This is something that on the one hand can impact you when training a model, because you need to prepare the model to make sure that it can handle missing values, but it's of course also something that you always have to take into account when doing analysis, because it's hard to know which customer segments, for example, didn't provide certain information, be this, I don't know, many different
attributes. Or, one thing that we also know from organizations is that one starts to capture a new attribute, but they of course have ten years or five years or whatever of customer data history; then some part of the customer base will have this information and others will not, or all will have this type of information only from a certain point in time. So smart imputation is something that can help you fill this in, and in contrast to more basic imputation methods, you can get much closer to the ground truth of the data. I'm going to explain this in more detail in a bit, but I just want to get the synthesization started so that we can continue, and then I'm going to take questions, both on the smart imputation, if you already have some, and on the train synthetic, test real accuracy evaluation that I just showed. What we need to do to make the smart imputation happen is basically just change one tiny bit in our platform. We are again going to upload the data set that we already used for the first synthesization run, and then you can see here all the different settings; already when you upload the data set you are in your data settings. Sorry, I didn't mention that the column where we introduced the missing values was the last column in the data set, the age column. We also reordered the columns to make age the last column of the data set, because currently that's the only way you can use it on our platform. In the future it won't matter in which order the columns are, and you can be more flexible in smartly imputing the data. But here, with this data set, you have some missing values in the age column, and what we want to do now is select this column, activate the smart imputation, press save, and again launch our job. In the meantime our first synthesization has also finished, I hope for you too, so you can also perform this with your synthesized data set later, if you currently just use the prepared synthetic demo data set, to
see if there are some differences in the predictive performance of the models. But basically now we're going to synthesize the same data set, and we task the generator to fill out all the missing values that we have in the age column. So I'll let the generator do its job and will take a few questions now. Maybe, Richie, you can shoot a few at me until we have our synthetic data. We have absolutely loads of questions from the audience, and we're not going to get through them all. Okay, maybe just as a kind of advance note: I'm super happy to answer your questions, so you can follow me on LinkedIn, send me a LinkedIn message, reach out, or just go on the MOSTLY AI website and put your question in the chatbot there; you will definitely get an answer. Since we only have nine minutes left and can't answer everything, I think I will answer two questions now and then proceed with the smart imputation. All right, perfect. So the next question comes from Naz: how can AI vendors differentiate themselves if they use the same inputs? Basically, how do you create different outputs given the same inputs? I mean, in that case the input would be different, because synthetic data in this scenario is not generated out of thin air, where you just tell a generator "please"... or maybe let me start differently. If you use Midjourney to generate your marketing images, it's quite likely that everything is going to look more or less the same, will have the same style, and that things are going to get repetitive. In this context of generative AI and synthetic data, what you do is not create synthetic data out of thin air; you use your proprietary existing data as an organization, which means that the synthetic data you work with is specific to your organization, and the results you get from analysis are specific to your customer base. Therefore I think it's not that much a question about differentiation with different input, because it's just
the data that you already collected, so I would say this is not that r

Original Description

80% of AI projects fail, and more don't even start due to privacy constraints. This is where AI-generated synthetic data comes in. It's an anonymization technology seen as the key enabler for artificial intelligence. Rewatch this training to discover what synthetic data is, how it protects privacy, and how it's being used to accelerate AI adoption in banking, healthcare, and many other industries. You will create a highly representative synthetic dataset yourself, learn how to assess its quality and use it for privacy-preserving machine learning. And as a bonus exercise, we'll look into smart imputation with synthetic data to save you time on data pre-processing! Key Takeaways: - Learn when synthetic data can be helpful for protecting privacy. - Learn how to create synthetic datasets. - Learn how to assess the quality of synthetic datasets. Code along with Alexandra on DataCamp Workspace: https://bit.ly/3CYIsgv Generate synthetic data using MOSTLY AI - Use the ‘AI/ML training’ set: https://synthetic.mostly.ai/ Explore the rest of DataCamp's Webinars and Live Trainings at https://www.datacamp.com/resources/webinars
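
The "train synthetic, test real" evaluation from the training can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's code: the session trains a LightGBM classifier on the adult census CSV files, whereas this sketch uses only the Python standard library so that it runs anywhere. The toy data, the `naive_synthesize` generator (which is not how MOSTLY AI generates data), and the nearest-centroid scorer are all hypothetical stand-ins.

```python
import random
import statistics

def accuracy(y_true, scores, threshold=0.5):
    """Share of records whose thresholded score matches the true label."""
    hits = sum(int(s >= threshold) == y for y, s in zip(y_true, scores))
    return hits / len(y_true)

def auc(y_true, scores):
    """ROC AUC as the chance that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def make_real_data(n, rng):
    """Toy 'real' records: (age, hours_per_week) plus a binary high-income label."""
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.25 else 0
        age = rng.gauss(45 if label else 35, 10)
        hours = rng.gauss(46 if label else 38, 8)
        X.append((age, hours))
        y.append(label)
    return X, y

def naive_synthesize(X, y, n, rng):
    """Toy generator (an assumption, NOT the platform's method): per-class Gaussians."""
    stats = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        stats[label] = [(statistics.mean(c), statistics.stdev(c)) for c in zip(*rows)]
    share_pos = sum(y) / len(y)
    Xs, ys = [], []
    for _ in range(n):
        label = 1 if rng.random() < share_pos else 0
        Xs.append(tuple(rng.gauss(m, s) for m, s in stats[label]))
        ys.append(label)
    return Xs, ys

def fit_centroid_model(X, y):
    """Stand-in classifier: score by relative distance to the two class centroids."""
    cents = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        cents[label] = [statistics.mean(c) for c in zip(*rows)]

    def dist(x, c):
        return sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5

    def score(x):
        d0, d1 = dist(x, cents[0]), dist(x, cents[1])
        return 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)

    return score

def evaluate_on_holdout(train, holdout):
    """Fit on `train`, then score the untouched holdout records."""
    model = fit_centroid_model(*train)
    scores = [model(x) for x in holdout[0]]
    return {"accuracy": accuracy(holdout[1], scores), "auc": auc(holdout[1], scores)}

def run_tstr(seed=0):
    rng = random.Random(seed)
    train = make_real_data(2000, rng)      # real training sample
    holdout = make_real_data(500, rng)     # never seen by models or generator
    synthetic = naive_synthesize(*train, 2000, rng)
    return {
        "real": evaluate_on_holdout(train, holdout),
        "synthetic": evaluate_on_holdout(synthetic, holdout),
    }

if __name__ == "__main__":
    for name, metrics in run_tstr().items():
        print(name, {k: round(v, 3) for k, v in metrics.items()})
```

The point of the design is that both models are scored on the same holdout records, so any gap between the two rows of metrics can be attributed to the training data rather than to the test set.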

Learn how to create and use synthetic data for machine learning and AI in Python, protecting privacy and accelerating AI adoption in various industries. Create a synthetic dataset, assess its quality, and use it for privacy-preserving machine learning.

Key Takeaways

Discover what synthetic data is and its benefits
Create a synthetic dataset using DataCamp Workspace and Mostly AI
Assess the quality of the synthetic dataset
Use the synthetic dataset for privacy-preserving machine learning
Explore smart imputation with synthetic data for efficient data pre-processing
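
To illustrate why the training contrasts smart imputation with more basic methods, here is a small sketch that compares filling gaps in an age column with the overall mean against filling them conditionally on another attribute. Everything in it is an assumption for illustration: the toy records, the column names, the 30% missing rate, and the `impute_conditional` helper, which is only a simple stand-in for the idea of using the other columns and is not the platform's smart imputation.

```python
import random
import statistics

def make_ages(n, rng):
    """Toy records: marital status plus age, where widowed people skew older."""
    means = {"never-married": 28, "married": 43, "widowed": 62}
    rows = []
    for _ in range(n):
        status = rng.choice(list(means))
        rows.append({"marital_status": status, "age": rng.gauss(means[status], 8)})
    return rows

def knock_out(rows, share, rng):
    """Return a copy with roughly `share` of the ages replaced by None."""
    return [dict(r, age=None) if rng.random() < share else dict(r) for r in rows]

def impute_mean(rows):
    """Basic imputation: every gap gets the overall mean age."""
    mean_age = statistics.mean(r["age"] for r in rows if r["age"] is not None)
    return [dict(r, age=mean_age) if r["age"] is None else dict(r) for r in rows]

def impute_conditional(rows, rng):
    """Fill each gap with an observed age drawn from records with the same status."""
    pools = {}
    for r in rows:
        if r["age"] is not None:
            pools.setdefault(r["marital_status"], []).append(r["age"])
    filled = []
    for r in rows:
        if r["age"] is None:
            filled.append(dict(r, age=rng.choice(pools[r["marital_status"]])))
        else:
            filled.append(dict(r))
    return filled

def widowed_mean(rows):
    """Average age of the widowed group, the statistic used for comparison."""
    return statistics.mean(r["age"] for r in rows if r["marital_status"] == "widowed")

def compare(seed=0):
    rng = random.Random(seed)
    truth = make_ages(3000, rng)
    gappy = knock_out(truth, 0.3, rng)
    return {
        "truth": widowed_mean(truth),
        "mean_imputed": widowed_mean(impute_mean(gappy)),
        "conditional": widowed_mean(impute_conditional(gappy, rng)),
    }

if __name__ == "__main__":
    for name, value in compare().items():
        print(f"{name}: {value:.1f}")
```

In this toy setup, mean imputation pulls the average age of widowed people toward the overall mean, while the conditional fill stays close to the ground truth, which is the kind of difference the session describes.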

💡 Synthetic data can be a key enabler for artificial intelligence, overcoming privacy constraints and accelerating AI adoption in industries like banking and healthcare.
