The Power of Synthetic Data | Data Brew | Episode 38

Databricks · Intermediate ·🛠️ AI Tools & Apps ·1y ago

Key Takeaways

The Power of Synthetic Data by Databricks explores how synthetic data transforms AI and ML, improving data access, quality, privacy, and model training, leveraging tools like Gretel AI.

Full Transcript

Welcome to Data Brew by Databricks with Denny and Brooke. The series allows us to explore the various topics in the data and AI community. Whether we're talking about data engineering or data science, we're going to interview subject matter experts to dive deeper into these topics while we're enjoying our morning brew. My name is Denny Lee, and I'm a principal developer advocate here at Databricks and one half of Data Brew.

And hello, I'm Brooke Wenig, director of the machine learning practice at Databricks and the other half of Data Brew. Today I'm thrilled to introduce Yev Meyer, Chief Scientist at Gretel, the synthetic data platform purpose-built for generative AI. Welcome, Yev.

Hi, it's great to be here.

Thank you for joining us. A quick disclaimer: Databricks is both a partner and a customer of Gretel, but we'll chat a bit more about Gretel later on in this episode. To kick it off: your background is in computational neuroscience, and you were working with fruit flies. From that experience working with fruit flies, what are some of the corollaries to modern data science experimentation?

That's a great question. So yes, by background I'm a computational neuroscientist and electrical engineer. During my graduate work I was studying how real neural networks process information, how they encode that information in the spike domain, meaning using action potentials, using electrical signals that are sent down axons to dendrites of other neurons. We were specifically studying dendritic processing in the olfactory system of Drosophila melanogaster, also known as the fruit fly. Looking back, one of the main things that was key to a lot of the discoveries we made was being able to run experiments quickly. One of the reasons we actually worked with the fruit fly is that it's an extremely well-studied organism. The genome of the fruit fly has been known for a while, but more importantly, they have a gestation period of about two weeks. So when I was in grad school, I was
very happy that I didn't have to work with mice. There were people sitting literally at the other bench preparing mice for experiments, and the trouble with using mice is that if you lose an animal, it takes a lot of time to get another animal to be able to do experiments. What fruit flies gave us is the ability to essentially move much faster, and we were able to design fruit flies for our experiments. We were collaborating with the lab of Richard Axel. Richard Axel is a Nobel laureate; he got the Nobel Prize in 2004 for essentially discovering how different receptors were encoded in the DNA of mice, explaining how the sense of smell, olfaction, works. With their help we were able to design fruit flies for our experiments, so we would be able to express GFP, green fluorescent protein, in specific neurons, go after those neurons, and record their activity. Once you have that data, you can start building models to try to understand what kind of processing is happening: how is analog information present in the real world encoded in these systems? What that taught me early on is that, even in traditional AI/ML applications, it's really important to be able to experiment very quickly. I think we'll touch on this a little bit more, but really key to that is the data. We haven't really been able to experiment with data in the AI/ML field. Most people are used to experimenting with architectures, with different configuration parameters, with training parameters, and I think we're just starting to enter the phase where folks will be able to experiment meaningfully with data, to design data for their experiments, very much how we designed fruit flies, entire organisms, to be able to do the experiments we wanted to do.

I find it funny, because you're saying that it's got a gestation period of two weeks, but right now, if you were to tell any data scientist they have to
wait two weeks before they can go run their models, they probably would literally have an aneurysm in front of you right there.

Yeah, but the reality is that people actually have to wait a whole lot longer to get access to data. If you're working in a heavily regulated industry, if you're in healthcare or in finance or somewhere else, there are a lot of hoops you have to jump through: everything from legal to admins who might give you access to the data that you want. And even when you get access to that data, it's usually extremely messy, so it takes a lot of time to get to the data that you need. So actually, that two-week period might be better than what a lot of people observe today in AI/ML.

And talking about the data: you're talking about fruit flies, and I'm sure even that had a lot of data. In this day and age, in terms of the sheer amount of data, I take it we can no longer get away with, hey, we'll just chuck everything on our laptops, right?

Yeah, that's right. At the time we used pretty big hard drives to be able to record all of that, because the frequency with which you sample that data is really important, and just the sheer number of experiments we were doing was huge.

Cool. So one of the things that I'm curious about, which is sort of related: when we talk about data processing for data science and AI, it seems that the focus is all about acquiring more GPUs, more GPUs, more GPUs. That seems to have calmed down a little bit more recently. Right now it's January 2025, so I'm just providing context to our audience of when we're having this conversation. Earlier last year it was all about the GPUs; now people are recognizing, maybe I don't need $8 million worth of GPUs right now. But it seems like, just as you're implying, AI teams are not just GPU poor, they seem to be data poor. Am I
correctly making that presumption here?

That's right. We actually called this out last year, because there was so much emphasis, so much focus, on GPUs, and rightfully so. It was very hard to get access to A100s and H100s; there's huge demand in the market. At the same time, though, not that most people didn't realize it, I think most people just got used to it. Having worked in this field for a while, I think what people became accustomed to is having to wrangle with data quite a bit. They got used to the fact that real-world data is extraordinarily messy. That is the case for any company, be it a startup or a big company. There are bugs being pushed into production all the time; these bugs get fixed; there are schema changes; there are impurities being introduced into data. So people spend a considerable amount of effort cleaning that data. When it comes down to training models themselves, especially on enterprise data, the reality is that people are data poor because of these issues with the quality of data. Even if you spend a lot of time cleaning the data, there's still no guarantee that you have truly high-quality data. There might be a lot of data gaps that you somehow need to cover for. There might be a lot of biases that then manifest themselves in the ML models that are being trained. And this is really the best-case scenario: the fact that you have data and you can wrangle with it. The other scenario is in some sense even worse: you know that you have data, but you can't get access to it because of compliance, because you're in a different [unclear], because the data is siloed. And there's also the use case we haven't discussed, when there's no data to speak of. That happens very often when you work on a new product: there is, let's say, no telemetry coming from the existing product. Maybe it's an entirely new feature you're building, zero to one. How do you get access to the
data? How do you generate it? So much emphasis has been placed on GPUs, but we do strongly believe that a lot of value unlock will happen from people becoming less data poor as opposed to less GPU poor. Very much related to this, I think, is the shift we're seeing in real time to small language models. Between access to compute improving and the models getting smaller, it is becoming a whole lot easier to train, to fine-tune a model, to customize it to your task. So really it comes down to data: do you have access to the right data? What we help solve with Gretel is that data access problem, removing these data bottlenecks for a lot of the teams that may be struggling.

And so, for companies that still find themselves constrained by either the quantity or the quality of data at the end of all of the data preprocessing, how do synthetic data and other techniques like that help to either augment or even entirely replace human-generated data?

Yeah, that's a great question. I don't think we're necessarily advocating for replacing human-generated data. There's been a lot of discussion in the space about model collapse. We don't think the model collapse worries are justified, because in practice, in the real world, when people work with real and synthetic data, we're talking about an accumulation scenario. You accumulate data; you don't outright replace all the data that you have. You also don't replace it blindly. You usually try to augment the data that you have, to fill in the data gaps, because as a subject matter expert, as a practitioner, you typically know where the model fails, where it's not able to do well. So what you want to do is bring in additional data that would help that model, and of course you would be using both real-world data and synthetic data. Now, the reason I think
enterprises, and businesses in general, are finally waking up to synthetic data, or at least those that are following are, is that you see extraordinary results with synthetic data from some of the best LLM providers out there. Look at the folks from Microsoft and the Phi series of models. The Phi-4 report landed recently, but even going back to Phi-3 and the initial paper, "Textbooks Are All You Need," there is a real-world demonstration of how valuable, really invaluable, synthetic data is. Just to name a few others real quick: it's not just the team from Microsoft, it's also teams from Cohere. They work on the Aya series of models, which are multilingual models, where they really combine synthetic data with machine translation to develop one of the most versatile language models. It's able to speak so many languages because synthetics were used to fill in the data gaps in real-world data. Same thing with respect to the Gemini series of models, very much emphasizing the importance of synthetic data. And more recently, and I think we're lucky to be speaking in 2025, there's everything that came out from OpenAI with respect to o1 and o3. If you look closely at the reports, and at deliberative alignment, there's very heavy use of synthetic data, highlighting the fact that you can't actually do this with human labelers; you have to use synthetic data to align the model really well. So really there is a ton of opportunity with respect to synthetics, and specifically developing high-quality synthetic data. Maybe connecting this to some of the other conversations happening, with Andrej Karpathy and also Ilya Sutskever: I think the difference with synthetics is that it gives you the ability to generate data that is not easily available on the internet. Karpathy famously said internet data is not the data that you want for your Transformer. And linking this to the Phi-4 report, you know,
one instance of this is if you think about collecting data from the internet. Let's say you try to develop a model that's really good at reasoning, and you go after math datasets. What happens a lot on the internet is that you have a mathematics problem: you have a question and you have an answer. And yes, if you train a model with that, the model will pick up on a lot of the patterns. It will be able to answer those questions and a lot of questions that are very similar. But what would be much more meaningful to the model is if you didn't skip all the intermediary steps. The actual human, when solving that problem, went through a lot of steps, and those who are really good at it might even skip them. Really, what you want for your Transformer is all the details in between, and given the attention mechanism, Transformers would just eat up that data, like a sponge. I think that's what Andrej Karpathy is hitting on, and in the Phi-4 technical report that team specifically calls this out as spoon-feeding the model. Imagine spoon-feeding a baby: you want to be able to develop high-quality synthetic data where you actually don't skip those steps.

Oh, this is really helpful, really insightful. I'm going to pretend I'm somebody who doesn't know as much about large language models, because that's true. First things first: to provide a little context to our audience, we've talked about synthetic data so much, but how do you even generate it? What's the basis of it in the first place? I think we can all agree synthetic data is data that we generated, but what does that mean when you say you're generating data, and how does that compare to real-life data, from the most basic concepts, given the fact that it's not, quote unquote, real data?

Yeah, it's a great question. I think
there are many different ways to go about it, and many of us in the field have tried generating synthetic or artificial data in the past. Before we even dive into that, something I want to call out, just to make the point very clear early on: when we say synthetic data, what we absolutely do not mean is fake data. This is not fake data. The analogy we like to use is comparing synthetic data to synthetic oil. At this point it is common knowledge, though we're transitioning away from internal combustion engines, that in an internal combustion engine it is really important to be using synthetic oil. It's not fake oil. It is oil that is purpose-built to help protect the engine, and it does so much better than organic oil, because it has fewer impurities in it and because it's designed to withstand the high temperature and high wear that is seen. So for those who are listening and maybe not familiar with synthetic data: we don't mean fake data. Yes, it is artificially generated data, but it is grounded in real-world data, real-world assumptions, real-world interactions. What many practitioners have done in the past is essentially apply, let's say, rule-based generation, where if you're a subject matter expert in your business, you're able to generate certain sequences or even pieces of text. Obviously GenAI has changed a lot of this, but even before GenAI, people were using NLP models to generate text as well. I would say today, when we're talking about generating high-quality synthetic data, what we're not talking about is just hitting a single LLM. We're not saying, hey, just grab that Llama 405-billion model or NVIDIA's Nemotron 340-billion model, make an API call, and get synthetic data. No, the approach is much more involved, and what we're seeing is that the approaches are essentially using compound systems. It's actually compound AI; I think that was coined
by folks from Databricks and Stanford, from BAIR. So, a plug: for those who are not familiar, please go take a look.

Thank you very much. Thank you.

Yes. So at Gretel we are using a compound approach to generate high-quality synthetic data. If you look again at the reports from Microsoft's Phi team, or Gemini, or even the Qwen report that was released recently, what is very clear is that hitting a single LLM is the last thing you want to do, because you have a lot of tools and a lot of models at your disposal. Even if you have a state-of-the-art model, and let's for now talk about just LLMs, even that model is very likely to generate data that is not correct, that might have hallucinations in it, or that is not complete. So out of the gate you want to be doing much more. You want to be introducing things like AI feedback, specialized models, and additional tools to be able to validate the data. All these tools working in concert is what allows you to generate high-quality synthetic data. But just to emphasize, these tools are not limited to LLMs, because I think when we talk to people today, in 2025, when they hear synthetic data they just think, oh, it's just an LLM, I should be able to generate this on my own. And the answer is no. Actually, generating it on your own would be very difficult and very expensive. And again, that's where SLMs, for example, come into play. What we're seeing clearly is that you can leverage small models, and if you do things the right way, you can actually generate much higher-quality data than even using bigger models. One surprising finding from the Phi-4 report is that in the end they're actually seeing even better results than what comes out of the model that's used to generate the synthetic data, which is interesting, because up until now people have described synthetic data as being essentially data that is distilled from
a model, but what they see in the final model that they train is that it surpasses the capability of the larger model that is in part used in their compound setting.

Yeah, those results were pretty incredible from the paper. Maybe I have a bit of a bias coming from an ML background, but whenever I hear the term synthetic data, my mind first goes to SMOTE, which nobody talks about anymore. But speaking of synthetic data, I know one of the challenges is licensing of the data. Can you talk a bit more as to how enterprises should think about that when they're generating synthetic data that they then incorporate into a downstream model?

Yeah, that's a great question, and this is becoming really, really important, not just for enterprises but for all businesses. I think enterprises are the first to react to this, in part because they have resources: there's the legal department, and they have all sorts of contracts in place. In terms of compliance, and in terms of even the new regulations that are coming out, paying attention to licensing is becoming really important. I think what also hasn't really helped the field is that it has become very easy for pretty much anybody to generate a new model and just upload it, let's say, to Hugging Face. The reality is that small companies don't have those legal departments; a team of one or a team of two certainly doesn't have the time or the bandwidth, or even the inclination, to perform the licensing checks. So what we're seeing today is that there is a huge variety of models out there. You have models that are being offered to users under traditional software licenses like Apache and MIT, and you also have a lot of proprietary licenses. Even when we look at the Llama license, while it is an open-weight model, the license itself is proprietary. For example, it is not recognized as an open source license by the OSI community, who are
the stewards of open source. To complicate things even further, when somebody takes a model, generates data with it, then trains or fine-tunes a model and just decides to stamp a license on it when uploading to Hugging Face, somebody else looking at that model later on may not realize there are actually issues with that licensing. So one of the things that we have done at Gretel, and I really think it's the first use of this in the compound setting, is introduce the concept of model suites. We had to do it essentially because we built everything from the ground up as a compound AI system, and in that compound setting, because you have multiple models interacting with each other and multiple tools playing together to generate high-quality data, you essentially have to keep track of all this, so that users can have confidence in using the data that has been generated. If you come to Gretel, one of the things we built into Data Designer is the ability to specify a model suite. You can say, hey, I want this to be generated with the Apache 2 suite of models, or, hey, I want this to be generated with the Llama suite of models, or whatever state-of-the-art model you want to bring. So yeah, keeping track of the provenance of data and models is, I think, becoming really tricky, and we as a wider community have yet to figure out the right tooling to do this well.

Yeah, and it's also difficult because within a model family the license also changes with each release. The Llama models have been gradually getting more and more permissive, so staying up to date with all this I find to be a challenge.

Yeah, and just one quick callout: it's not just the licenses. People pay a lot of attention to licenses, but complementary to licenses is also the acceptable use policy. Actually, when you look at the Llama series of models, the licenses themselves
look very similar, but the links change, and when you look at where the links point, they point to different acceptable use policies that keep evolving. That's also why it's becoming hard for enterprises to keep track of all of that, because you can't just make a blanket statement that, you know, you're free to use Llama models, for example. With Llama 3.3, for example, you can't use them if you're in the EU, if you're focused on the multimodal vision models. So you basically have to keep track not just of licenses but of acceptable use policies as well.

I'm curious, because we keep talking about the fun of licensing, because now we're all turning into lawyers these days: do you think this will end up becoming one of the main reasons to slow down, in essence, synthetic data overtaking, quote unquote, real data in terms of what people use to train the models? What you've discussed here basically says it's obvious that synthetic data is, number one, not fake data, and number two, if we do it right, it actually is very useful for us to build these models. But I guess what I'm curious about is: is it plausible for synthetic to now overtake real data, presuming the legalities are figured out?

I think so. I don't foresee a scenario where real-world data is not being used, and again, that's because as humans, as humanity, we're still producing a lot of the data. It's just that the reality is that the internet hasn't been around for that long, so we're not producing that data fast enough. And the cool thing about synthetic data is that, sorry, it's not just that you don't produce it fast enough, you also produce it with a lot of data gaps and a lot of issues. So the beauty of synthetics is that you finally have the levers to pull to be able to produce better data to fill those gaps. But we would again be in an accumulation scenario where
you would be combining synthetic data and real-world data. I do foresee synthetic data being a much bigger component of things, especially as folks really start to experiment with it. To loop back to our experimentation conversation from the beginning: if I am able to design the data for my specific use case, and not just design it but iterate on it, then I'm talking real experimentation, real iteration and innovation. There are so many teams that are data poor, that are hungry just to be able to innovate and push the state of the art with respect to their specific application, with respect to what they're trying to solve. In those scenarios, and really in wider consideration, I think we do see synthetic data becoming a much bigger component of the overall solution, where today, for most teams, it's not even part of the equation. Very few teams are actually using synthetic data, so there's just a ton of opportunity there to meaningfully improve models and services and applications so that they are able to perform tasks substantially better.

So cool. Then one of the things that naturally leads me to ask is this: we're pretty much all, I wouldn't say exhausting, maybe that's a little too strong a word, but we're almost exhausting our text data, our generated data. Hence the importance of synthetic data, because again, just like you called out, we're trying to fill in the gaps, we're trying to take into account biases, things of that nature. But I'm just curious: what do you think about the prospect of all the video data we have? We have so much of that going on, and we've barely scratched the surface of actually making sense of it.

Yes, there's a lot of video data that could be used. Many, many thoughts on this front. First, I'm sure a lot of the labs out there, especially when it comes to multimodal data, are already
using a lot of video data. I think the copyrights and licensing there also become a bit of an issue. But something to call out, looking back to Karpathy's comment, is that unfortunately most of the video that is out there, in my opinion, is not particularly high quality, or doesn't have a lot of meaningful content in it, or may not have the content we necessarily want to teach the models. So while we do have a lot of data, I think the challenge we run into there is the quality of data, and in some sense the quality considerations are probably even bigger than in the text data that's out there online, again just because it's so easy for anybody to produce that data. They record a super long video of themselves just doing some sort of trick and post it. Of course the thing becomes viral, but in terms of the informational content in that video, it's probably not the thing you want. What is cool, however, is everything that's happening in robotics. I think there's been a true explosion there in 2024. When we think about all the humanoid robots that are being built, it's super helpful if you're actually able to record a video, through a VR headset, of you performing some sort of task, and then that video feeds directly into what the robot is able to do. I think there's a whole revolution happening now in terms of models being able to interact with the physical world. So that kind of video, yes, absolutely. But if we're talking about just video that's on YouTube, probably not.

What, are you telling me the shuffle dancing that I'm watching on Instagram isn't valuable data? Is that what you're telling me?

No, that is valuable.

Oh my goodness. Yeah, you're scaring me, man.

Denny, I thought you were going to ask whether or not the Data Brew episodes on YouTube are valuable training data.

No, no, no, I didn't want him to give us a real answer to that one.

So, I know we've talked a
lot about synthetic data. I would love to dive into, first of all, what is Gretel AI, and how does Gretel use Gretel internally to improve its services?

Yeah, great question. Gretel is a synthetic data platform purpose-built for generative AI, meaning that if you are working in the field and you are struggling to access data, to be able to use data in your project, we want Gretel to be your first stop. We want you to be able to come in and generate synthetic data that is purpose-built for your use case, to be able to unblock you. So again, we're essentially solving the data access problem, and we're helping you design data. That element of design, I think, is completely new to most AI and ML teams out there, maybe not teams that are working at LLM providers. Again, the reports we spoke about are a clear indication that there are teams entirely focused on curating and designing data. But for the rest of us, for the rest of the businesses out there, it's a daunting task to be able to design data very quickly. That's where solutions like Gretel Navigator and Data Designer come into play, allowing you to very quickly put together a dataset, iterate on it, and scale it up. The other piece that I want to come back to is Safe Synthetics. That's the other pillar of Gretel: if you have data, but the data is sensitive, it has a lot of PII in it, and you're not able to use it because you just can't get access to it, it's locked down somewhere, Gretel Synthetics, and Safe Synthetics specifically, offers the ability to generate essentially a version of that dataset that is private by design, using techniques like differential privacy, which guarantee that virtually no information can leak that would let individual entities be identified in the data. And so, how does Gretel use Gretel internally? Good question. One of the reasons I actually joined
Gretel, well, there were a few. One is that, you know, last year and the year before, the signs were very clear that synthetics is the future, because a lot of LLM providers are using it, and because we have regulation coming out that is essentially a tailwind for people using synthetics, and for Gretel specifically. The other reason was of course the team that I've been working with; I've just been completely impressed with the level of talent at the company. But the third one, which I think is really important, and really important for anybody considering joining a startup or another business, is the fact that Gretel uses its own tooling internally very heavily. Some of the best companies out there use their own tooling first and foremost. I'm sure Databricks uses its own tooling, right? A lot has been written about AWS using its own tooling even before it's released to the public, and that's really how you build some of the best and most lasting products out there. So we do use synthetics quite heavily, because as we build this compound AI system, I mentioned that we're using many different models. These include models that are purpose-built for specific tasks; these are proprietary models. So being able to generate high-quality synthetic data for these models is absolutely key. Being able to design data, iterate on it, and push the envelope with respect to performance is huge. Rarely do you work for a business where you get to use the tooling first, before users even access it. As a result you get to push the boundaries, you get to discover a lot of bugs and fix them, you get a feel for what it means to be working with that tooling. So yeah, it's been awesome, and it was one of my requirements when I was looking for a new gig.

So this is really cool, because not only are you playing with the tools, but I love the fact that you were mentioning
differential privacy as a mechanism to ensure the security or the safety of your data. I'm just curious: did you build it in-house, or are you using another service? And for those who are maybe not as familiar with differential privacy, why don't you also provide a little context around that as well?

Sure, yeah. So differential privacy is a relatively new technique. Maybe to help connect it to something people are familiar with: I think everybody has heard about the Netflix million-dollar prize. It was a watershed moment in AI and ML, where Netflix put up a one-million-dollar prize for anybody who would be able to improve their recommender system. What they did is they released a data set with user interactions, so you were actually able to build a new algorithm on that data, which was based on real-world data that Netflix worked with. And so when you release it, what considerations come to mind? Well, obviously we don't want users to be identified in this data. This was back in the mid-2000s, 2006, 2007. What they did is they essentially stripped PII, so they removed personally identifiable information. If there's a name, maybe that name is gone; if there's some sort of identifier, it's changed. And so the data set was released. But the problem with doing that, if you're just stripping away PII, dropping some fields or replacing them with fake data, is that you're not taking into account what other patterns might be in the data, and you're making a lot of assumptions about the vector of attack. If there was a malicious player who got access to that data, you're making assumptions about the kind of resources they have, and these could be compute resources, or this could be another piece of data they can connect that data to. And that's actually exactly what happened with Netflix. I don't know if people are aware, but Netflix was sued in court because researchers
in Texas were able to connect the data set that was released to an IMDb data set and identify people by their name, by their user ID, and be able to say, hey, we actually identified individuals here. So Netflix had to settle out of court on this, and that is not what you want to do when you're sharing data. Imagine being in a heavily regulated industry: there's a ton of pressure not to do this. Think, for example, about different banks that are working on fraud detection. Just to give a few random examples, you have Capital One, for example, working on a fraud detection algorithm, and maybe you have Bank of America working on their fraud detection algorithm, saving users and the entire industry billions of dollars by being able to detect that fraud. Well, they would be able to do it even better if they could actually share data. But how do you actually do that safely? This is just one of the examples. And one of the things that came out in 2007, actually out of Israel, is a new approach to guaranteeing privacy through something called differential privacy. What you do at a high level in differential privacy is inject a controlled amount of noise into the data. There's a lot of theory to it, we could record a whole separate podcast on all of this, but you're essentially able to come up with bounds for how much information could possibly leak, and how often. And for the first time, you also get very strict mathematical guarantees that no matter what happens after the fact, no matter what kind of function, what kind of processing you apply, those guarantees do not change. So all of a sudden you're not talking about boiling the ocean with stripping out all the PII, which you can't do. You have very strict mathematical guarantees that say, hey, for this epsilon and delta, this is the result you get, no matter what happens downstream, no matter
what kind of compute resources an attacker has or what kind of data sets they have. And so quite a few differentially private algorithms have been developed, and we are working with the rest of the community. We typically talk about differentially private algorithms, and we have internal implementations of many of these differentially private algorithms that could be used to make your data a whole lot safer.

So at Databricks we've been talking about the Netflix Grand Prize challenge, but we stop with the first half of the story, of how Matei and one of his labmates developed Spark because they had trouble ingesting that entire data set, so they needed distributed computing. They got second due to a technicality, the same precision but submitted 10 minutes later, so they didn't get the million dollars, but then Spark was created, and then Databricks. So we stop at that story. I love that there's this follow-on leading to differential privacy.

Yeah, and don't get me wrong, it was an awesome thing that happened, absolutely, because it sparked a lot of innovation, right, and companies like Databricks.

And so no pun intended at all on that one.

No pun, yeah. And really, I think it created that interest in ML, and I hope that it also contributed to a lot of what we see in the GenAI wave as well. But there is that bit of dark history to it as well. And so being able to work with data safely, especially in today's GenAI age, I think is super important. And so we're happy to expose more people to differential privacy and let them develop synthetic versions of their data sets. Because if you now have access to a synthetic version that preserves the statistical properties of the data set but, with mathematical guarantees, does not include information that would help somebody identify a person, that becomes a whole new ball game, because you have many more teams that are able to access that
data, to build with that data, to innovate with that data. And when we look at enterprises, there is so much invaluable data that they're sitting on. Tying this to the recent conversations again, I think we will see a ton of value being unlocked in 2025 and beyond, when companies start leveraging their data safely and in smart ways to really differentiate, because that is the true differentiator that they have: how do they build solutions that are grounded in the data that they have, that are using all the value locked up in it?

Awesome, and I think that's a really strong motivating case for why synthetic data is so powerful and so important, so you avoid situations like that. And just to close things out, since you talked about value in 2025, I'm just curious if you have any forecasts you can share about where you think synthetic data is headed, whether it's this year or in the years to come.

Yeah, that's a great question. So the obvious one is that I think more and more companies will wake up to the value of synthetic data, and we're happy to play a role, and hopefully a very big role, in this. And I think the other piece is that more and more value will be unlocked from that private data. If we look at the past, let's say, three to four years, obviously there have been incredible improvements in the capabilities of the models. But if we were to project into 2025 and 2026, these improvements, unless you're looking at very specialized benchmarks, will be imperceivable to most people out there, to most users. What will be perceivable, however, is the delta that you get from leveraging your own data. And so one of the predictions, I think, is that we will really start seeing models learning on the job. Maybe to put it differently: all the models currently are being compared to each other using standardized benchmarks. So think back to high school or college, like
taking the SATs. All that's fine, but even going through college, ultimately you get a job, and all of a sudden you're talking about learning very different skills. I think LLMs, to some extent, are going through the same thing at the moment. You look at all these large language models: they're all being benchmarked, they get compared on leaderboards, but ultimately where the true value unlock will happen is from them being able to do very job-specific things, and being able to learn on the job to accomplish things much better than standardized tests indicate.

I love that analogy. I had not heard that before, of SATs as the generic benchmark for LLMs. So I just want to say thank you so much for joining us today on Data Brew, talking about everything from synthetic data to differential privacy. And if anybody needs help with generating synthetic data, we know a great platform, Gretel AI. So thank you again, Yev.

Yeah, thank you so much. It's been fun.
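
The differential privacy idea described in the conversation (inject a controlled amount of noise so that the possible leakage is bounded by a privacy parameter) can be illustrated with a minimal sketch. This is not Gretel's implementation, which the episode does not detail. It is the textbook Laplace mechanism applied to a simple counting query, and it gives pure epsilon-differential privacy (delta equal to zero), a simpler setting than the epsilon-and-delta guarantees mentioned in the episode. The function names `laplace_noise` and `private_count` and the toy ratings data are hypothetical, chosen only for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace(0, scale) distribution."""
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of matching records (Laplace mechanism).

    Hypothetical helper for illustration; smaller epsilon means more noise
    and a stronger privacy guarantee.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    # Adding or removing one person's record changes a count by at most 1,
    # so the sensitivity of this query is 1.
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data (assumption): 1000 users, 200 of whom gave a 5-star rating.
    ratings = [{"user": i, "stars": 1 + i % 5} for i in range(1000)]
    for eps in (0.1, 1.0, 10.0):
        noisy = private_count(ratings, lambda r: r["stars"] == 5, eps, rng)
        print(f"epsilon={eps}: noisy count = {noisy:.1f} (true count = 200)")
```

Running the script shows the trade-off the guest alludes to: at small epsilon the reported count can be far from the true value, while at large epsilon it is close to the true value but the privacy guarantee is weak. A production system would also need to track the cumulative privacy budget across queries and use a noise sampler hardened against floating-point attacks, neither of which this sketch attempts.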

Original Description

In this episode, Yev Meyer, Chief Scientist at Gretel AI, explores how synthetic data transforms AI and ML by improving data access, quality, privacy, and model training. Highlights include:
- Leveraging synthetic data to overcome AI data limitations.
- Enhancing model training while mitigating ethical and privacy risks.
- Exploring the intersection of computational neuroscience and AI workflows.
- Addressing licensing and legal considerations in synthetic data usage.
- Unlocking private datasets for broader and safer AI applications.

Connect with Yev Meyer and Gretel AI:
https://www.linkedin.com/in/yevmeyer/
https://gretel.ai/

This episode explores the power of synthetic data in transforming AI and ML, covering topics like overcoming data limitations, enhancing model training, and addressing ethical and privacy risks. By leveraging synthetic data, AI practitioners can improve data access, quality, and privacy, leading to more robust and reliable AI models. The discussion also touches on the intersection of computational neuroscience and AI workflows, providing a unique perspective on the potential of synthetic data.

Key Takeaways

Leverage synthetic data to overcome AI data limitations
Enhance model training with synthetic data
Mitigate ethical and privacy risks in AI applications
Explore the intersection of computational neuroscience and AI workflows
Address licensing and legal considerations in synthetic data usage

💡 Synthetic data has the potential to transform AI and ML by improving data access, quality, privacy, and model training, while also mitigating ethical and privacy risks.
