Python Power: How Daft Embeds Models and Revolutionizes Data Processing //Sammy Sidhu// Podcast #165

MLOps.community · Advanced ·📐 ML Fundamentals ·2y ago

Key Takeaways

Utilizes Daft to embed models and optimize data processing for efficient machine learning in the autonomous vehicle industry

Full Transcript

Hi, I'm Sammy. I'm the CEO of Eventual, and I drink espresso with a splash of almond milk.

Welcome back to another MLOps Community podcast. I am your host, Demetrios, and today I am flying solo, one of the once-in-a-blue-moon times that I do this. I got the pleasure of talking to Sammy Sidhu. This was an incredible podcast. We go through all of Sammy's history: an incredible stint in the autonomous driving, autonomous vehicle era, during which he went through two acquisitions of startups that got bought out by bigger companies. The first one got bought by Tesla, and the second one got bought by Toyota. Everything that he learned from that time, doing this very low-level stuff with TinyML, optimizing and distilling these models, really making things work fast and reliably, he put into his new company, Eventual. Eventual has an offering that is open source, and I encourage anyone to check it out. It's called Daft. I will leave the link in the description for anyone who wants to look at it.

Let's just jump into this conversation with him, because I think you all are going to love what he talks about when it comes to the future of LLMs and machine learning, and just data: how we store the data and how we query the data. His specialty has been the unstructured data world, what he likes to refer to as the complex data world, and how you can make that much easier than just having some metadata that points to an S3 bucket. That's really what they're trying to do with Daft. From my understanding of it, it's much faster, much easier, and much more robust. So let's jump into the conversation with Sammy right now. By the way, if you enjoy this, it would mean the world to me if you send this episode to one friend so they can share in this joy with all of us. Let's get into it. Here's Sammy Sidhu. Talk to you soon.

[Music]

Great to have you here. Of course, you didn't think you were going to come on here without me talking about California
Raisins, huh?

No, that is true. So, I'm Sammy. I grew up on a raisin farm, and now I work in complex data.

Not just any raisin farm, though. Let's talk about this, because there are probably a lot of people listening who don't understand the cultural significance of the raisin farm that you grew up on. I'm pretty sure a lot of millennials and people our age had your raisins in their lunchbox at lunchtime. It was just a little box of raisins that probably came with, like, 20 raisins. I remember distinctly, it was this small box, I think it was red, and it said California Raisins on it.

It is. So I grew up on a farm that was one of the producers for Sun-Maid Raisins, and the box, I think, is the one most people know, where it's the lady putting the tray of raisins out into the sun. And that's exactly what we did. We would grow grapes on a vineyard, cut them down, and then make them into raisins. The cool thing is that no matter where I go in the world, I am always connecting with people over Sun-Maid raisins. I was in Korea for a conference, and I mentioned it at a dinner, and everyone was like, oh, I had that all the time as a little kid too.

Oh yeah.

So it's always something that connects with people.

They monopolize the raisin industry, huh?

No, it's interesting, because it's actually a cooperative. There's a bunch of farms in California that get together and sell their raisins together.

Oh. So what is the process of creating raisins? I know it's dried grapes, right?

It's dried grapes, but the thing with Sun-Maid raisins is that they're made by the sun. So generally what you do is you have table grapes that you grow on a vineyard, and when they're ripe, when the sugar is at a certain level, you cut them down from the vine and throw them on a piece of paper on the ground. Then you kind of just rotate them every few days until they turn into raisins, and then
you throw them into a big hopper.

So you grew up doing that. You would know when the raisin season was and when the harvest was and all that, and I imagine you were cutting stuff down.

Yeah, my parents forced me to work, so my summers were cutting raisins and rebuilding tractors.

So how did that take you into complex data? What the hell was that trajectory like?

Yeah, it's a good one. Let's see. Growing up on a farm, I obviously liked getting my hands dirty, so I worked on a lot of tractors with my dad, fixed everything around the farm, and got into electronics and computers very early because of that. My parents noticed this and got me out of the farm, because, you know, I was pretty much doing nothing at school at that point in the Central Valley. So I moved to the Bay Area, and over here I just got hooked on computers. I ended up staying local, going to Berkeley, and focusing a lot on high-performance computing and low-level programming. Then I got bit by the math bug and got into machine learning, and my professor, who I worked under in college, started a company. That company worked essentially on building autopilot for everyone else. I ended up being the first hire there and eventually the CTO, and that's kind of how my journey got started.

You said autopilot for everyone else. What does that mean?

Well, when you think about autonomous driving, there are multiple meanings. One is what we call level two, level three driving, which is stuff like hands-off driving, where the car will drive itself on, like, a highway. And then you have things that are level four, level five, which we call eyes-off or mind-off, which is essentially: you don't have to pay attention or even look at the road. So what we were building for autopilot was kind of the level two, level three, which is: I can go onto a highway, it would drive itself and do its thing, and
the market we were targeting was everyone else. One thing we noticed around 2018, 2019 was that most manufacturers were already bundling a camera in their vehicles to do things like parking assist or automated cruise control. What we figured we could do is, by making models small, which we were really good at, putting them into the car, and essentially running them on potato hardware, we could unlock all this functionality for self-driving without adding any hardware for the customer.

What were some challenges that you had to deal with?

You're essentially running these really complex deep learning models on potatoes. It's really hard to do. Think about it: the computing power was, let's say, a tenth or a quarter of a Raspberry Pi. So what we had to do is take all this data that we would collect, transform it into something that we could train models on, train these mega models, and then shrink them down to run in real time on tiny hardware.

How would you shrink them down? What were some techniques and tricks that you learned from years of practice, I imagine?

Yeah, it's interesting stuff. First is data quality. If you are training the model on a lot of data where some of the examples disagree or are not consistent, the model has to be bigger to kind of see through that. But one thing we noticed is that if I just get data and clean it very well, you actually need less model capacity to represent your dataset really well. So really clean data was one avenue. The other ones were a little bit more exotic. For one of the things we did, I have a paper on this.

Yeah, I knew there was going to be a paper somewhere in this story.

Yeah, you know it. A lot of the technology wasn't there yet, so we had to develop a lot of it ourselves. I had a paper called SqueezeNAS, and
the idea is that we use neural architecture search to essentially find the best network for the hardware. So what we did is use a neural network to find the optimal network, with the maximum accuracy and the lowest latency, for a given hardware target, and that helped a lot.

I want to know a little bit more before we move on. You became CTO of this company, and you were doing very much the ML stuff, but you also had that low-level distributed computing aspect, I guess. So when you were looking at the systems, what were some of the other pieces that you were factoring in when you were architecting these smaller models on very, very dumb, or tiny, hardware? I guess the proper word is TinyML, on the half or the tenth of a Raspberry Pi. What else were you thinking about as you were deploying these onto the cars for everyone else?

Yeah, there are a lot of different things. I would say the important thing is the whole process. You have these models, and the thing that we were optimizing for was iteration speed, which means producing new models quickly depending on what failures happened. When we think about that, one of the things we needed to be able to do was produce these models quickly on large training clusters, so we were one of the early teams doing distributed training. The thing that was really interesting there is that when you start making these models smaller, you end up stressing the system in weird ways. Because we were training these small models, our GPUs were running pretty efficiently, but the amount of data you had to ingest for these models increased exponentially. Think about it: if you have a large model, the amount of GPU to CPU work is quite balanced, but once you start shrinking that model, the CPU work doesn't go away while the GPU work gets faster and faster. So we ended up getting bottlenecked in all these weird places for training, and we had to, like, kind of
reinvent the wheel to make that work. And then it just kept on going. We kept running into weird issues. When it came to model deployment, what we noticed is that some of these platforms that we were targeting were invented, like, 10 years ago as an accelerator. The hardware that you put in the car, you could think of it as platforms made by a company like Renesas or NXP, these companies that you've pretty much never heard of, but they have a monopoly in the car computer chip space. The thing that sucked about them is that they're kind of designed by committee, so they just have a bunch of random stuff in there. They'll be like, oh, we have this one accelerator for these operations, and this other accelerator for these other operations. So we ended up building a compiler to be like, okay, here's an input model, and then kind of divvy up your model onto these other chips. So what we had to do is design models that had layers that were easily mappable to multiple parts of the hardware.

Hmm. So you had to do all kinds of janky stuff.

Yeah, it was weird. It was weird. And I think the worst thing we had to do: one of the vendors we worked with had a chip whose compiler was only available on Windows, and that was probably the worst month of my life.

And I'm guessing this is a lot of cutting through regulation too, or the chips were like this because they had to go through rigorous regulation and red tape?

Yeah, that is the case. Typically when we talk about self-driving hardware, there are different levels, what we call ASIL, and the idea is: how safe is this? A lot of these chips have failovers and redundancies, such that if something breaks, it can still operate as normal.

Yeah, it makes sense. So inevitably you probably created a lot of stuff
while you were there. Is that what made you see the light and realize that, hey, you know what, with this complex data stuff there's probably a few businesses that you could create around it? I know that you did this, and then you went and had another stint in self-driving cars, right? And then you started Daft.

Yeah, so it's an interesting history. I was at this company called DeepScale, and then we got acquired by Tesla, so I worked through that whole acquisition process. After Tesla, I went to Lyft to lead a lot of their self-driving efforts around perception, and then that team got acquired by Toyota, so I worked at Toyota for a bit. After that, you know, after four companies of essentially building complex data systems, and being both the builder and the user of these systems, I was like, okay, other people need the tools that I had. The tools I had were great, but once you leave, you don't really have anything. So what we want to do is bring the functionality that you're familiar with from a database or data warehouse for tabular data to the complex domain.

So you inevitably thought deeply about whether the world needs another database. What made you decide? And explain more about what you're creating with Daft.

Yeah, for sure. I guess I'll talk a little bit about what we see as the common issue right now if you're a company dealing with complex data. What I most often saw, including in the systems that I initially built, was bridging together the best of relational with these kind of batch processing systems. So imagine you are a self-driving company or a biotech company. You have a bunch of things like images, videos, 3D scans. What you would intuitively build to process these images end to end is: get a traditional database or data warehouse, put all the metadata in there, and when it came to the assets, like the video or image, you would just
have, like, a pointer to some remote storage like S3.

So you'd put a bunch of images in S3 and have a database with all the metadata?

Yeah. And what would then happen is, when I wanted to process this, I would have to do three different steps. I would first have a data analyst write a SQL query to say what data I should process. Then you would run that query and get a list of files, and you'd use something like Spark to process them and give you the results, which you'd then dump to S3 or back into a SQL table. Then you would finally get another data analyst to process the end results and give you the analytics that you cared about. What we saw were these really nasty systems where you had to talk to three different teams to do this end to end. And I was always like, okay, first, if we're running a query, we can't optimize it end to end. And number two, dealing with Spark with images or videos or 3D scans was a nightmare. Spark doesn't have the right abstractions to represent this data natively, and what would happen is that one machine would use too much memory from having all these images in memory, and it would OOM, run out of memory, and die. Then the work would rebalance, and that would just knock out the rest of your cluster like dominoes.

Dominoes, yeah.

Yeah, and it always just killed me going through, like, 10,000-line log files of Java errors, finding the one line that caused my bug. I think I've spent, collectively, weeks of my life on this.

So painful.

Yeah, it's so painful. And I always thought, hey, you know what, if we had a system that natively understood that this is a two-gigabyte file and didn't just load it dumbly into memory, that would make my life so much easier. So after all my stints in self-driving, I talked to a few companies about, like, hey, if I work here, what would you want me to work
on? And of the companies that I talked to, four out of five said, hey, we would want you to handle this whole side of unstructured data processing. And I was like, huh, this sounds like a good startup idea. Then I paired up with someone I worked with at Level 5, Lyft Level 5, who was also very passionate about the space, and we started the company, Eventual. Our first product, the one we're working on, is called Daft, which is an open-source query engine for all those types of data, where it understands them very intimately.

And the idea is that you can load in all of this data and then query it easily, and the metadata lives with the unstructured images. That is my understanding. Is that correct?

Yeah, yeah. I would say it's a distributed data frame that's all Pythonic, but the entire engine is written in Rust. What it looks like to the user is something like a pandas DataFrame, where I can say: I want to read these millions of files in S3, I want to process them as images, I want to crop them and resize them, and then I want to load them into a machine learning model for either inference or training. What it looks like to you is that you're just building this query using a lazy API, and when it comes time to execute, it will actually run on a cluster using the query plan it develops, while also intimately understanding the data it's running on. So no more OOM errors, and very optimal queries for complex data.

That's awesome. So then this is Daft. Eventual is what?

So, Eventual. We want to be a lot more than managed Daft. We have users, and most of the users of Daft right now are actually in the enterprise space, so they're using open-source Daft to do a lot of query processing of unstructured data. The question we get after, you know, how do I run these queries on complex data or unstructured
data, is: how should I store it for better retrieval? So the next product that we're working on, which will still be open source but with an enterprise managed version, is about how to store complex data efficiently. Think of it as a complex catalog for everything unstructured.

And it differs from just that S3 bucket in what ways?

So if you're familiar with a data warehouse or a database, you get all these amazing features, like governance, schema evolution, and time travel, but there is none of that for, for example, a bucket of images in S3. What we're trying to do is give you the things that you're familiar with from BigQuery or Athena or large data warehouses, but for unstructured data. Think of it as: I want these teams to have these permissions, I want to add this new column or delete this column, or I want to have a retention policy. These are things that we're adding on top of just plain old S3. Also, much faster loading: rather than just a bunch of single images, we can compact them into something that's really easy to load.

Oh yeah? How's that?

So there are some interesting formats you can use. What we see right now is people stuffing a bunch of images in Parquet, which is great for tabular data but isn't necessarily the best format for images. So for a lot of this we're developing our own container formats, to be able to load image data really efficiently or, you know, seek to certain parts of an image directly from cloud storage.

Oh, nice. So inevitably, man, I'm sure everyone is asking you about how you fit into this large language model world, and unstructured data is so hot right now. What is your thesis, or how do you look at it, whether it's text to image or text to animation, text to video, any of that, or just straight generative AI with an OpenAI call? How are you seeing what you're doing at Eventual playing into this, like, new
paradigm of machine learning, if at all? Because maybe you're like, yeah, right now we're focusing on this slice, and it's not actually that important to us to get distracted by the shiny new toys.

Yeah, I mean, that's a good question. I would say a lot of what we're building is compatible with the future of LLMs. Some of the use cases we're working on with some of our early users are around LLMs and generative AI. A big one is actually retrieval. Are you familiar with chain of thought for LLMs?

Yeah, dude. So, we talked about the example earlier. There's this great paper from Alex Ratner and a lot of other people whose names I can't remember, about distilling step by step, and it's basically distilling the model. Have you seen that paper?

Yeah, yeah.

Okay, you know about it, but I'll explain it for the listeners in case they missed it. Distilling step by step is basically distilling the models, but when they distill from the large language model, they're asking for this chain-of-thought reasoning. That makes it much easier to get the distilled model and train it with less data, because the metadata is so rich from the chain-of-thought prompting that you get from the large language model when you're creating the training data for the distilled model.

Yeah, it's super cool stuff.

It's crazy to me how different distillation is now compared to distillation when you used to do it back in your day.

Back in my day, distillation was kind of dumb. What you would do is create a big model, and then you would essentially use its output as the ground truth for a smaller model, and that was it. Maybe the big model understands the labels better than the ground truth, I don't know. But anyway, we've come a long way from there.

Anyway, I distracted you. Tell me about why you were
talking about distillation in the first place.

Oh yeah. So the way I think about chain-of-thought reasoning is that it's quite compatible with this. One of the use cases we're doing now is being able to process data like the RedPajama dataset: being able to load it and filter it, and then do things like tokenization, run an open-source model on it, and then do things like: I have a set of embeddings, and I want to do a hybrid search over a large dataset. Find me sources of text that were published on this day, have this type of style, and are most similar to this. That's kind of the first-level use case. The next-level use case is plugging Daft straight into the LLMs. Say I ask the LLM, hey, give me some documents that are written in the style of, I don't know, who's your favorite author?

Hunter Thompson.

Okay, Hunter Thompson. The thing is, for a lot of these models, we don't really have things precomputed, and we don't necessarily know what the model's query will look like. So what we can do is plug in Daft as one of the query sources for the LLM. Think of it like this: we've seen all these demos of LLMs writing SQL, sending it off to a database, retrieving the results, and then parsing them. We can do the exact same thing for complex data: have LLMs write these queries, search over these large corpora on S3 or in these data lakes, bring back the results, and then have the LLMs interpret them.

Interesting. Wait, say that again. I'm not sure I fully grasped what you were saying there.

Oh, what I'm trying to say is, just like we can provide a SQL engine for LLMs today to understand some of your tabular data, we could do the exact same thing for complex data: I can have my data lake full of multimodal or complex data and have the LLM write Daft queries to then try to understand
them better.

Oh dude, that is awesome. Okay, I see what you're saying. So these are some of the large language model use cases. I imagine four out of those five dentists that you talked to earlier, or whatever they were, when you went out to get a new job before starting Eventual, and they said, hey, we want you to work on this use case in this problem area, they weren't necessarily doing things with LLMs back in those days. So there's a ton of other use cases. Is where you're seeing Daft really take off your past-life kind of world, the self-driving car world? And I know you also mentioned drug discovery. Is that another area where you see it taking off?

Yeah, so the areas where we see it taking off, and the areas we're focusing on, are essentially the domains that don't get much love. Right now, if you're trying to process images, there's a bunch of different tools that sort of work, so that's kind of crowded. But when we actually talked to a lot of our users, if I'm trying to build queries on audio data, like read all these different audio sources, transcribe them, and then do various analytics and filtering, there's not really a good tool for that. So what we really lean into is making sure we work really well for these domains that don't get quite as much love. That includes audio, 3D assets like game assets, and microscopy images from biotech, which are these large lossless images that are hard for Python libraries to typically interpret. And also just weird, wonky things. A big use case of ours is: someone sets up a Kafka stream and dumps a bunch of protobufs in S3, and now they're like, hey, how do I filter over a billion protobufs for things that have this in their field? And Daft works great for that.

Oh, nice. Okay, so you could potentially
work incredibly well for playing around with and querying a lot of MLOps Community podcasts.

That's a great use case. If there's a bunch of sources or URLs for the podcasts, we can pull them in, transcribe them, chunk up the text, and then, you know, find the interesting things you said, and then maybe fine-tune a model on your voice and replace you.

Because then I don't even have to be here. I can just have you interview the fake me. Maybe I'm not even real right now, man, who knows? Yeah, we'll have to do some kind of hackathon on that later, because I think it would be super fun to pull all that data. And it sounds like you're making it really easy compared to what I had thought we would have to do.

Yeah, it's really easy. I mean, you could spin it up in Colab and get the query working, and then it's a one-line switch to go from running locally to running on a distributed cluster.

Nice. Wow, that's awesome. So now, what are some things that you took from your self-driving days and all this TinyML stuff that you were doing, and that you're now bringing into Daft or Eventual?

Yeah, I think there's a couple of concepts that are really important. One thing that I took away early was: if you focus on traditional data systems in the tabular space, the individual rows don't really matter. It's about the analytics result, right? But in complex data and self-driving, the individual rows matter. If I'm trying to search for failure cases, the individual results matter, not the analytics part of it. So you almost want a system that's like hybrid transactional/analytical, but you're not really doing transactions. So one thing we built into Daft early is: how do we make sure needle-in-the-haystack queries are really performant and easy to do?

Yeah, that makes 100% sense. So it feels like, and this is again going back to that whole idea of the
self-driving cars, I know we mentioned it beforehand, it's one of those high-risk, or very high, what, there's a special word that the EU uses to classify the ML use case. I can't remember. It's something high. Damn, I can't remember. But basically it's dangerous, because there are lives that are potentially affected in very bad ways if it goes off the rails. So making these needle-in-a-haystack cases something you're able to find really quickly in Daft seems like, yeah, a no-brainer. You came from that world, and now you're seeing, hey, how can I make that very useful even if it isn't in this high-danger situation?

Yeah, yeah. The needle-in-the-haystack type of queries, we see them a lot now in the generative AI space as well: find me this document that exhibits XYZ. You're not asking how many documents there are that exhibit this; you're saying, give me the documents that exhibit this. We see this pattern quite often now, where you want kind of the best of both worlds, so we're doing that. And the whole safety thing is quite interesting, because I'm seeing a lot of similarities between how the self-driving domain thinks about safety and the whole LLM stuff that's happening now. I would say for any AI you develop, it's important to have a safety net so it can't do harm, and the way you think about how rigorous this thing has to be is: what can it do? When you think about self-driving, you need a way for something to take over. So we would have the self-driving stack, and you would have a safety layer in software to prevent things like, you know, hitting the curb or hitting a person, which sat outside of the traditional system that would be running the end-to-end automation. But then you also have a safety driver who would take over in a second. So essentially they would be, you know, hands hovering over the wheel, and
grab the car and disengage whenever something bad happens.

Yeah.

And the way that any of these AIs are developed, including the ones that we see now, is that you have feedback loops. You essentially put something out there, you observe how it does, you see the failures, and then you improve it. For self-driving, that feedback loop is very expensive, because you need a human to literally constantly monitor it, so all the feedback, all the validation you're getting, is supervised by a human. But the whole gen AI space is quite interesting, because I put it in two domains. One domain is where it's okay to fail, and I see that in a lot of creative applications, where if I'm doing copywriting or, you know,

Image generation, right?

Yeah, exactly. In any of those, there's not really a right answer, and so it's okay to fail. That's a use case where you can put something out there, and the feedback you're getting from users is which images they're actually clicking on and downloading out of what they generate. But then in other domains, like, let's say, legal AI, you still have that feedback loop, but the outputs have to be reviewed by lawyers. Each time it's run, though, it's going to improve over time. So I kind of view it as analogous, but a little bit different.

I remember the word: high stakes. Yeah, dude, I can't believe I was totally blanking on that one. So, going back to what you're saying, though, 100%, I fully agree with you on that. There are these expensive feedback loops, but you also want to make sure that there is some kind of safety net just in case. And it feels like there are these use cases with generative AI where you don't need safety nets per se, because the worst that can happen is you get a deformed face on an image that you generate, or you get a copyrighted image, and you've got to figure that out and make sure that you just generate something better, and you become the
curator and whatnot. Or when it's generating text, then you can add your flavor to it, and there's not really any high stakes there. But when it comes to more of this legal documentation, I mean, there are a few different high-stakes potentials here where you do need a bit of a safety net.

Yeah, agreed. That's the thing I feel like needs to be understood better. I see a lot of companies or individuals starting these projects, and they are great, but, for example, with legal AI, I think I see a future where it works, where the lawyers are using it. But at the same time, I pay my lawyers a lot of money because they don't make mistakes.

Sure.

Otherwise I would do it myself, you know what I mean?

Yeah, exactly: I know how to prompt this. Whatever Harvey came out with, I could figure that out on my own. I understand what you're saying on that. And then coming back to what you're doing with Daft and how you were thinking about that, how does that tie in?

Yeah, so in a lot of these use cases, for example what Harvey might do, they have these corpuses of legal documents, and they want to be able to get these documents, use image processing on them, extract out text, and then run them through LLMs. Those are things Daft would be a really, really good tool for. What we see right now in the gen AI space is that a lot of people are just writing Python scripts that go over each file, extract the text manually, and then hit the OpenAI API, and then they kind of dump that either in a file or in one of these vector databases. And then the process can literally restart once they switch the model or switch the kind of thing that they're doing. And so we want to make that really easy to do, which is: I can build my entire pipeline just using a data frame like I would do for pandas, and when you're ready to commercialize it or make it production ready, you just switch a couple of flags and then you
can run this on a cluster. So it makes their life a lot easier, is what I would say.

Yeah, oh dude, I see the vision then, and I understand that. So one thing that I also wanted to bring up, and you kind of hinted at this before, is how you think about the needs and trade-offs when it comes to tabular data versus this unstructured or complex data: certain things that you need in tabular data and other things that you need in complex data, and how you look at the two architectures. If you're setting up a system right now and you only need to go down one route, maybe it's the tabular data route, then what are you going to be building for and optimizing for? And then if you have complex data, what are some things that you're keeping in mind and optimizing for?

Yeah, that's a good question. I guess talking about the axioms is really important, I think. So when we talk about tabular data, you typically think of integers, floats, things like strings, and most of the time when you have this data, the queries you're running are usually aggregations. What that means is that I have these files of text, they could be large files of text, and what I'm essentially doing is going through each row and then doing things like min, max, sum, these operations that are very cheap to compute. So if you think about it as the volume of data I have versus the volume of compute, it's actually much more data heavy. I might have gigabytes of data, but I only have, you know, billions of operations, so they're quite balanced. When we talk about images and things like text for LLMs, it's completely different: if I have a one kilobyte, or sorry, a hundred kilobyte image, I am processing potentially trillions of operations on it.

Uh-huh, right.

Or if I have, let's say, a work of Shakespeare, which is only about a megabyte, I will be processing quintillions of operations using an LLM. And so the ratios that we
think about, of compute over the data, are completely different. So when you're building complex data systems, you have to think: I will be compute bound almost every single time. Whereas if you think about traditional query engines, all they're doing is optimizing for reading from S3, and they don't really bother about the compute, which is fair, because they're going to be bottlenecked by that.

So then, okay, I love that you break it down by the axioms and you think about, hey, let's look at the fundamentals of one versus the fundamentals of another, and what the bottlenecks are going to be and what you're going to encounter. So then as you're building systems around this, what are some things that you would say I definitely want to have in my toolkit if I'm dealing with one versus the other?

Yeah, I would say if I was building an analytics engine, I would go for something simple, and something where failures are handled in a way where we can recover easily. So for example, if you're running things on a Spark cluster and something fails, it's fine, because the amount of work you have to reproduce is not bad.

Okay.

But if you talk about that in terms of a complex data cluster, that does not work very well. And so what we're building around that is first making sure Daft understands the data types natively, so it understands what an image is, how expensive it is to bring into memory, how expensive it is to send it around, and also for the various other formats as well. Also understanding that placement and scheduling is very important, and that we require really high utilization. An example is, if you take a lot of these complex data workflows and map them to something like a Spark cluster, you only use about 20% of the hardware.

Okay.

And so if you're using GPUs, or you're running LLMs or a computer vision model on this data, you're pretty much burning five times more money than you have to. And so some of the
things that we're doing here is saying, okay, our users are typically enterprise companies who want to save money, and by porting things into Daft they can run things in a way that's Pythonic, using data frames, but then still get really good utilization and cheap throughput, essentially, for their whole system. And so the things we're designing around there, once again, are really understanding the data we're working with, having query plans that are not necessarily optimized for tabular analytics but for complex data processing, and actually being really tied into the hardware we're running on: really understanding the AWS machines we run on and how to get the most out of them, built into the framework.

I wanted to go down one route, but then when you said that last part, something else got triggered. You have this background in TinyML. How does that play into things? With Daft, what is it optimizing for, or how are you thinking about those kinds of use cases?

It's pretty interesting. I guess one thing I'll go into first is: what's the eventual goal of Daft? No pun intended there. So the way that we're building out Daft right now, our user interface is in Python, and all of the engine and stuff is a hybrid between Python and Rust. As we're building more and more, we're getting to a point where we can run completely serverless, where you can run things in Python on your laptop, but then when it executes, it can completely route out to a cloud cluster and then spin down as soon as it's done. And one of the difficult things with that, with models, is that most of the models right now require Python. So you typically use PyTorch or TensorFlow and you embed that into Daft, or in typical workloads you use Python and do a bunch of this glue code. But one observation that we've noticed is that, although in the
beginning we wanted to make it really easy to build models in Python and run them in Daft, we noticed that a lot of people just run the same models in Daft. They run the same LLM model, they run the same computer vision models, and they might change things like weights, but the models are generally the same. So one of the things that we're going to in the future, which is where my background in TinyML comes in, is: how do we actually embed models in Daft, where we can compile them down to hardware, run them on CPU or GPU natively without any Python or any framework, and then execute that very efficiently on the cluster? And so that's one of the things that we're working on, which is, if I have something like a Llama model, we can actually just package that as part of Daft, run it on the cluster, and there's really no work you have to do on your part.

And then you can bring it down to running it on a potato, as you said, and having this capability just out of the box that works.

It works. And I think it's getting really interesting, because if we think about the inference per dollar for a lot of these models, GPUs are usually the most efficient in terms of inferences per dollar. But the problem is no one can get GPUs now. There's a massive shortage, which is stuff that I dealt with when I was working at Lyft. We would use all the GPUs in a single data center, and we had to learn a lot about how to do things differently so we could actually do our work. And we're seeing that now with this whole gen AI wave. I'm in this Slack group for AWS support, and every day people are like, can I get more GPUs? And the guys are like, no, that's all we have. And so what do you do if you need your work done and you can't get GPUs? You need to adapt.

Yeah, you have to figure out what's the way around this and how can we make it work without the
GPUs that we have been relying on this whole time. So what's the workaround?

Yeah, so I think a big one is, I mean, have you seen all that really cool work on llama.cpp?

No, wait, is this the one where it's just super small Llama models?

Super small. So this guy essentially got Facebook's Llama model and then...

Yeah, he did Whisper, and then he did this.

Yeah, yeah.

And you can run it on your computer, right? You can run it on a CPU on whatever laptop.

Yeah, exactly. And so what he did is he made the model smaller so it fits better in memory, and then he wrote a hyper-optimized version purely in C++, using vectorized operations and whatnot. There's one model, and so when you compile the model, you essentially have a single binary that does the inference for the model. And I kind of see that as the future for a lot of these LLMs, which is: you can package these standardized LLMs that are optimized for your hardware as part of your query, and use them to extract or do generation or whatever task you want.

Incredible. I mean, this is making the assumption that, as I keep telling people, open source models are going to be good enough. Right now they're not there, but the big assumption that everybody's making is that open source will be there in six months, don't worry about it, it's going to happen no matter what.

Yeah, I'm big on open source. Our whole company is built on open source. I think open source will take off. I've seen a lot of analogs with how computer vision was. I started doing computer vision in, like, 2012.
And during that time the big, bad computer vision algorithms were all in companies. You had companies like Clarifai, you had Google, you had Apple; they were doing all this crazy computer vision. But then you would have these open source research papers come out, and they would be a little bit better, and then the companies would surpass them. And then we got to a point where people just kind of stopped caring, because the open source was just so good. What people do is they just use them as a black box for their tasks, and that's kind of what we see now for computer vision.

Yeah, I mean, I hope it is like that. Don't get me wrong, I am a huge proponent of the open source world doing its thing and optimizing it and making it free and open, of course. We just haven't seen it yet. A few of these attempts, I try and play around with them, and I'm like, dude, this isn't six months behind, this is like two years behind.

I would agree with that. It's probably going to be more like years, one to three years, rather than months, because, I mean, the moat here once again is a big cluster to train these models on.

Yeah.

If you have a huge cluster where you can train these models, then I think you can do it. But that's to get to the base level. I think the next part is then making these fine-tuning datasets of, like, Q&A, like OpenAI has done.

Yeah, 100 percent.

I'm excited for it though. I'm a big fan of open source, and I think it will take off.

Awesome. Sammy, this has been fascinating, man. I love talking to you about all of this: the history of where you've come from, where you're going, what you're doing. The last question I have for you is this: over the years you have inevitably succeeded on a few things and failed on a few things. Where do you feel like you succeeded where others typically have failed, and why do you think you succeeded?

Hmm, it's a good question. I feel like it's because I care about things that people told
me not to care about.

Oh.

And I think the biggest one is that when I started my career, the advice I got from a lot of software engineers was: oh, don't worry about making this run fast, or don't worry about going low level, it's not worth it. And the advice I got from them came from an era when computers were getting faster while you did no work, so you could write shitty software and computers would just get faster, and then your code would no longer be an issue. Or the other thing they would say is that computers are much cheaper than software engineers. But then I would always put my optimizer hat on and go into the assembly or go into the low level and actually just really understand why things were running the way that they did. And so a deep skill like that ended up paying off. At Lyft and Toyota it ended up paying off, and now I feel like that experience is really getting there, which is understanding why things are slow or why things are fast, because I did not listen to the advice people gave me in the beginning of my career.

Incredible, man. This has been so cool. Thank you so much for coming on here. I think we'll end it there.

Yeah, thanks for having me. It's been a great time.

Hey everyone, my name is Aparna, founder of Arize, and the best way to stay up to date with MLOps is by subscribing to this podcast. [Music]
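The compute-versus-data ratio and the utilization cost discussed in the conversation can be sketched with rough arithmetic. This is only an illustration under stated assumptions: the byte and operation counts are order-of-magnitude placeholders for the figures mentioned (gigabytes against billions of operations, a hundred-kilobyte image against trillions, a megabyte of text against quintillions), and the helper names `ops_per_byte` and `cost_multiplier` are hypothetical and not part of Daft.

```python
def ops_per_byte(total_ops: float, total_bytes: float) -> float:
    """Rough compute intensity: operations performed per byte of input."""
    return total_ops / total_bytes


def cost_multiplier(utilization: float) -> float:
    """How many times more you pay than at full utilization.

    At 20% utilization, the same work needs 1 / 0.2 = 5x the hardware time.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return 1.0 / utilization


# (operations, bytes): placeholder magnitudes, not measurements.
workloads = {
    "tabular aggregation": (1e9, 1e9),      # ~GB of data, ~billions of ops
    "image through a model": (1e12, 1e5),   # ~100 KB image, ~trillions of ops
    "text through an LLM": (1e18, 1e6),     # ~1 MB of text, ~quintillions of ops
}

for name, (ops, size) in workloads.items():
    print(f"{name}: {ops_per_byte(ops, size):.0e} ops per byte")

print(f"cost at 20% utilization: {cost_multiplier(0.2):.1f}x")
```

With these placeholders, the tabular workload sits near one operation per byte, which is why reading data dominates, while the image and LLM workloads are many orders of magnitude more compute heavy, which is why utilization of the hardware drives the bill.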

Original Description

MLOps Coffee Sessions #165 with Sammy Sidhu, Python Power: How Daft Embeds Models and Revolutionizes Data Processing.

// Abstract
Sammy shares his fascinating journey in the autonomous vehicle industry, highlighting his involvement in two successful startup acquisitions by Tesla and Toyota. He emphasizes his expertise in optimizing and distilling models for efficient machine learning, which he has incorporated into his new company Eventual. The company's open-source offering, Daft, focuses on tackling the challenges of unstructured and complex data. Sammy discusses the future of MLOps, machine learning, and data storage, particularly in relation to the retrieval and processing of unstructured data. The Eventual team is developing Daft, an open-source query engine that aims to provide efficient data storage solutions for unstructured data, offering features like governance, schema evolution, and time travel. The conversation sheds light on the innovative developments in the field and the potential impact on various industries.

// Bio
Sammy is a Deep Learning and systems veteran, holding over a dozen publications and patents in the space. Sammy graduated from the University of California, Berkeley where he did research in Deep Learning and High Performance Computing. He then joined DeepScale as the Chief Architect and led the development of perception technologies for autonomous vehicles. During this time, DeepScale grew rapidly and was subsequently acquired by Tesla in 2019. Staying in Autonomous Vehicles, Sammy joined Lyft Level 5 as a Senior Staff Software Engineer, building out core perception algorithms as well as infrastructure for machine learning and embedded systems. Level 5 was then acquired by Toyota in 2021, adopting much of his work. Sammy is now CEO and Co-Founder at Eventual, building Daft, an open-source query engine that specializes in multimodal data.

// MLOps Jobs board
https://mlops.pallet.xyz/jobs

// MLOps Swag/Merch
https://mlops-community


