No Priors Ep. 11 | With Matei Zaharia, CTO of Databricks

No Priors: AI, Machine Learning, Tech, & Startups

Key Takeaways

Databricks' CTO Matei Zaharia discusses the potential of smaller and more accessible AI models, such as Dolly, and the limitations of current large language models, highlighting the need for more advanced AI systems that can reason and make decisions. He also talks about the importance of long-term thinking and making decisions that will not be regretted in the future.

Full Transcript

[Music]

Welcome to the podcast, Matei.

Thanks a lot, excited to be here.

Can you start by telling us a little bit about the origins of Databricks and how it led you to where you are today?

Sure. Databricks started from a group of seven researchers at UC Berkeley back in 2013. We were really excited about democratizing the use of large data sets and of machine learning. We had seen that the web companies at the time were very successful with these things, but most other organizations, things like scientific labs and so on, weren't. We were really excited to look at making it easier to do computation on large amounts of data and also to do machine learning at scale with the latest algorithms. So we had started doing our research, we worked with some of the web companies, and we also started open source projects, most notably Apache Spark, the first version of which was essentially my PhD thesis. We had seen a lot of interest in these, and we thought it would be great to start a company to really reach enterprises, make this type of thing much better, and actually allow other companies to use this stuff.

Can you just give us a sense of what Databricks looks like today from a scale and product suite perspective?

Sure. Databricks offers a pretty comprehensive data and ML platform in the cloud. It runs on top of the three major cloud providers, Amazon, Microsoft, and Google, and it includes support for data engineering, data warehousing, and machine learning. Most interestingly, all this is integrated into one product. So, for example, you can have one definition of your business metric that you use in your BI dashboards, and the same exact definition is used as a feature in machine learning. You don't have this drift or copying of data, and you can just go back and forth
between these worlds. The company has about 6,000 employees now, and last year we said that we crossed a billion dollars in ARR, and we're continuing to grow. It's a consumption-based cloud model, where customers that are successful can grow over time and bring in new use cases and so on.

Did you think the opportunity was as big as it has been when you started the company?

Yeah, we did. Well, we definitely didn't necessarily anticipate growing to this size. A lot of things can go wrong. But we were excited about the confluence of a few trends. First of all, it's so easy to collect large amounts of data, and people are doing it automatically in many industries. Second, cloud computing makes it possible to scale up very quickly, do experiments, scale down, and so on, which enables more companies to work with this kind of thing. And then the third one was machine learning. So we thought these are powerful trends, and the exciting thing for us as a company is that we didn't invent cloud computing, and we didn't necessarily invent big data or anything, but we were able to start at a point in time when many companies were thinking of moving into this space, and just provide a great platform for that. There's this migration already happening, and if you provide the best platform as people are migrating to the cloud, they'll consider it.

You still keep roots in research. You have a research group at Stanford. Can you talk about that?

Yeah. I'm a computer science professor there, so I split my time between that and Databricks. We work on a bunch of things, and we usually like looking farther ahead into the future. We've worked a lot on scalable systems for machine learning: how to do efficient training on lots of GPUs and stuff like that, or how to do efficient serving. And then another thing I'm really excited about, that we started
about three years ago, is looking at knowledge-intensive applications, where you combine a language model with something like a search engine or an API you call, and you try to produce a correct result, maybe for a complicated task, like do a literature survey and then tell me what you found about this thing, with a bunch of references or counterarguments or whatever. I have a great group of PhD students that are working on that and exploring different ways to do it.

How did Databricks decide to start working on Dolly? What sparked that, and how did you first get going on it?

We've had customers working with large language models of various forms even before ChatGPT came out, but they were doing the more standard things like translation or sentiment analysis. A lot of them were tuning models for their specific domains. I think we had almost a thousand customers that were using these in some form. But then when ChatGPT came out in November, it got people interested in using these for a lot more than just analyzing a bit of data, and instead creating entire new interfaces, new types of computer applications, and new experiences in them. So there was an intense interest in this, even at a time when the industry in general is being conscious about spending and which things are really required. This was an exciting one. And the really exciting thing about ChatGPT, as you both know, is the instruction following: basically its ability to carry on a conversation, listen to the things you're telling it to do, and do those, as opposed to just completing text or just telling you a small amount of information, like this is a positive or negative sentiment. So we really wanted to see whether it's possible to democratize this and to let people build their
own models with their own data, without sending it to some centralized provider that's trying to learn from everyone's data and control their destiny in this space. We were exploring different ways of doing it. In particular, Dolly is partly based on this great result from some other faculty members at Stanford, called Alpaca, where they used a model to generate a bunch of realistic conversations, and then they used this to train another model that can now carry on a conversation on its own. So we tried essentially cloning that approach, but starting with an open source model, and it actually worked pretty well. That's how it became Dolly. But yeah, we've been looking at the space for a while and have seen incredible demand for these kinds of applications.

I think the industry has really been very focused on scaling data, parameter size, and FLOPs, and I think you all have really showcased the power of instruction following, even with something that's relatively smaller scale. Could you explain that and how that all works?

It's very interesting, and I think there's actually a lot of research still to be done here, because these models have been mostly locked up in these very large companies for a while, and everyone thought it's too hard to reproduce them. The interesting thing is that language models had existed for a while. You basically train them to complete words: here's a missing word in the text, can you fill it in? At the beginning, when people tried to apply them to real applications, not just 'I erased a word on my homework, fill it back in' but actual applications, they had always done it in various ways by training something else on top of, say, the feature representation in these. So there was a lot of domain-specific work, but you could build
a sentiment classifier or stuff like that: is it positive or negative? Probably three years ago now, OpenAI published the GPT-3 paper, which is called "Language Models are Few-Shot Learners." They said, number one, we trained a language model with 170 to 175 billion parameters, and we trained it on, I think, something like 45 terabytes of text. So lots of data, lots of parameters, and it's pretty good at language modeling. And number two, they said you can actually prompt this with a few examples of a task, and it picks up on the task and does it. So lots of people were working on that: how do you prompt it, what's the best example to show? But everyone assumed that for that capability you need a giant model to begin with. So even the researchers in academia were calling into GPT-3, trying to build stuff based on it and study this phenomenon. Then last year, in 2022, OpenAI published a second paper, which was about instruction tuning these models, where they said, hey, we used some human feedback and then some reinforcement learning, and we got this GPT-3 model to actually just listen to one instruction. It doesn't need a complicated prompt with lots of examples, and it kind of works. Then they released a version of this as ChatGPT. So I think in a lot of people's minds, the scientific view of it was: first you need a giant model, and then you need this reinforcement learning thing, and only then do you get this conversational capability and broad world knowledge. So it's actually very surprising. With Alpaca, we just had a larger data set of human-like conversations, and we had this very modest-size open source model that has only 6 billion parameters and is trained on less than one terabyte of text, so about 50 times less data than GPT-3, and it still has this behavior. I think it's been pretty surprising to a lot of researchers, the size of model that still gets
you this kind of instruction-following ability. So I think it's an open research problem: what exactly about these data sets is it that makes them good at this? What are the limitations? Are there tasks that these are clearly worse at or better at? It's actually kind of hard to evaluate with long answers, because it's hard to automatically score them and say this is a good Seinfeld skit that you generated and this is a bad Barack Obama speech. But I think we'll figure this out.

Were there any things that emerged from the model that you also found surprising? You mentioned one aspect of it just in terms of the approach you took: with dramatically more limited data, you ended up with really performant behavior. Were there other unexpected properties of what you did with Dolly?

I think to me the most interesting thing is that it's surprisingly good at just free-form, fluent text generation. You can tell it to create a story, or create a tweet, or create a scientific paper abstract, and it does a pretty good job at that. Before that, whenever I talked to my NLP researcher friends, they thought that creativity was the thing that required a lot of parameters from something like GPT-3. They actually told me, oh, the knowledge-intensive stuff, like remembering facts, tell me the capital of France and whatever, it's not surprising that a small model with few parameters can do that, but the creativity, that's really hard. So this one is actually pretty good at the creativity and generation. It's less good at remembering lots of facts, which kind of makes sense given the parameters. So if you ask it about common topics, it'll be good. If you ask it the author of a book, it might give the wrong one. I think we had an example, because we've actually been building a slightly bigger version of this too,
and we had this question, who is the author of Snow Crash, which is Neal Stephenson, and the initial Dolly model said Neil Gaiman. So it's still a Neil. So it's less good at remembering facts, but pretty good at coherent generation.

The name Dolly basically references the first cloned mammal, Dolly the sheep. Can you explain the reference within the AI space?

Yeah, so it's based on cloning this other model from Stanford called Alpaca by doing it with an open data set. And that itself was based on something that Meta released, I think maybe three weeks ago or less, called LLaMA, where they took a modest-size model, seven billion parameters, and trained it on a ton of data. I think they said 1.4 trillion tokens or something like that. I don't know how many bytes of data it was, but it was multiple terabytes of data, basically. And they said, hey, by just training this for longer, we got a small model that's actually producing pretty high quality content for its size. So there were all these woolly sort of animals out there, and we thought it's just too perfect to clone it. And there are all these other things, like the Dalai Lama. I don't know, these are all great names.

That's a good name, yeah. And then are there other things that you can share that you all have coming, in the background at Databricks or your Stanford lab, in terms of this more general area of language models?

Yeah. At Databricks, we're definitely using everything we learned from Dolly, and what we're learning from our customers, to offer a great suite of tools for training and operating LLM applications. We already have a popular MLOps platform, and we also have this open source project called MLflow that integrates with a lot of tools out there and that our offering is built around, so
you can expect some nice integrations into that. Separately, we're also working on Databricks product features that use language models internally, learning a lot from developing those, and feeding that into our products. So I think in the next few months you can expect that, and we also have this big user conference, Data + AI Summit, coming up in June that will probably have a lot of stuff about this. And I would say, as a researcher and also with my Databricks hat on, the thing I'm most excited about is really connecting these models with reliable data sources and making them really produce reliable results. If you use ChatGPT or GPT-4, the two big problems with it are, number one, the knowledge is not up to date, it only knows the stuff it was trained on, and number two, a lot of the things it says are inaccurate. It's confident but wrong in various ways. I think you can tackle both of these by combining some kind of language model with a system that pulls out valid data, either from documents, like a search engine, or from APIs and tables and stuff like that inside your company. For example, when I talk to the chatbot in my bank, it should know my latest bank account balance and transactions and stuff. If I'm like, "Can you cancel the payment I made because I unsubscribed?" it should just know what that means. Cracking how exactly to do that isn't easy. It may actually be easier with small models than with big ones to reduce hallucination from them, but I think it's still an open question. If we can figure this out, then these become a much more reliable component in an application.

Maybe we'll go from there to just projecting a little bit about architecture and research. So much of the industry is focused on model
scaling, right, improving reasoning that way. How much do you think that matters in terms of real-world usage in production with your customers in the near term?

Great question. To me, at least, the relationship between the scale of the model, versus the quality of the data and supervision you put in, versus the design of an application around it, and overall quality is not a hundred percent clear yet. To get a really reliable model that, say, can make a pharmacy prescription or something like that, maybe you need a trillion parameters. Maybe you actually need a really carefully designed data set and supervision process, which is kind of traditional ML engineering type work. Or maybe you actually need a clever application, where you're chaining together a couple of models and things and you're saying, well, does this make sense? Can I find a reference? Can I show this example to a human if it's really hard? So I think it's a little bit open. The thing I can say for sure, and that Dolly and other results like it really highlighted, is that it does seem the core tech is getting commoditized very quickly. If you just want to run something like today's ChatGPT, it will be a lot cheaper, because all these hardware manufacturers are building devices that are specialized and much cheaper. Another thing that's making it less expensive is that we're figuring out ways to get a smaller model, with less data, fewer parameters, and so on, to get similar performance. That, I think, is happening faster than at least I would have thought a few months ago. So at least to get something with today's capabilities, I think it'll be very affordable, and you might just be able to run it locally on your phone or something. The question of how large
can you go, and whether a much larger model is going to be a lot smarter, I think is still a bit unknown. There are people who argue it's going to be very good at reasoning, but at the same time, this kind of token-by-token generation we're doing now is not an amazing format for reasoning, because you have to linearly say one thing at a time. So it's not really good for making plans or comparing versions. I think to get a really smart application, you'll need to combine today's language modeling with some other sort of framework around it that uses it multiple times, or explores a plan space or whatever, and then you might get something good. It's also possible that the very largest models are simply memorizing more stuff. They're impressive in terms of trivia, like I can ask it about some random topic and it'll know, but they're not really smarter at solving even a basic word problem. So I'm not sure. Unfortunately, especially with training from the web, it's often very hard to tell reasoning apart from memorization: essentially, did it see that thing before? So I think actually being able to do experiments where you train these on carefully selected data will lead to a better understanding of what they can do.

Yeah, that makes sense. Maybe we can think a little bit, just because you have great visibility from your role at Databricks, about what else companies need, your enterprise customers or just enterprises generally, to make use of these models, because you said you believe the core technology, the models themselves, are getting commoditized.

Yeah, a data platform that can actually build reliable data, right? We think that's the bread and potatoes of getting anything. You need a basis to build on. So we think that will become really important, and, you
know, maybe data platforms will have to evolve a little bit to be better at supporting unstructured data like text and images, and to do quality assessment and stuff like that for it. That's one piece. I think another piece you need is the MLOps piece of being able to experiment with things, deploy them, A/B test them, see what does better, and improve it incrementally. I also think these models will need a good connection to operational systems inside the company to do really powerful things with the latest data. You probably saw the support for tools in ChatGPT. Before that, there were lots of groups working on models integrated into search engines, and sometimes into calling other tools as well, like calculators. I think it's still a little bit open-ended. There's one extreme where people say the model will figure out what tools to use on its own. I think for enterprise use cases that's a little bit more than you really need. You can give it some tools and feed it that stuff, and it doesn't have to discover them and read the manual to figure out which one to use. But I think that's another piece you'll need for really powerful applications. And then I do think infrastructure, just basic training and serving infrastructure, is important too, when you start to care about performance, like latency and speed. You can see that some of the new search engines using these models are not that fast, right? They're a little bit slow, and it would be nice to have them faster. For automated analytics, it's even more important that it's efficient. So I think there'll be a lot of activity there.

Where do you see enterprises getting the most value to date from investing in, I guess, more traditional ML, and then some of the language model stuff?

Great question. With traditional ML, we're seeing that actually
virtually all major enterprises, in all industries, are using it. It's changed a lot in the past decade, actually. It's very good for forecasting things in general and for automating certain types of decisions. For example, optimizing your supply chain: you don't have time to look at exactly everything that's going on, think about it, and have a meeting. But if you do order the right amount of parts to meet your demand this week, or if you minimize the amount of time an agricultural product sits in a warehouse and degrades in quality, or stuff like that, it matters a lot, and it can have a huge impact on the profitability of a company. So we're seeing a lot of that, people applying it to automate the supply chain and to automate basically the operations in various ways. And then there are more classic cases, like fraud detection and stuff like that, as well. It's always an arms race, and you're trying to do the best you can, because every percent of accuracy by which you do better can translate into huge impact. With language models specifically, and especially with conversational ones, the really exciting thing is interfaces to people. I think customer support is a very obvious one, or maybe things like recommendations or asking questions on a product page in retail. Things like search augmented with this stuff is one. And we've also found that just internal apps in a company that have a lot of internal data can benefit from this kind of thing. One of the things we've built, for example: inside Databricks, we have all these resources for engineers to understand how different parts of the product work, how to operate it, all the APIs, and people used to just ask each other questions in these
Slack channels for each team. We could use that data, the questions and answers, plus the data in the actual documentation, to essentially automatically answer many such questions and just save people a lot of time. So I do think that any app that has business data, or stuff written by humans, in it, like your issue tracker for your software development, or your Salesforce or something like that, could benefit from these kinds of interfaces.

Yeah, it seems like any type of forum or anything else instantly becomes data that you can use to fine-tune or train a model that's specific to your customer support use case. You could use an embedding or something to do interesting things with it. So it seems like there's some really cool stuff to do. Are there any specific areas that Databricks is not focused on that you think would be especially interesting for somebody to build, from a tooling perspective, for enterprises trying to use some of these technologies?

I think there are a lot of these. I think it's very early on. Probably one of the most obvious ones is just domain- or vertical-specific models and tools. I actually think even a lot of the enterprises that have a lot of the data in various domains might turn more into data or model vendors of some form in the future, as they use this to build something that no one else can. So I wouldn't be surprised at all if you see the next wave of companies for, say, security analytics, or biotech, or analyzing financial data, or stuff like that, really built around LLM technology. And I also think in general, in the app development space, how do you develop apps that incorporate these tools? It's very open. It's not clear what the best way to do it is, and you might end up
with really good programming tools that focus on this problem. I would say, for people thinking about startups and so on, you want your startup to have a long-term defensible moat, ideally something that grows over time also. So anything around a unique data set, for example, or a unique feedback interaction you have, is always good. Honestly, even something like adding ML features in your product that just learn from your users and do better recommendations and so on could eventually become a moat, where others just can't easily catch up. But I think anything that's around custom data sets is sort of safest.

When you were working on Spark for your PhD, did you think you'd become a founder? Was your intention to start a company, or did you just think it was interesting research to do, or both?

It really wasn't. I've always been interested in just doing things that have an impact and help people do cool things. I had seen these open source technologies out there for distributed data processing, and I thought, okay, I'll try to start one and see how it goes. I wasn't sure that people would really pick it up and use it, but I wasn't looking to be a founder necessarily. I was just looking to do something useful in this emerging space. And honestly, I was at least considering being a computer science professor, and I thought, if I'm going to be a professor, and all the most exciting computing is happening in data centers today, and I don't know how that works, how am I going to teach computer science to people? So I'd better learn about that stuff. But it turned out to be something more broadly interesting.

What was the most unexpected thing about being a founder?

There are a lot of challenges along the way. I
think just being able to learn about all the aspects of a business, and how much complexity there is in each one. Starting out as a more technical person, at first I didn't really know what to expect, but there's a ton of depth in each one. If you really try to understand them, get to know the culture of the people there, and really get to know what they're thinking about, you can make much better decisions across multiple aspects of your company.

Is there anything that you would advise people coming from a similar background to yours? I have a PhD as well, although it's in biology, and I feel like there are certain things that I learned in academia that were really valuable, and then there's a bunch of stuff I really needed to unlearn as I went into industry. Are there specific pieces of advice you'd give to technical founders or PhD founders in terms of things that they should unlearn?

Well, a lot of research, at least in computer science, the kind of stuff that I've worked on, is mostly prototyping. It's, can we showcase an idea? It's not really software engineering, where we'll build a thing that can be maintained, goes on flawlessly in the future, and supports, you know, problems. So I think you should unlearn just the focus on short-term stuff and think about how this is going to go over time. There is a phase of the company where you're just prototyping to get a good fit, but you should design things so they can eventually evolve into something that's very reliable long term. The other thing is, I think, unlearn trying to invent everything from scratch. You should really be careful about, hey, where am I doing something unique? Or if I'm doing something different from others, why is it right? Don't do it just for kicks. Because in research it's very
tempting to say, I did this new thing, I'm going to try all the fanciest new ideas in each component of it.

Was there something that you guys experimented with, being first-principles unique about it, where you then said, you know, there are systems for this?

A good one early on was just deployment infrastructure, for how we deploy and update our software across all the clouds and so on. We soon realized it's better to go with really standard things like Kubernetes and tools like that than to try to do something custom, because they're evolving very quickly. So that's a good example where at the beginning you say, hey, how hard can it be, let's just build something. But then you realize, wait, every month there's new stuff coming out, and maybe this isn't where we want to focus.

Maybe just thinking about being CTO now of a very large company, how has your lens as a computer science researcher informed your thinking as a CTO?

I think, first of all, as a researcher you do learn to think a lot about the long-term trends: what could things look like 5 or 10 years from now, and what are the fundamental things here? For example, this thing about LLMs being commoditized, or honestly the thing about them maxing out at more parameters, I think many people hadn't really thought about that. But if you think back, there's usually a lot of room to improve efficiency in hardware and software for an application, and this particular application is kind of simple, because it is all basically two or three different types of matrix operations. So it's sort of the hardware designer's dream to do this stuff. Also, there are usually diminishing returns from scale in terms of quality of models in general,
and you can also see it in other areas. In computer vision, for example, we don't have trillion-parameter models; you get actually pretty small models that you can train for specific tasks that are good. Self-driving is another example: they rapidly improved in quality up to a point, and then they kind of plateaued, and they're still not really ready for prime time. Eventually you hit some limits.

You know, there are plenty of people who are researchers in the field who don't really see an asymptote with scaling. So where do you believe that limit comes from: parameters, compute, data, something else?

I just think a lot of things scale sublinearly in general. Now, it's hard to tell for things like reasoning and so on, but certainly in classical machine learning, for example if you're trying to learn a function that separates positive and negative examples, as you add more data your accuracy doesn't really improve linearly. With a few examples you get a pretty good estimate of that boundary, and then with more of them it gets a little bit better, but it doesn't get that much better. So I think it's common; that would be my main reason. Now, with language models specifically, I think the part that does go linearly with more parameters, or should, is the ability to just memorize more stuff. So if you want it to tell you who was on the fifth episode of Friends and what was the second line they said and stuff like that, yeah, more parameters will get you a neural network that, just by putting in that input, can tell you that stuff. But that wasn't that interesting to me, because I think the right solution for that is to look things up in a database: do retrieval, do a search index. Actually, I think from a computation perspective
it's very inefficient to have a trillion parameters and have to actually load them and add and multiply by them each time you make an inference, because they're just encoding knowledge, most of which you don't need for that inference. So that one I wasn't as excited about. But there are people who are just excited about neural networks. This is the same kind of people who wonder how brains work and how animals learn, who are just excited about, wait, I only had some neurons and I put in this stuff and it remembered it. But as an engineer I'm not that excited, because I'm like, yeah, I could have built a database that did that. But in terms of, hey, I just trained a network with gradient descent and it did it, that is kind of cool.

Yeah, I feel like people are almost the opposite way, where we're actually quite bad at memorization and very good at inferring things, and so it's interesting to ask what the basis for that is computationally.

Yeah. But the other thing that we're learning from this is that it does seem that the type of data you put in, and the kind of fine-tuning, which is essentially weighting the data, has a lot of impact. So with this instruction tuning stuff, really we have only a few examples of instruction following, but since we do fine-tune the model, it's as if we put a very high weight on them and had lots of examples of that in our training set. And I think it's still an open question: for example, if you made a lot of examples of logical puzzles, you just generate some problems and solutions, would you get a model that's better at logical reasoning? There are other things you can do. I also think a big problem with current models, I think I hinted at this before, is that we're just calling them to generate one token at a time. So for example, you've probably seen this chain-of-thought reasoning thing: if you ask a
model a math problem and it just tries to answer, like how many sheep were there, it might say seven or something, and then it tries to make up the explanation and it's wrong. But if you tell it to do the explanation first, think step by step, and then answer, it's more likely to be right. But you can imagine other versions of that: if it had a scratch pad, if it had a way to backtrack and say, you know, this is kind of a dead end, it might become better. So I think stuff like that, that's kind of around the model, where it's still an AI system but it's not just one giant DNN, can further improve its abilities.

Yeah, and you've seen that work in really complex and impressive ways. We had Noam Brown from the Cicero group on, and they have planning as part of it, versus just one very large model expected to do all the reasoning. But before I interrupted you, I'm sorry, you were actually saying that you basically make sometimes controversial long-term predictions about what's going to happen, like there's an asymptote in the value of scale. How does that impact your decisions as CTO?

So especially as a company grows, it actually becomes slower to change direction super dramatically, so you really want to think about what we will do long term. Our CEO Ali has this decision rule: with any decision, I ask, hey, which one am I more likely to regret five years from now, not five months from now, like if I don't do this or whatever, what's going to happen? So you try to think about where things are going to go. Of course, you do want to collect data and update your thoughts about it, test hypotheses, and that I think is something you can get from research too. In research we always think, when we have an idea, it's sort of a race to
figure out, is it a good idea, and can I publish it, because the research community values novelty a lot, being the first to do something, for better or worse. It's not amazing, but if you just reproduce a thing that someone else did, unfortunately you don't get as much credit. So we do think about how we can quickly validate something. But at the same time, even in research I had the same thing: you try to pick topics that will matter. For example, when I was doing my PhD, I didn't do a ton with machine learning. I knew people who did it, I helped them out, I built infrastructure, but I didn't do ML research myself. And then later I kind of decided, yeah, I am going to do some things, especially around connecting machine learning to external data sources like search engines. And I know it's going to take a while to really learn about it and get an intuition and stuff, but I think this is going to matter long term, because I think the local parsing and semantics of what the sentence means is kind of solved already, and the interesting thing will be doing this in a bigger system.

Yeah, I have four degrees and no PhD. I've never contributed anything to the corpus of the world's knowledge. Elad, gotta ask, does it affect how you do investing?

No, not really, the PhD.

Nice.

Yeah, I don't know. I have a math degree as well, and I feel like that actually was a thing that forced me to think slightly differently, or at least it forced a very logical way of thinking. I felt like there's a groove in your brain for logic that gets carved.

Yep.

So that probably helped, but who knows, I don't know. You've been working in data and machine learning for a long time. Where do you think we are in this generation of AI?

Yeah, I think we're still at the early stages of AI on unstructured data, so things like text and images and so on, really having an impact in
applications. So I think ChatGPT-related features that every application is going to add will change the way we work with computing, and they'll also change data analytics to some extent, because you'll be able to use this data. And honestly, I also think that in terms of building just basic data infrastructure and ML infrastructure, we're still pretty early also. It's still many different tools you have to hook together, a lot of complex integration, and you need a lot of specialized people to do it. Over time, I increasingly think that, especially because of the capabilities of these AI models, every software engineer will need to become an ML engineer and a data engineer also as they build the application, and we'll figure out ways of doing that, recipes or abstractions or whatever, that are actually easy enough for everyone to do. One analogy I like is, when I was learning programming, which was sort of mid-to-late 90s, I got these books on web applications and it was very complicated. There was a book on MySQL, there was a book on Apache web server, CGI-bin, all these things you have to hook together. And now most developers can make a web application in like one function, and even non-programmers can make something with Google Forms or Salesforce or whatever that basically is a custom application. So I think we're far away from that in data and ML, but it could sort of look like that. It's harder because it depends on this sort of static data that you've got sitting around, but I do think there are going to be a lot more of these applications.

Yeah, let me tell you, this is a great conversation. Thanks for joining us on No Priors.

Thanks a lot, Sarah and Elad. Thanks so much.
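
The point made in the conversation about classical machine learning, that accuracy does not improve linearly as you add examples when learning a boundary between positive and negative examples, can be illustrated with a small simulation. This is only a sketch under assumptions that are not from the conversation: a one-dimensional boundary at 0.5, uniformly distributed inputs, noise-free labels, and the hypothetical helper names `estimate_boundary` and `mean_accuracy`.

```python
import random


def estimate_boundary(n, rng, true_boundary=0.5):
    """Estimate a 1-D decision boundary from n labeled points in [0, 1].

    Points at or below true_boundary are negative, points above are positive.
    The estimate is the midpoint between the largest negative example and
    the smallest positive example.
    """
    xs = [rng.random() for _ in range(n)]
    negatives = [x for x in xs if x <= true_boundary]
    positives = [x for x in xs if x > true_boundary]
    # Fall back to the ends of the interval if one class is missing.
    lo = max(negatives) if negatives else 0.0
    hi = min(positives) if positives else 1.0
    return (lo + hi) / 2


def mean_accuracy(n, trials=500, seed=0, true_boundary=0.5):
    """Average accuracy of the learned threshold over repeated trials.

    For uniform inputs, the misclassified region is exactly the gap between
    the estimated and the true boundary, so accuracy = 1 - |estimate - truth|.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        estimate = estimate_boundary(n, rng, true_boundary)
        total += 1.0 - abs(estimate - true_boundary)
    return total / trials


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, round(mean_accuracy(n), 4))
```

In this toy setting a handful of examples already places the boundary fairly well, and each tenfold increase in data buys a smaller accuracy gain than the one before it. Whether the same shape holds for capabilities such as reasoning is left open in the conversation itself.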

Original Description

If you have 30 dollars, a few hours, and one server, then you are ready to create a ChatGPT-like model that can do what’s known as instruction-following. Databricks’ latest launch, Dolly, foreshadows a potential move in the industry toward smaller and more accessible but extremely capable AIs. Plus, Dolly is open source, requires less computing power, and fewer data parameters than its counterparts. Matei Zaharia, Cofounder & Chief Technologist at Databricks, joins Sarah and Elad to talk about how big data sets actually need to be, why manual annotation is becoming less necessary to train some models, and how he went from a Berkeley PhD student with a little project called Spark to the founder of a company that is now critical data infrastructure that’s increasingly moving into AI.

00:00 - Introduction
01:29 - Origin of Databricks
04:30 - Work at Stanford Lab
05:29 - Dolly and Role of Open Source
12:30 - Industry focus on high parameter count, understanding reasoning at small model scale
18:42 - Enterprise applications for Dolly & chat bots
25:06 - Making bets as an academic turned CTO
36:23 - The early stages of AI and future predictions

Databricks' CTO Matei Zaharia discusses the potential of smaller and more accessible AI models and the limitations of current large language models. He highlights the need for more advanced AI systems that can reason and make decisions, and the importance of long-term thinking and making decisions that will not be regretted in the future. The video teaches viewers about the current state of LLMs, their limitations, and the potential for future advancements.

Key Takeaways

Build a ChatGPT-like model using Dolly
Understand the limitations of current LLMs
Design more advanced AI systems that can reason and make decisions
Fine-tune LLMs for specific tasks
Craft effective prompts for LLMs
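
The prompting idea from the transcript, asking a model to write its explanation first, think step by step, and only then answer, can be sketched as plain prompt construction. Everything here is an illustrative assumption: the function names, the exact prompt wording, and the `Answer:` line convention are hypothetical, and no model or API is called.

```python
from typing import Optional


def direct_prompt(question: str) -> str:
    """Prompt that asks for the answer immediately, with no reasoning."""
    return f"Question: {question}\nGive only the final answer."


def step_by_step_prompt(question: str) -> str:
    """Prompt that asks for the explanation before the answer."""
    return (
        f"Question: {question}\n"
        "Think step by step and write out your explanation first.\n"
        "Then give the final answer on its own line, starting with 'Answer:'."
    )


def extract_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' line, or None if absent."""
    for line in reversed(completion.splitlines()):
        stripped = line.strip()
        if stripped.lower().startswith("answer:"):
            return stripped.split(":", 1)[1].strip()
    return None


if __name__ == "__main__":
    question = "A farmer has 3 pens with 4 sheep each. How many sheep are there?"
    print(step_by_step_prompt(question))
    # A made-up completion, used only to show how the answer line is parsed.
    sample = "Each pen has 4 sheep and there are 3 pens, so 3 * 4 = 12.\nAnswer: 12"
    print(extract_answer(sample))
```

Keeping the final answer on a marked line makes it easy to separate the reasoning text from the result, which is one simple way to build the kind of scaffolding "around the model" that the conversation describes.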

💡 The current generation of AI is still in its early stages, and we are just starting to see the impact of AI on unstructured data. There is a need for more advanced AI systems that can reason and make decisions, and for more accessible and scalable AI models.

Chapters (8)

0:00 Introduction

1:29 Origin of Databricks

4:30 Work at Stanford Lab

5:29 Dolly and Role of Open Source

12:30 Industry focus on high parameter count, understanding reasoning at small model scale

18:42 Enterprise applications for Dolly & chat bots

25:06 Making bets as an academic turned CTO

36:23 The early stages of AI and future predictions
