Powering your Copilot for Data - with Artem Keydunov from Cube.dev

Latent Space · Beginner · 📊 Data Analytics & Business Intelligence · 2y ago

Skills: LLM Foundations 80% · Data Literacy 70% · Prompt Craft 60% · Fine-tuning LLMs 50%

Key Takeaways

The video discusses the semantic layer for analytics as built by Cube (cube.dev), and its applications in natural language queries, data transformation, and embedded analytics. It traces the evolution of text-to-SQL from the regex-based Statsbot of 2016 to today, and the role of LLMs in making these queries more reliable when they run against a semantic layer.
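
As a rough illustration of the semantic-layer idea discussed in the episode, the sketch below defines a metric once and compiles a small request (a measure, a dimension, optional filters) into warehouse SQL. The `SemanticModel` class, its fields, and the table and column names are all hypothetical; this is not Cube's API, only a minimal Python sketch of the pattern.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SemanticModel:
    """Hypothetical, minimal stand-in for a semantic layer (not Cube's API)."""
    from_clause: str            # base table plus joins, written once
    measures: Dict[str, str]    # metric name -> aggregate SQL expression
    dimensions: Dict[str, str]  # dimension name -> column SQL expression

    def compile(self, measure: str, dimension: str,
                filters: Optional[Dict[str, str]] = None) -> str:
        """Turn a small semantic query into warehouse SQL."""
        filters = filters or {}
        # Reject names that were never defined instead of guessing at columns.
        if measure not in self.measures:
            raise ValueError(f"unknown measure: {measure}")
        for name in [dimension, *filters]:
            if name not in self.dimensions:
                raise ValueError(f"unknown dimension: {name}")

        dim_sql = self.dimensions[dimension]
        lines = [
            f"SELECT {dim_sql} AS {dimension}, {self.measures[measure]} AS {measure}",
            f"FROM {self.from_clause}",
        ]
        if filters:
            # Values are inlined for readability only; real code should use
            # bound parameters rather than string formatting.
            conditions = [f"{self.dimensions[k]} = '{v}'" for k, v in filters.items()]
            lines.append("WHERE " + " AND ".join(conditions))
        lines.append(f"GROUP BY {dim_sql}")
        return "\n".join(lines)


# Assumed example schema: customers joined to their orders.
model = SemanticModel(
    from_clause="customers JOIN orders ON orders.customer_id = customers.id",
    measures={"revenue": "SUM(orders.amount)"},
    dimensions={"country": "customers.country", "status": "orders.status"},
)

# The request stays tiny; the join logic lives in the model definition.
sql = model.compile("revenue", "country", filters={"status": "completed"})
print(sql)
```

Because the caller (a person or an LLM) can only name measures and dimensions that were defined up front, a misspelled or invented metric fails loudly rather than producing plausible but wrong SQL.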

Full Transcript

[Music]

Hey everyone, welcome to the Latent Space podcast. This is swyx, writer and editor of Latent Space and founder of Smol AI, and Alessio, partner and CTO-in-residence at Decibel Partners.

Hey everyone, and today we have Artem Keydunov on the podcast, co-founder of Cube. Hey, Artem.

Hey Alessio, hey swyx. Good to be here today, thank you for inviting me.

Yeah, thanks for joining. For people that don't know, I've known Artem for a long time, ever since he started Cube. Cube is actually a spin-out of his previous company, which is Statsbot, and this kind of feels like going both backward and forward in time. The premise of Statsbot was having a Slack bot that you could ask basic text-to-SQL questions in Slack, and this was six or seven years ago, something like that, so a lot ahead of its time, and you see startups trying to do that today. Then Cube came out of that as a part of the infrastructure that was powering Statsbot, and Cube then evolved from an embedded analytics product to the semantic layer, and just an awesome open source evolution. I think you have over 16,000 stars on GitHub today, and you have a very active open source community. But maybe for people at home, just give a quick lay of the land of the original Statsbot product: what got you interested in text-to-SQL, and what were some of the limitations that you saw then that you're also seeing today in the new landscape?

I started Statsbot in 2016. The original idea was to make a sort of side project based on an initial project that I did at the company I was working for back then. I was working for a company that was building software for schools, and we were using Slack a lot. Slack was growing really fast, and a lot of people were talking about Slack, Slack apps, chatbots in general. So I think it was another wave of bots and all that. We have one more wave right now, but it always comes in waves, so we were living through one of
these waves, and I wanted to build a bot that would give me information from the different places where data lives, in Slack. So it was some developer data like New Relic, maybe some marketing data from Google Analytics, then just regular data like production databases, Salesforce sometimes. I wanted to bring it all into Slack because we were always chatting in Slack, and I wanted to see some stats in Slack. So that was the idea of Statsbot: bring stats to Slack. I built that as a first sort of side project and published it on Reddit, and people started to use it even before Slack came up with the Slack application directory. So it was a bit of a hackish way to install it, but people were still installing it, so it was a lot of fun. Then Slack came up with that application directory, and they reached out to me and wanted to feature Statsbot, because it was already one of the more widely used bots on Slack. So they featured me on the application directory front page, and I just got a lot of new users signing up. It was a lot of fun. But there was a big limitation in terms of how you can process natural language, because the original idea was to let people ask questions directly in Slack, like, "Hey, show me my opportunities closed last week," or something like that. My co-founder, who started helping me with this Slack application, and I were trying to build a system to recognize that natural language, but we didn't have LLMs back then and all of that technology. So it was really hard to build the system, especially a system that can keep talking to you and maintain some sort of a dialogue. It was a lot of one-off requests, and it was a lot of hit and miss. If you know how to construct a query in natural language, you will
get a result back, but it was not a system that was capable of asking follow-up questions to try to understand what you actually want, and then finally bringing all this context together to generate a SQL query, get the result back, and all of that. So that was the really missing part, and I think right now that's the difference. Right now I'm kind of bullish that if I started Statsbot again, I would probably have a much better shot at it, but back then that was a big limitation. The funny thing is that we built Cube as we were working on Statsbot, because we needed it.

Yeah. What was the ML stack at the time? Were you trying to build your own natural language understanding models? Were there open source models that were good that you were trying to leverage?

I think it was mostly a combination of a bunch of things, and we tried a lot of different approaches. The first version, which I built, was regex. They were working well.

This is the same as I did. I did option pricing when I was in finance, and I had a natural language pricing tool, and it was regex. It was just a lot of regex.

Yeah, yeah. And then my co-founder Pavel joined me. He's much smarter than I am, he has a PhD in math and all that, and he started to do some stuff where I was like, "No, you just do that stuff, I don't know. I can do regex." He started to do some models, trying to either look at what we had on the market back then or try to build a different sort of model. Again, we didn't have any foundation models in place back then, right? We wanted to try to use existing math, obviously, but it was not something where we could take a model and just try to run it. I think in 2019 we started to see more stuff, more of an
ecosystem being built, and then it eventually resulted in all these LLMs, what we hear about right now. But back then, in 2016, there was not much available for people to build on top of. There was some academic research happening, but it was very, very early for something you could actually use.

And then that became Cube, which started just as an open source project. I think I remember going on a walk with you in San Mateo in 2020, something like that, and you were telling me you had people reaching out to you saying, "Hey, we use Cube in production, I just need to give you some money," even though you guys were not a company. What's the story of Cube then, from Statsbot to where you are today?

We built Cube at Statsbot because we needed it. The whole Statsbot stack was that we first tried to translate the natural language query into some sort of multi-dimensional query. We were trying to understand: okay, people want to get active opportunities, right? What does it mean? Is it a metric? What is a dimension here? Because usually in analytics you always try to reduce everything down to a sort of multi-dimensional framework. So that was the first step, and that's where it didn't really work well, because of the limitation of us not having foundational technologies. But then from the multi-dimensional query we wanted to go to SQL, and that's what the semantic layer was, and what Cube essentially was. So we built a framework where you would be able to map your data into these concepts, into these metrics, because when people were coming to Statsbot they were bringing their own datasets, right? And the big question was: how do we tell the system what active opportunities means for that specific user? How do we provide that context, how do we do the training? So that's why we came up with the idea of
building the semantic layer, so people can actually define their metrics and then use them with Statsbot. That's how we built Cube. But at some point we saw people starting to see more value in Cube itself, building the semantic layer and then using it to power different types of applications. So in 2019 we decided, okay, it feels like it might be a standalone product, and a lot of people want to use it, so let's just try to open source it. So we took it out of Statsbot and open sourced it.

Can I make sure that everyone has the same foundational knowledge? The concept of a cube is not something that you invented. I think not everyone has the same background in analytics and data that all three of us do. Maybe you want to explain OLAP cube, hypercube, whatever, the brief history of cubes?

I'll try. I know there are a lot of Wikipedia pages and a lot of blog posts trying to go into the academics of it, so I'm trying to...

Cubes according to you.

Yeah, yeah. So when we think about just a table in a database, the problem with a table is that it's not multi-dimensional, meaning that in many cases, if we want to slice the data, we kind of need to end up with a different table. Think about when you're writing a SQL query to answer one question: a SQL query always ends up with a table, right? So you write one SQL query, you get one table; then, to answer a different question, you write a second query, so you're getting a bunch of tables. Now let's imagine that we can bring all these tables together into a multi-dimensional table, and that's essentially a cube. It's just a way that we can have measures and dimensions that can potentially be used at the same time from different angles.

And so initially a lot of your use cases were more BI related, but you recently released a LangChain integration. There's obviously more and more
interest in, again, using these models to answer data questions. You've seen the ChatGPT Code Interpreter, which is renamed as Advanced Data Analysis. So what's the future of the semantic layer in AI? What are some of the use cases that you're seeing, and what do you think is a good strategy to make it easier to do now the text-to-SQL you wanted to do seven years ago?

Yeah. When it started to happen, I was just like, "Oh my God, people are now building Statsbot with Cube." They just have a better technology for natural language. So it made sense to me from the first moment I saw it. I think it's something that's happening right now, and the chatbot is one of the use cases. If you try to generalize it, the use case would be: how do we use structured or tabular data with AI models? How do we take the data, give context to the data, and then bring it to the model, so the model can give you answers, ask questions, do whatever you want? The question is how we go from just data in your data warehouse or database, which is usually just tabular data in SQL-based warehouses, to some sort of context that the system can use. And if you're building this application, you have to do it. There is no way you can get around doing this: you either map it manually, or you come up with some framework or something else. So my take is that the semantic layer is just a really good place for this context to live, because you need to give this context to the humans and you need to give that context to the AI system anyway, right? So that's why you define a metric once, and then you teach your AI system what this metric is about.

What are some of the challenges of using tabular versus language data, and some of the ways that having the semantic
layer kind of makes that easier?

Maybe imagine you're a human, right? You're a new data analyst at the company, and people just give you a warehouse with a bunch of tables and tell you, "Okay, just try to make sense of this data." You're going through all of these tables and really trying to make sense of them without any additional context. Some columns in many cases might have weird names. Sometimes, if they follow some kind of star schema, Kimball-style dimensions, maybe that would be easier, because you would have facts and dimensions. But it's still hard to understand and make sense of, because it doesn't have descriptions, right? And there is a whole industry of data catalogs that exists because their whole purpose is to give context to the data so people can understand it. I think the same applies to AI, and it's the same challenge: if you give it pure tabular data, it doesn't have this sort of context that it can read. So you sort of need to write a book, or an essay, about your data and give that book to the system so it can understand it.

Can you run through the steps of how that works today? The initial part is the natural language query. What are the steps that happen in between: model to semantic layer, semantic layer to SQL, and all that flow?

The first key step is to do some sort of indexing. That's what I was referring to: write a book about your data, right? Describe in a text format what your data is about: what metrics it has, what dimensions, what the structure is, what the relationships between these metrics are, what the potential values of the dimensions are. So build a really good index as a text representation, and then turn it into embeddings in your vector storage. Once you have that, then you can sort of provide
it as context to the model. There are a lot of options, either fine-tuning or sort of in-context learning, but somehow give that as context to the model, right? And once the model has this context, it can create a query. Now, the query, I believe, should be created against the semantic layer, because it reduces the room for error. What usually happens is that your query to the semantic layer would be very simple: "give me that metric, grouped by that dimension, and maybe that filter should be applied." Your real query for the warehouse might have five joins and a lot of different techniques, like how to avoid fan-outs, fan traps, chasm traps, all of that stuff. And the bigger the query, the more room there is for the model to make an error, right? Sometimes it could be a small error, and then your numbers are going to be off. Making the query against the semantic layer reduces that error. So the model generates a SQL query and executes it against the semantic layer, the semantic layer executes it against your warehouse and then sends the result all the way back to your application. And that can be done multiple times, because what we were missing with Statsbot was this ability to have a conversation with the model. You can ask a question, and then the system can ask follow-up questions, then do a query to get some additional information, based on this information do a query again, and it can keep doing this, and then eventually maybe give you a big report that consists of a lot of data points. But the whole flow is that it knows the system, it knows your data, because you already did the indexing, and then it queries the semantic layer instead of the data warehouse directly.

Maybe just to make it a little clearer for people that haven't used a semantic layer before: you can
have definitions like revenue, where revenue is like select from customers, join orders, and then sum the amount of orders. But in the semantic layer you're kind of hiding all of that away. So when you do natural language to Cube, it's just "select revenue from last week," and then it turns into a bigger query.

One of the biggest difficulties around the semantic layer, for people who have never thought about this concept before: this all sounds super neat until you have multiple stakeholders within a single company who all have different concepts of what revenue is. They all have different concepts of what an active user is. And so they'll have, you know, revenue revision one by the sales team, and then revenue revision one by the accounting team or the tax team, I don't know. I feel like I always want semantic layer discussions to talk about the not-so-pretty parts of the semantic layer, because this is where, effectively, you ship your org chart in the semantic layer.

The way I think about it is that at the end of the day the semantic layer is a code base. In Cube it's essentially a code base, right? It's just a set of YAML files with Python. I think code is never perfect. We know that as software engineers, right? It's never going to be perfect. You will have a lot of revisions of code. We have version control, which makes it easier to manage your revisions. So I think we should treat our metrics and the semantic layer as code, right? And then collaboration is a big part of it. If there are multiple teams that have different opinions, let them collaborate on the pull request. They can discuss why they think something should be calculated differently and have an open conversation about it, where everyone can just discuss it, like an open source community, right? You go on GitHub and you talk about why that code is written the way it's written, or whether it should be written differently,
and then hopefully at some point you can come to some definition. Now, if you still must have multiple versions, it's code, right? So you can still manage it. But I think the big part of that is that we really need to treat it as a code base. Then it makes a lot of things easier. Not as spreadsheets, you know, like hidden Excel files.

The other thing is having the definitions shared across the organization, versus everybody trying to come up with their own thing. But yeah, I'm sure that when you talk to customers there are people that have issues with the product, and it's really two people trying to define the same thing: one in sales that wants to look good, the other is the finance team that wants to be conservative, and they all have different definitions. How important is the natural language to people? Obviously you guys both work in modern data stack companies, either now or before. There's been the whole wave of empowering the data professionals. I think now a big part of the wave is removing the need for data professionals to always be in the loop, and having non-technical folks do more of the work. Are you seeing that as a big push too with these models, allowing everybody to interact with the data? Any customer stories you can share, anything like that?

I think it's a multi-dimensional question. That's an example of where you have a lot inside the question. In terms of examples, I think a lot of people are building different agents and chatbots. We have a company that built an internal Slack bot that answers questions based on the data in a warehouse, and then a lot of people go in and ask that chatbot questions. Is it a real use case? Maybe. Is it still a toy pet project? Maybe too. Right now I think it's really hard to tell them apart at this point,
because there is a lot of hype, and people are building LLM stuff because it's cool and everyone wants to build something, at least a pet project. That's what's happening in our community as well. We see a lot of people building a lot of cool stuff, and it will probably take some time for that stuff to mature and to see what the real best use cases are. From what I've seen so far, one use case is building this chatbot, and we even have one company that is building it as a service. They essentially connect to the Cube semantic layer and then offer their chatbot, so you can use it on the web or in Slack, and it can answer questions based on data in your semantic layer. I also see a lot of things like this just being built in-house. And there are use cases around some sort of automation, where the agent checks on the data and then performs some actions based on changes in the data. But the other dimension of your question is: will it replace people or not? From what I see so far in data specifically, there are a few use cases of LLMs. I don't see Cube being part of this first use case, but it's more like a copilot for a data analyst, a copilot for a data engineer, where you develop something, you develop a model, and it can help you write SQL or something like that. So it can create boilerplate SQL, and then you can edit this SQL, which is fine because you know how to edit SQL, right? So you're not going to make any mistake, but it will help you generate a bunch of SQL that you write again and again, boilerplate code. So sort of a code copilot use case. I think that's great, and we'll see more of it. I think every platform that is building for data engineers will have some sort of copilot capabilities, Cube included. We're building these copilot capabilities to help people build semantic layers more easily. I think that's just a baseline
for every engineering product right now, to have some sort of copilot capabilities. Then there is another use case, a little bit more where Cube is involved: how do we enable access to data for non-technical people through natural language as an interface to data? Visual dashboards and charts have always been the interface to data in every BI tool. Now I think we will see a second interface, which is natural language. So I think at this point many BI tools will add it as a commodity feature. Tableau will probably have a search bar at some point saying, "Hey, ask me a question." I know that some of them, like AWS QuickSight, are about to announce features like this in their BI, and I think Power BI will do that, especially with their deal with OpenAI. So every BI tool will have some sort of search capabilities built in. I think that's just going to be a baseline feature for them as well. But that's where Cube can help, because we can provide that context, right?

Do you know how, or do you have an idea for how, these products will differentiate once you get the same interface? Right now there's, you know, Tableau is super complicated and [unclear]. Do you just see everything will look the same, and then how do people differentiate?

They all have a line chart, right, and they all have a bar chart, so I feel like it's pretty much the same. I think the BI market is going to stay fragmented as well. Every major vendor, and most of the vendors, will try to have some sort of natural language capabilities, and they might be a little bit different. Some of them will try to position the whole product around it, some of them will just have it as a checkbox, right? So we'll see. But I don't think it's going to be something that will change the BI market, something that can take the BI market and make it more
consolidated rather than what we have right now. I think it will still remain fragmented.

Let's talk a bit more about application use cases. People also use Cube for analytics in their product, like dashboards and things like that. How do you see that changing, especially when it comes to agents? There are a lot of people trying to build agents for reporting, building agents for sales. If you're building a sales agent, you need to know everything about the purchasing history of the customer, all of these things. Any thoughts there? What should all the AI engineers listening think about when implementing data into agents?

Yeah, I think you're trying to solve for two problems. One is how to make sure that the agent, or the LLM, has enough context about the data, and also how we deliver updates to that context, which is also important because data is changing, right? So every time we change something, we need to make sure we update that context in our vector database or something. And the other is how you make sure that the queries are correct. I think that's obviously a big pain in this whole AI space right now: how do we make sure that we don't provide the wrong answers? Being able to reduce the room for error as much as possible is what I would look for, to try to minimize potential damage. And then, yeah, I feel like our use case for Cube has been that it's been used a lot to power customer-facing analytics. I don't think that's going to change much. I feel like, again, more and more products will adopt natural language interfaces as part of the product as well. So we would be able to power these pieces, not only charts and visual dashboards but also some sort of summaries. You
know, probably in the future you're going to open a page with some sort of stats and you will have a smart summary generated by AI, and that summary can be powered by Cube, right? Because the rest is already being powered by Cube.

You know, we had Linus from Notion on the pod, and one of the ideas he had that I really like is thumbnails of text: how do you compress knowledge and then start to expand it? A lot of that comes into dashboards, where you have a lot of data, you have a lot of charts, and sometimes you just want to know, "Hey, this is the three-line summary of it." So yeah, it makes sense that you'd want to see that. How are you thinking about the evolution of the modern data stack, in quotes, whatever that means today? What's the future of what people are going to do? What's the future of what models and agents are going to do for them? Do you have any thoughts?

I feel like the modern data stack sometimes is not very connected. I mean, there's obviously a big crossover between the AI ecosystem, the AI infrastructure ecosystem, and data, but I don't think it's a full overlap. When I'm looking at a lot of what's happening in the modern data stack, where we use warehouses, we use BI, different transformation tools, catalogs, data quality tools, ETLs, all of that, I don't see a lot of it being impacted by AI specifically. I think that space is being impacted as much as any other space, in terms of, yes, we'll all have copilot capabilities, some AI capabilities here and there. But I don't see anything being dramatically changed or shifted because of the AI wave. In terms of just the general data space, I think in the last two or three years we saw an explosion, right? We got a lot of tools, every vendor for every problem. I
feel like right now we should go through a cycle of consolidation. I mean, if Fivetran and dbt merge, they can be the Alteryx of a new generation or something like that, and probably some ETL tool too. I feel it might happen. It's just natural waves, in cycles.

I wonder if everybody is going to have their own copilot. The other thing I think about these models is, you know, swyx was at Airbyte, and yeah, there's Fivetran versus Airbyte [unclear]. A lot of times these companies are doing the syntax work for you of building the integration between your data store and the app or another data store. I feel like now these models are pretty good at coming up with the integration themselves, using the docs to then connect the two. So I'm really curious what that will look like in the future. And same with data transformation. You think about dbt and some of these tools: right now you have to create rules to normalize and transform data, but in the future I could see you explaining to the model how you want the data to be, and then the model figuring out how to do the transformation. But yeah, I think it all needs a semantic layer as far as figuring out what to do with it, you know, what the data is for, where it goes.

Yeah, I think many of these workflows will be augmented by some sort of copilot. You can describe what transformation you want to see, and it can generate a boilerplate of the transformation for you, or even generate a boilerplate of a specific ETL driver or ETL integration. I think we're still maybe not at the point where this code can be fully automated, so we still need a human in the loop who can use this copilot. But in general, I think, yeah, data work and software engineering work can be augmented
quite significantly with all that stuff.

I think the other important thing with data, too, is that the big thing with machine learning before was, "Well, all of your data is bad," you know, the data is not good for anything. And I think now, at least with these models, they have some knowledge of their own, and they can also tell you if your data is bad, which I think is something that you didn't have before. Any cool apps that you've seen being built on Cube, any kind of AI-native things that people should think about, new experiences, anything like that?

Well, I see a lot of Slack bots. They all remind me of Statsbot, but I played with a few of them and they're much, much better than Statsbot. It feels like it's on the surface, right? It's just that use case that you really want. Think about it: you're a data engineer in your company, and everyone is pinging you asking, "Hey, can you pull that data for me?" And you would be like, "Can I build a bot to replace myself?" So they will ping that bot instead. That's why a lot of people are doing this. So I think it's the first use case that people are actually playing with. But I think inside that use case people get creative. I see bots that can actually have a dialogue with you. You would come to that bot and say, "Hey, show me metrics," and the bot would be like, "What kind of metrics? What do you want to look at?" You would say, "Active users," and then it would be like, "How do you define active users? Do you want to see active users by some sort of cohort? Do you want to see active users' behavior changing over time?" A lot of follow-up questions. So it tries to understand what exactly you want, and that's how many data analysts work, right? When people try to ask you something, you
always try to understand what exactly they mean, because many people don't know how to ask correct questions about your data. It's a sort of interesting spectrum. On one side of the spectrum you know nothing, you're just like, "Hey, show me metrics," and on the other side of the spectrum you know how to write SQL and you can write the exact query against your data warehouse, right? Many people sit a little bit in the middle. And the data analysts, they usually have the knowledge about your data, and that's why they can ask follow-up questions to understand what exactly you want. I saw people building bots who can do that, and that part is amazing. I mean, generating SQL, all of that stuff, it's okay, it's good, but when the bot can actually act like it knows your data and can ask follow-up questions, I think that's great.

Yeah. Are there any issues with the models and the way they understand numbers? One of the big complaints people have is that GPT, at least 3.5, cannot do math. Have you seen any limitations and improvements? And also, when it comes to what model to use, do you see most people use GPT-4 because it's the best at this kind of analysis?

I think I saw people use all kinds of models, to be honest. It's usually GPT. I mean, inside GPT it could be 3.5 or 4, right? But it's not like I see a lot of something else, to be honest. Maybe some open source alternatives, but it feels like the market is being dominated by just ChatGPT, which is probably true. In terms of the problems, I've been chatting about it with a few people. If math is required, they try to do the math outside of ChatGPT itself, so it would be some additional Python scripts or something. When we're talking about production-level use cases, it's quite a lot of Python code around your model to make it
work, to be honest. It's not that magic that you just throw the model in and it can give you all the answers. For a toy use case, like the one we have on our demo page or something, it works fine, it's great. But if you want to do a lot of post-processing, do math on your own, you probably need to code it in Python anyway. That's what I see people doing.

Yeah, we heard the same from Harrison and LangChain, that most people just use OpenAI. We did an "OpenAI has no moat" emergency podcast, and it was funny to just see the reaction that people had to that and how hard it actually is to break down some of the monopoly. What else should people keep in mind? You're kind of at the cutting edge of this. If I'm looking to build a data-driven AI application, I'm trying to build data into my AI workflows, any mistakes people should avoid? Any tips on the best stack to use, what tools to use?

I would just recommend going to a warehouse as soon as possible. I think a lot of people feel that MySQL can be a warehouse, which maybe it can be at a lower scale, but definitely not from a performance perspective. So starting with a good warehouse, a query engine, a lakehouse, that's probably something I would recommend from day zero. And there are ways to do it very cheaply with open source technologies too, especially in the lakehouse architecture. I'm biased, obviously, but I'd recommend using a semantic layer, preferably Cube, for context. Other than that, I feel it's a very interesting space in terms of the AI ecosystem. I see a lot of people using LangChain right now, which is great, and we built an integration. But I'm sure the space will continue to evolve and we'll see a lot of interesting tools, and maybe some tools would be a better fit for a job. I'm not aware of any right
now, but it's always interesting to see how it evolves. It's also a little unclear how all the infrastructure around actually developing, testing, documenting, all that stuff, will evolve too. But again, it's just really interesting to see and observe what's happening in the space.

Okay, so before we go to the lightning round, I wanted to ask your thoughts on embedded analytics. In a sense, the kind of chatbots that people are inserting on their websites and building with LLMs is very much end-user programming, or end-user interaction with their own data. I love seeing embedded analytics, and for those who don't know, embedded analytics is basically user-facing dashboards where you can see your own data, right? Instead of the company seeing data across all their customers, it's an individual user seeing their own data as a slice of the overall data that is owned by the platform they're using. So I love embedded analytics, but overwhelmingly the observation that I've had is that people who try to build in this market fail to monetize, and I was wondering about your insights on why.

I think overall the statement is true. It's really hard to monetize in embedded analytics. That's why at Cube we're more excited about the internal BI use case, where companies are building chatbots for their internal data consumption or internal workflows. Embedded analytics is hard to monetize because it has historically been dominated by the BI vendors, and we still see a lot of organizations using BI tools as their vendors. And as I was saying about BI vendors adding natural language interfaces, they will probably add that to their embedded analytics capabilities as well, so they would be able to embed that too. I think that's part of it. Also, if you look at the embedded analytics market, the bigger organizations, the big guys,
they're really more custom. At some point I see many organizations just stop using a vendor and build most of the stuff from scratch, which is probably the right way to do it. So you've got a market that is very capped at the top, and then in the middle and small segment you've got a lot of vendors trying to compete for the buyers. Because, again, BI is very fragmented, embedded analytics is therefore fragmented also. So you're really going after the mid-market slice, with a lot of other vendors competing for it. That's why it has historically been hard to monetize. And I don't think AI is really going to change that, because it's just using a model: you pay OpenAI and that's it. Everyone can do that, right? It's not much of a competitive advantage. So it's going to be more like a commodity feature that a lot of vendors would be able to leverage.

This is great, Artem. As usual, we've got our lightning round, so it's three questions: one is about acceleration, one on exploration, and then a takeaway. The acceleration question is: what's something that already happened in AI, or maybe in data, that you thought would take much longer but is already happening today?

To be honest, all those foundational models. We had a lot of models that had been in production for quite a while now, maybe a decade or so, and it was very niche use cases, very vertical use cases, very customized models. Even when we were building Statsbot back in 2016, we had some natural language models being deployed, like Google Translate or something; that was still a sort of a model, right? But it was very customized to a specific use case. So I thought that would continue for many years: we'll use AI, we'll have all these customized, niche models. But then there are the foundational models. They're very generic now, they
can serve many, many different use cases. So I think that is a big change, and I didn't expect that, to be honest.

The next question is about exploration. What do you think is the most interesting unsolved question in AI?

I think AI is a subset of software engineering in general, and it's sort of connected to data as well. Software engineering as a discipline has quite a history. We've built a lot of processes, toolkits, and methodologies for how we approach that, right? Now, AI, I don't think it's completely different, but it has some unique traits. It's pretty much not idempotent, right, in many dimensions, and it has other traits, which may require different methodologies, different approaches, and a different toolkit. I don't know how much it's going to deviate from standard software engineering. I think many of the tools and practices that we developed for software engineering can be applied to AI, and some of the data best practices can be applied as well. But it might be a very interesting subfield. Like, we've got DevOps, right? It's just a bunch of tools, an ecosystem. So now AI feels like it's shaping into that, with a lot of its own methodologies, practices, and toolkits. So I'm really excited about it, and I think there are a lot of still-unsolved questions. Again, how do we develop with that? How do we test? What are the best practices? What are the methodologies? So I think that will be interesting to see.

Awesome. And then, yeah, the final message. You have a big audience of engineers and technical folks. What's something you want everybody to remember, to think about, to explore?

I'd say, being someone who tried to build a chatbot for analytics back then, and looking at what people are doing right now, I think, yeah, just do that. I mean,
it's working right now. With foundational models it's actually now possible to build all those cool applications. I'm so excited to see how much has changed in the last six years or so, that we can now actually build a smart agent. I think that's sort of the takeaway. And yeah, as humans in general, we really move technology forward, and it's fun to see it firsthand.

Well, thank you so much for coming on, Artem. This was great. [Music]
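The point made in the conversation about doing math outside the model, in ordinary Python around it, can be sketched roughly as follows. This is only an illustration under stated assumptions, not anything shown by the speakers or taken from Cube: the rows, the field names (`month`, `active_users`), and the helper `month_over_month_growth` are all hypothetical. The idea is that the model (or the semantic layer) only helps fetch the numbers, and plain code does the arithmetic.

```python
from typing import Dict, List, Optional


def month_over_month_growth(rows: List[Dict]) -> List[Dict]:
    """Compute percentage growth between consecutive rows in plain Python.

    `rows` is assumed to be sorted by month, e.g. the result of a
    (hypothetical) warehouse query that the model helped generate.
    """
    results = []
    for prev, curr in zip(rows, rows[1:]):
        growth: Optional[float]
        if prev["active_users"] == 0:
            growth = None  # growth is undefined when the base is zero
        else:
            change = curr["active_users"] - prev["active_users"]
            growth = round(change / prev["active_users"] * 100, 2)
        results.append({"month": curr["month"], "growth_pct": growth})
    return results


# Hypothetical query result; in practice these rows come from the warehouse.
rows = [
    {"month": "2023-01", "active_users": 1000},
    {"month": "2023-02", "active_users": 1100},
    {"month": "2023-03", "active_users": 990},
]
growth = month_over_month_growth(rows)
```

The model can then be asked to phrase the answer around numbers that were computed deterministically, rather than being trusted to do the division itself.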

Original Description

Text-to-SQL was one of the first applications of NLP. ThoughtSpot offered “Ask your data questions” as their core differentiation compared to traditional dashboarding tools. Today, natural language queries on your databases are a commodity. There are 4 different ChatGPT plugins that offer this, as well as a bunch of startups like one of our previous guests, Seek.ai. Perplexity originally started with a similar product in 2022. Artem Keydunov from Cube.dev came on the podcast to talk about what the semantic layer is, and how it can work as the equivalent of RLHF for models to make it easy to build reliable data experiences with AI.

00:00:00 - Introductions
00:01:35 - History of Statsbot - Slack bot for querying data in Slack
00:04:45 - Building Cube to power Statsbot due to limitations in natural language processing at the time
00:06:50 - Open sourcing Cube as a standalone product
00:08:34 - Explaining the concept of a semantic layer and OLAP cubes
00:10:27 - Using semantic layers to provide context to AI models
00:11:54 - Challenges of using tabular vs. language data with AI models
00:13:11 - Workflow of natural language to SQL query using semantic layer
00:16:01 - Ensuring AI agents have proper data context and make correct queries
00:18:20 - Treating metrics definitions in the semantic layer as a codebase with collaboration
00:22:55 - Natural language capabilities becoming a commodity baseline for BI tools
00:24:37 - Recommendations for building data-driven AI applications
00:28:26 - Predictions on the consolidation of modern data stack tools/companies
00:30:14 - AI assistance augmenting but not fully automating data workflows
00:34:20 - Using external Python scripts to handle limitations of models with math
00:36:15 - Embedded analytics challenges and natural language commoditization
00:39:04 - Lightning round



The video introduces the concept of a semantic layer for analytics and its applications in natural language queries, data transformation, and embedded analytics. It highlights the evolution of text-to-SQL queries and the role of LLMs in improving the efficiency of these queries. Viewers can learn how to build natural language queries, improve text-to-SQL efficiency, and apply data transformation techniques.

Key Takeaways

Write a book about your data to describe its context
Turn the text representation into embeddings for vector storage
Provide the context to the model through fine-tuning or in-context learning
Create a query with semantic clarity to reduce errors
Start from a proper data warehouse, query engine, or lakehouse when building data-driven AI applications
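The first three takeaways (describe the data in text, embed it, then hand the relevant pieces to the model as context) can be sketched as below. This is a toy illustration under loud assumptions: the member names and descriptions are invented, and `embed` is a bag-of-words stand-in for a real embedding model, used only so the example runs without any external service. A real setup would swap in an actual embedding model and vector store.

```python
import math
import re
from collections import Counter
from typing import Dict, List

# Hypothetical text descriptions of a data model ("the book about your data").
DESCRIPTIONS: Dict[str, str] = {
    "orders.total_revenue": "Sum of the order amount for completed orders, in US dollars.",
    "users.active_users": "Count of distinct users with at least one session in the period.",
    "orders.created_at": "Time dimension: timestamp when the order was created.",
}


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# "Vector store": an in-memory mapping from member name to its vector.
INDEX: Dict[str, Counter] = {
    name: embed(name + " " + text) for name, text in DESCRIPTIONS.items()
}


def retrieve(question: str, k: int = 2) -> List[str]:
    """Return the k member names whose descriptions best match the question."""
    query_vector = embed(question)
    ranked = sorted(INDEX, key=lambda name: cosine(query_vector, INDEX[name]), reverse=True)
    return ranked[:k]


def build_prompt(question: str) -> str:
    """Assemble an in-context prompt from the retrieved descriptions."""
    context = "\n".join(f"- {name}: {DESCRIPTIONS[name]}" for name in retrieve(question))
    return f"Data model context:\n{context}\n\nQuestion: {question}"


prompt = build_prompt("How many active users did we have last month?")
```

The resulting `prompt` string is what would be sent to the model; fine-tuning on the same descriptions is the alternative route the takeaways mention and is not shown here.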

💡 The semantic layer can reduce errors by generating a SQL query and executing it against the warehouse, and collaboration is key to managing different versions of the semantic layer.
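One way to picture how a semantic layer reduces errors: the model only picks named measures and dimensions, and deterministic code validates those names and generates the SQL. The sketch below is a simplified assumption of that idea, not Cube's actual API or query format; the table, the member names, and `compile_query` are hypothetical.

```python
from typing import Dict, List

# Hypothetical semantic-layer definitions: member names mapped to SQL expressions.
MEASURES: Dict[str, str] = {
    "total_revenue": "SUM(amount)",
    "order_count": "COUNT(*)",
}
DIMENSIONS: Dict[str, str] = {
    "status": "status",
    "country": "shipping_country",
}
TABLE = "orders"


def compile_query(measures: List[str], dimensions: List[str]) -> str:
    """Validate member names, then generate SQL deterministically."""
    unknown = [m for m in measures if m not in MEASURES]
    unknown += [d for d in dimensions if d not in DIMENSIONS]
    if unknown:
        # A hallucinated member is rejected here instead of reaching the warehouse.
        raise ValueError(f"Unknown members: {unknown}")
    if not measures:
        raise ValueError("At least one measure is required")

    select_parts = [f"{DIMENSIONS[d]} AS {d}" for d in dimensions]
    select_parts += [f"{MEASURES[m]} AS {m}" for m in measures]
    sql = f"SELECT {', '.join(select_parts)} FROM {TABLE}"
    if dimensions:
        sql += " GROUP BY " + ", ".join(DIMENSIONS[d] for d in dimensions)
    return sql


# The structured request a model might produce for "revenue by country".
sql = compile_query(measures=["total_revenue"], dimensions=["country"])
```

Because the definitions live in code, they can be versioned and reviewed like any other codebase, which is where the collaboration point above comes in.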


