Fireside Chat #11: The Open-Source Modern Data Stack

Outerbounds · Intermediate · 📊 Data Analytics & Business Intelligence · 3y ago

Key Takeaways

The video discusses the open-source modern data stack, focusing on Apache Iceberg, Metaflow, and other tools for data analytics and machine learning, with Jason Reid, Head of Product and co-founder at Tabular, sharing his expertise.

Full Transcript

Hi everyone, it's Hugo Bowne-Anderson here from Outerbounds. I am more than excited to be here today with Jason Reid, head of product and co-founder at Tabular, to talk about the open source modern data stack and everything that entails. We're going to get started in a few minutes; we're just going to wait for a few more people to roll on in. So if you're able to introduce yourself in the chat on YouTube, let us know who you are, where you're calling in from, what you're up to with data science and machine learning, and what you do. That would be super awesome. All right, back in a couple.

Hey everyone, it's Hugo Bowne-Anderson here from Outerbounds. Thank you so much for joining us today to chat with Jason Reid, head of product and co-founder of Tabular, about the open source modern data stack, everything that means, everything that entails, and how we can think about the data end of the entire stack to help us deliver as much business value as possible, and what that means. If you can introduce yourself in the chat on YouTube, that would be fantastic, and we'll get started in a couple of minutes. And I see Carol Willing in the chat, who says, "Howdy, Hugo." Hey Carol, it's been a long time. And Jason, Carol's sending all the best from Noteable to you as well.

Awesome. Love the folks at Noteable, ex-Netflix-ers, great colleagues.

Exactly, fantastic. All right everyone, we'll get started in a minute. It's Hugo here from Outerbounds, and I'm super excited to be here today with Jason Reid, head of product and co-founder at Tabular, to talk about the open source modern data stack. We also have Alex Leith, who's appeared in the chat, who's here from Tasmania. And both people who've commented, Carol and Alex, I've worked with previously on similar events to these types of live streams, actually, so that's really cool. And we also have Molly from Philadelphia, so we've got people from all around
the place. Great. Well, without further ado, Jason, why don't we turn our cameras on and get started?

Sounds good.

Good morning, as I'm in Australia, and good afternoon to you, Jason, in sunny California.

Yep, sunny and 75, per usual here.

Okay, there you go, good to hear. Once again, I'd just like to welcome everyone to this fireside chat. Do introduce yourself in the chat and feel free to ask any questions along the way as well. Just before we get started, a bit of bookkeeping: we're going to have an AMA, an async ask-me-anything, on Slack afterwards for any questions we don't get to and any more that arise, so I'm just going to paste a link to that in the YouTube chat. And for those who don't know about what we do at Metaflow and Outerbounds, we work on infrastructure and productivity tools for data scientists that allow them to focus on the top layers of the stack, on building models and doing science, while having easy access to infrastructural layers such as compute, orchestration, and versioning. We do this mostly through open source Metaflow, but we're really excited to be working on products as well, so definitely check out outerbounds.com and Metaflow. We have Sean here from Southern California, great. I'm just going to paste the Metaflow GitHub for those interested in checking it out, and if that's something that you're interested in, please give us a star as well. I also wanted to let you know about the next event we're having in a few weeks, which I'm very excited about, and I've pasted a link to that on Eventbrite as well so that you can sign up. It's an introduction to Kubernetes for data scientists and machine learning engineers. The reason I'm excited about that is, at Metaflow and Outerbounds, we don't think data scientists and MLEs should necessarily need to know too much about Kubernetes, but if they're interested, we want to make sure that
they're able to. So I'm having this live stream with a Kubernetes expert from Outerbounds and Metaflow to talk about everything you need to know about Kubernetes, or that would be useful. But also we have Brian Galvin, who was at Netflix, at the LA Times as well, and a number of other places. He started as a data analyst and statistician and data scientist working in R and slowly moved his way to more infrastructural stuff, so we want to tell that story as well, so people have a sense of the different career paths that can happen depending on your interests. So that's a little pitch for our next event for those interested. Obligatory: if you enjoy this, hit subscribe, wherever that thing is, and share with all your friends.

Well, that's enough out of me. I'm super excited to be here with you today, Jason. You're head of product and co-founder at Tabular, the company behind Apache Iceberg, which powers big data at hundreds of companies. Previously you led data engineering at Netflix as director of data architecture. So maybe you can tell us a bit about yourself and what you're up to at Tabular and Iceberg, and then we can jump in.

Sounds good. Thank you, Hugo. Very glad to be here. Thanks for the invite to come on and talk data infrastructure, something that is exciting to a few of us in the world; I'm definitely one of them. So, quickly: the last two years at Tabular, co-founded with Ryan Blue and Dan Weeks, two of my Netflix colleagues and the authors of the Apache Iceberg project, which they authored originally at Netflix. I can't take much credit for writing the code, although I think it was the result of many of the requirements that I was sending their way, leading the data engineering efforts at Netflix. We can get more into how all this is related to our conversation today, but yeah, I was at Netflix for eight years, did all things data engineering and experimentation, and
worked with Ville, actually, and the Metaflow team; they're also ex-Netflix, so fantastic people, and a great set of technologies that obviously is behind everything that Netflix does with data, which is a ton of different stuff. And yeah, I've been enjoying running product at Tabular the last couple of years and basically trying to bring the power of that architecture that we had at Netflix to the rest of the organizations out there that don't have 50-person data infrastructure teams running around open sourcing things.

So we'll get to this in more detail, but for those who don't know a lot about Iceberg, maybe you can tell us a bit about Iceberg.

Yeah, absolutely. So Iceberg was a technology developed at Netflix, really to overcome the problems that we were having trying to run our data infrastructure based on the Hive table format, which is kind of the precursor to modern table formats, of which Iceberg is one. And the goal there really is, much like you talked about data scientists not having to focus on low-level infrastructure, compute and network and these things, and letting them focus on doing the data science work and the business work, it's a similar story for Iceberg. It's really to get data analysts and ML folks and data engineers to stop having to think about files and file formats, things like Parquet, and table layouts and partitioning, all these very low-level details, and let them up-level and just think about solving the business challenges. So Iceberg allows that kind of abstraction, leveling up that abstraction a little bit. And then the other big thing is really allowing all these different data tools in the ecosystem to play together on the same data sets safely. That's kind of the big difference with something like Iceberg versus historically, and that's a big part of, I think, what we're going to get into today. It's like, hey, if I have a data set and that's my source of truth for whatever it is in my
business, my customers, sign-ups, payments, whatever it might be, videos watched in the case of Netflix, you want to be able to use that source-of-truth data set across all your tools: to do analytics with SQL, but also to do ML with Metaflow and Python libraries, or Spark to do A/B experimentation analysis. And you don't want to have to make copies of that stuff. So that's really the power of Iceberg.

Fantastic. And I love that you framed it in terms of connecting data upstream to business value and business outcomes. So maybe we can start at a high level: when we're talking about the use of data to connect it to business outcomes, what type of business outcomes are we talking about?

Yeah, I think that's really where people are starting to come to the realization of, hey, we collect all this data and we put it in a lake, we put it in a warehouse, and we were supposed to get value out of the other side, and that didn't always happen. It doesn't happen automatically, it turns out. There are lots of different ways that you can get value out of data that you're collecting and processing, but because there are so many different ways, there are lots of different areas of the business and skill sets and tools that come to bear. It can be used to optimize your marketing campaigns. It can be used to run A/B experiments on your product. It can be used to build personalization algorithms for your videos, like Netflix recommendation algorithms. It can be used to build large language models, LLMs, if that's what you want to do with it. So I think the place where people get stuck is that it's a bit overwhelming. There are so many different things that you can do to get business value, but how do you actually get to them, and how do you do so efficiently? And so I think it really is: start with one thing you want to do, one thing you want to accomplish, make sure that that works
really well, but design that in a way that when you add on the next use case, the next data product you're going to build, the next business value you're going to unlock, you can do so while reusing as much of that original architecture as possible. And that's kind of where we get to all the open source things like Iceberg. It's like, okay, if I had an end-to-end, from data collection all the way through a recommendation algorithm, personalization for my product, that's great. What if I now want to take that same data and just do high-level dashboards for my executives? Well, I should be able to reuse 90% of that stack and maybe just put a dashboard on the end of it. So the key is identifying all the different areas where you can provide value, and then how do you reuse as much of your architecture and engineering efforts as possible so you're not having to reinvent the wheel every time.

Absolutely. Something I'm really hearing in there, which I think is super important, is that as much as possible we want to have the same single source of truth as data to serve all our data needs, whether it's experimentation, machine learning, executive dashboards, exploratory data analysis, ML in production, whatever it may be.

Yeah, absolutely. And I think that historically has been not really technically feasible, and maybe only accessible to companies like Netflix and Apple, et cetera, that could invest a lot of engineering effort into building these single-source-of-truth systems. But more and more, as these open source projects mature, as vendors and commercial products emerge around them, that architecture becomes available to the vast majority of companies, and it's a very powerful setup ultimately.

Great. And we've got a wonderful small Netflix fan club emerging in the chat. Carlos said Netflix has one of the biggest and most robust infra for data; they have great blog posts about it too,
and Nick Salerni has chimed in and said they most definitely do, big kudos to their efforts to open source a lot of it also. So that's really, really cool.

That was, like, I was so glad to be at Netflix, and one of the reasons was their big investments in open source. For me in my career it was great. I got exposure to all these open source technologies and got to become an expert in some that aren't that useful, like Apache Pig, which isn't super valuable anymore, although we used it at Netflix a ton early on, but also Spark and Trino, Parquet file formats, things like Apache Iceberg, Metaflow. So it's great that Netflix both invests engineering in that and also gives back to those open source projects. If you ask my co-founder Ryan Blue why they open sourced Iceberg from Netflix, he'd likely tell you it was very selfish on the part of him and Netflix, in the sense that we built it at Netflix and it was great, but we didn't want to be on the hook internally at Netflix for continuing to keep all the integrations up to date and to add new integrations for a new compute engine. It's like, okay, it works for Spark and it works for Trino, but what about Druid or Metaflow or other things that want to use this data? We could build all those integrations ourselves and maintain them, but it was a much better thing to share with the community and then get the whole community involved in building new connectivity and maintaining it as these things increase. There's a lot of entropy in the data world, as people know; there's a zillion tools, and it does take a village to keep everything going in the right direction.

Absolutely. And I think this framing of a suite of open source tools dovetails really nicely into what I wanted to talk about next. A lot of our audience are data scientists and machine learning engineers who are really thinking about what tools to
adopt, and as you know, there's a huge proliferation of tools in a variety of forms. We can classify them in a number of ways, but one way to slice it is that we have a bunch of single large proprietary platforms, and we also have a tool chain of open source tools that's possible. So I suppose I'm interested in, for someone who wants to adopt new technologies, why, and maybe when, is a tool chain of open source tools a more appropriate choice for your data stack than a single proprietary platform?

Yeah, that is a great question, and one that we definitely wrestle with. I think the first thing is to understand what the trade-offs are in that decision. Historically it was a bit more straightforward. You were either fully in a proprietary platform like Oracle or Teradata, right? It is all of your compute, it is all of your storage, you can only do what those systems allow you to do, and that was it. It was either that, or you could go the open source route, fine, but it was a choose-your-own-adventure. You had to take all these technologies, you had to stitch them together yourself, you had to glue them together with all the infrastructure, and you needed a lot of engineering resources to do that. So it was expensive to go that route, but you gained a lot of power and flexibility and long-term cost mechanics, because you weren't tied into expensive platform licensing fees. And that was Netflix's case: we were a big Teradata shop early on, and it got incredibly expensive, and it could really only do SQL analytics, and what about all the other things we wanted to do? So we moved to this open stack, but it was painful and a long engineering process. And now we're in this world where even the proprietary systems have more optionality and more integration. So you can take something like Snowflake, and that actually has
support for Iceberg storage tables, so you can use that compute and combine it with open source storage. Or Databricks has something similar with their Delta Lake product. Even Microsoft is getting in on it: they have their whole Fabric product now, which just came out, which is kind of this architecture backed by OneLake, which is the Delta Lake open table format underneath the covers, which means you can plug in some different tools to it. So we're operating in more of a hybrid world than we were, but I still think the choice ultimately comes down to optionality. If you're going to go the open source route, you are just always going to have a lot more choices at your disposal. You're not going to be locked into choices given to you by a vertical proprietary sort of stack. And the question then becomes: is that optionality that you gain worth it for what will likely be, even today, when it's gotten a lot better, still a little bit more work stitching these things together? And the open source tools and ecosystems, what Tabular is doing and Outerbounds, are making it easier to use those tools together, and so on and so forth, trying to make that story better. But it really does come down to that. Do you want the ease of use of an integrated platform, where everything works relatively well together, but you're limited in your choices and you're kind of tied to their cost model, so sometimes it can be a cost driver? Or do you want to go the open route, the open stack: have multiple tools, have a lot more optionality, more control over costs, but with the challenges that come from integrating multiple tools? So there are trade-offs to be had, for sure. I think the big one that maybe people overlook is the next decision. You're making this decision maybe in this moment, but what about next year when the next new tool comes along? How quickly can
you integrate that? And are you going to fall behind competitors because they're able to integrate a new technology, they're able to integrate generative AI tools into their tool chain, and you can't, because you're still waiting for your vendor to provide something in that space? Make sure that you think about that future-proofing aspect. The landscape is changing so fast that I think optionality is also about speed and agility and time to market.

Absolutely. And this is something that I would have even said five years ago. A lot of data scientists I know have experienced this, and I've experienced it too: when you're locked into a proprietary platform, you may not have access to the most recent bleeding-edge version of PyTorch, for example, just plucking something out. That was a challenge several years ago, as I've said, but now, as you mentioned, with all the generative AI tools, if you're locked into something and you don't have access to this entire new suite of parts of the tool chain, that can be a strategic disadvantage, right?

Absolutely. So there's that on the technology front, and then there's the ecosystem side, both for you as an individual in this space, right? I benefited a ton from my exposure and then experience and now sort of expertise in open source technologies. Early in my career, I was at [unclear], and I was doing all Microsoft stuff, and I knew Microsoft SQL really well, but I didn't know any of the other tools outside the Microsoft stack. That made me much less valuable on the open market as an engineer; I just had a narrow field of view and a narrow skill set. And similarly, companies who are only using a single technology can only hire people who know that tech stack, and the pool that they can choose from isn't as broad. And engineers who want to get exposure to open
source and broaden their skills aren't going to be as excited about working in a place that's in a single stack. So those are all secondary concerns, but very real concerns. Netflix was able to attract, I think, a lot of the top engineering talent because people knew they could go there and always be working on the latest cool technologies and open source technologies, et cetera.

Absolutely. We have a bunch of really interesting questions. Some are Iceberg-specific, which I want to get to once we dive deeper into Iceberg, but before that we have a super relevant question from Alex Leith, who I actually know from when I worked on Dask. He works in large-scale geospatial stuff in Tasmania. Alex asks, have you seen pushback (and I think he's referring to your time at Netflix, but also anything beyond that) on open source from decision makers? In Alex's industry, geospatial, there's a whole tension between commercial off-the-shelf and open source, and Alex is interested in whether it's the same in web-scale orgs, or anything else you've seen.

Yeah, I think you have that same tension in any industry, right? Open source versus commercial. And I guess now the world's even muddier with these sort of hybrid offerings, like Outerbounds, where there's a commercial entity there but it's all based around core open source stuff, or Confluent and Kafka as another good example. So there's a whole spectrum there. There are definitely tensions between decision makers and engineers, and I think there are good arguments all around. I don't think there's a right answer to these things. I think those tensions are actually probably healthy, because they force you to evaluate the trade-offs and what they mean for your business in your situation. So, you know, the answer of "it depends," while, you know, this is
incredibly generic and probably tiresome, "it depends" is, I think, the right answer. And those tensions, I think, are healthy, as long as organizations are willing to be open about the trade-offs and make informed decisions, and not just say, "Oh well, I'll never use open source," or "I only use open source," or "I only use commercial." Any absolute position is probably naive.

Absolutely. And it's also worth mentioning that the hybrid nature of open source and managed services, and productization of open source, and vendors based around open source, also does support open source in terms of de-risking for adopters. One example that I'll give: as I said, I worked on Dask and worked at a company, Coiled, that supported Dask. And when a lot of people I know were figuring out whether to go with Dask or Spark early on, the existence of Databricks, pre-Coiled and pre any company supporting Dask, was something which was really important. Having a phone number to call when stuff doesn't work is incredibly important.

Absolutely. If you're trying to run a business and there's not somebody you can call when something's broken and you're just on your own, that's a very scary place to be for something that's going to be core to your business. So I think having commercial entities around open source is ultimately a very healthy thing, because it's like, well, what happens if this project dries up tomorrow and the community goes away and we're left holding the bag? Nobody wants that. It happened a number of times early in the open source software world. And I think we're seeing that once you get some commercial backing around this, that de-risks adoption for companies. It also provides resourcing to continue to develop the project, because open source otherwise relies on, hopefully, big companies and, hopefully, maybe individuals in their free time, but that's
a lot to ask if you're going to bet commercial money on it. But when you get an Outerbounds or a Tabular or Confluent, or Coiled in Dask's case, then there are real dollars there that are being put into the project to make sure that it gets maintained, make sure that bugs get fixed, make sure that vulnerabilities are dealt with, those kinds of things that enterprises really care about. So I think that's all a good thing. And then the question becomes: does it still maintain that open source community, where you get contributions from lots of different people, where it's not being totally driven by one entity? Otherwise you lose a lot of the value of something being open source, because then it's just open code, which is a different kind of thing. You can read the code, but you really can't drive anything about the project. That loses a lot of the value of open source and community-driven efforts. So that balance has to be there.

Yeah, and to be clear, open source is a very broad church, so to speak. Open code is very different to open governance. TensorFlow, for example, is open source, of course, but it's a very different model of governance than what we think of; it's company-driven, right? So this is really exciting, and I'm really interested: once we've decided that we want to start thinking about a tool chain of open source tools for the data stack, and tying that to machine learning and business outcomes at the end, what then? What is the tool set? What does it look like? How can we think about it?

Yeah, I think you can look at the broad data infrastructure diagrams, and they all have similar componentry. So you've got something to get data into some sort of storage system, so ingestion of some kind. That can be something like streaming, and in open source Kafka obviously comes up, but there are
now other good open source alternatives to that as well. Or maybe you need to take your transactional systems and move that data in, and there are open source things there, like Debezium or something, for doing change data capture. So, okay, I'm going to move this data, and that's really just to get data out of transactional systems and out of streams into more long-term stores, which sets things up a lot better for a lot of the downstream data use cases: analytics and ML, et cetera. There are good use cases for streaming data as well, but I'm going to sidestep that, because that's a whole different conversation we could have on a different day. So, okay, you've got to get data in, you've got to do ingestion. Then there's storage, and that's where you've got things like Apache Iceberg, Delta Lake, Apache Hudi. These are the common open table formats that give you these nice storage properties: I have a schema, I have schema evolution, I can safely do transactions on this stuff, things that you really need if you're going to build enterprise data products. And after that you get to the consumption layers. So this is, okay, what's my ML tool chain? I have model training and serving and versioning, a lot of things that Metaflow plays in. You've got other, maybe just big ETL-type processes, something that Spark has really dominated the market in for a long time, just doing transformation at scale. And then you've got analytic workloads, ad hoc analytic workloads; something like Trino has been the dominant player there for a while, and there are other open source tools as well. And then serving dashboards, and there are all kinds of open source visualization tools, and notebooks: Jupyter and other open source notebook tools. So there's a very healthy suite of tools you can use to cover almost all the bases in those big areas, I
would say. And then I think the one bit, and hopefully something that we can get into a little bit, that's been historically missing, and I think still is, at least as an open source alternative, is how do we do governance across that whole thing. I should say that there are open source solutions for what I'll call federated data cataloging: let me just collect all these data assets and put them somewhere where I can search through them and tag them and document them. Those are things like DataHub, which is a good open source thing that came out of LinkedIn, or Amundsen as well, which came out of, I think, Lyft, if I recall. So there are some good open source tools there. But then the piece that you need to do this at scale in an enterprise is how we're going to do security. Governance is a big space, but in particular, how are we going to secure this data and make sure that we can comply with data regulations, all the data privacy legislation, and just consumer awareness of data privacy, which is really healthy and important? This is kind of new, and juxtaposing that with this open source tool chain, that's where the rubber meets the road for enterprises: how do we actually do those two things well together? And I think that's still a bit of an unsolved question. There are certainly vendor solutions for this, and it's something that Tabular plays in, but that's a tricky bit.

Right, and I do want to drill down soon into the data governance concerns, after we talk some more about security and regulatory requirements also. You do work on Iceberg, and that's a really nice schematic of the tool chain with tools such as Iceberg in the middle. I'm interested in why Iceberg: why would people adopt a tool like Iceberg? And maybe speak to where it sits in the tool chain as well, all the way
through to being able to have a single source of truth of the same data for all of your downstream needs, which I know is a really important concern for all of you.

Yeah, I mean, I think that that's the primary one, but then, like, why Iceberg...

I'm sorry, I answered my own question there.

No, that's okay. I think that's the most important one. And then maybe why Iceberg specifically versus other choices you might have. I think it's also why open, why it is an open format for that, and then, across the open formats, why Iceberg. I'll try and cover both of those, at least in my opinion. So I think the reason open in particular is so important at that storage layer, at that central piece, is because it's at the middle. You have to bring data into it, you have to get data out of it in multiple different ways. And so, in order to have the story where we can have a single source of truth, interoperability is the number one concern. It has to be able to work well with all the other tools in the ecosystem, open tools and commercial vendor tools. It has to also work with Snowflake and [unclear]. So having an open standard is really, really important. You see this in anything in the tech space, and even non-tech spaces: where interoperability is the key component, open standards tend to be the ones that win out, something that everybody in the ecosystem can point at and rally around and say, yes, we all will benefit by contributing to this open standard. Getting all these different parties to agree on something is hard, but it becomes the big unlock: we have an open standard here that we can all invest in, and we can all trust that it's not going to go in a direction that's bad for our company or our product. So that's really, really important, and that's why I think you see the open table formats being at the center of that story. And then why Iceberg, and it's related to
that is if you're going to have an open standard and it has to be something that many many different companies open source and commercial companies can invest in and trust it has to be something that's very much community driven right that everybody's voices have a chance to be heard and there is a good governance in this case like it's an Apache project but it has to be a really healthy governance model to make sure that people feel that um yeah they can contribute effectively and they can build stuff on top of it and not have it ripped out from underneath them six months down down the line so that's where I think Iceberg really shines it has an incredibly healthy community industry players Netflix you know Apple LinkedIn Airbnb etc as well as commercial interests you know Snowflake and Cloudera and IBM uh maybe the the clouds as well uh AWS and and um and Google so you've got all these parties that have all said yeah this is an open standard we can all invest in this thing nobody controls it right it's got a good governance model um and we can build the future on top of this open standard because it's a win for everybody to have this architecture of single source of truth which I think is what the marketplace really really wants so that's you know how I see why Iceberg there's all the technical reasons that Iceberg is an amazing like technology in and of itself but I think the real reasons are actually non-technical reasons of why why great we've got um a question from Nick salooni Nick's curious as to your views on migration to Iceberg from other potentially legacy solutions so specifically coming from the angle of engineering effort for small and large teams to migrate yep um so you know data migrations are the bane of everybody's existence mine included um you know most people coming to Iceberg because it is ultimately Parquet uh data files underneath the covers right are usually coming from some sort of Parquet based uh storage or like I but maybe you're coming
from from Snowflake or some proprietary storage system as well and there's a couple things there since it is based on top of Parquet the migration path is mostly about um building the Iceberg metadata the table metadata which turns it into a a really usable uh interactive table on top of existing Parquet files so that's the simplest thing you can actually do this migration almost in place where you're just reusing your Parquet data files and building metadata on top um otherwise the migration to Iceberg looks like it does in kind of any other data warehouse or database migration where you're sort of select star from old insert into new and that obviously works in the Iceberg case just like it does in any database um but there is a slightly cleaner path if you're coming from an existing Parquet or ORC based um based data warehouse there's there's some there's some streamlines into that into that story and that's like the Netflix case uh and Apple similarly yeah great um and just quickly if for those who haven't uh used Iceberg before and want to find out more um where can where can they do that yeah um uh the Apache Iceberg website is a great place to go the documentation is there the very front page of that has uh also uh how to get into the Slack community the Iceberg Slack community is I think we just recently crossed 3,000 users in there and it's an incredibly active place uh for both newcomers asking you know intro questions as well as experts and experts giving answers and so again the community aspect is really healthy healthy there um and obviously like if you uh want the easiest button you know Tabular is a great place to start we have a free tier you can go there and click sign up and you can get started with Iceberg tables you know in a matter of minutes so uh I'd be remiss if I didn't at least plug that a little bit um absolutely great so I've actually shared the links in the chat to both the Apache Iceberg page and to uh tabular.io as well thank you um so
we've mentioned a bit around um governance which we'll get to before that though I'm interested um especially in kind of a changing landscape of security and regulatory requirements how you can make sure that your data and pipelines meet security and regulatory requirements yes I think this is a bit of a mindset shift architecturally right as somebody who's like been working in data infrastructure and data architecture for a long time you know in a world where um your compute and your in your storage were bundled together you could sort of secure that and you do that in one system like in Teradata you set up your roles and your permissions and how people have access to data and you could sort of go for this um least privilege access concept and it was cumbersome still because like how do you get access to data that you need access to you have to go through you know request flows and approvals etc but you could lock it down and then you could audit it right you could you had an audit trail of all the activity against the data who had access to what and when they got access etc and that was good from a security perspective and then when you want to do things like like governance um like say hey right to be forgotten type of rules if you know some customer raised their hand and said hey you no longer can have my data you'd find their customer ID you'd go through your warehouse and delete everything with ID equals Jason and you could be relatively sure that you were meeting the requirements that became really really messy when we started collecting data in these open formats right it's actually pretty even Iceberg you'd be doing this on top of S3 or something uh how do you effectively do security especially when you've got multiple different ways to access data if we're in this architecture which we all want which is a single source of truth and and read it from multiple different tools how do we successfully lock that down or even know who has access to what
how do we audit it who's accessing the data like really difficult stuff I don't think we have you know great open source solutions yet at least at Tabular we've built something where the idea is to move that security from being in the in the compute layer in the tool that's accessing the data and put it on the actual data itself right so this is um like the distributed cloud version of how your file system on your laptop works right it's like you have uh documents on your on your hard drive and certain users who have access to your laptop have access to those documents so it's not that you say Microsoft Word has access to this doc but the text editor doesn't like that's not how it works like you secure the file and then you know all that access goes through the applications but the but the file is ultimately the thing that's secured not the application and so same thing now we move to these distributed data architectures we need to start to move the security layer down to the physical storage um and and make those compute layers uh basically come with some sort of authentication and authorization to the storage and say like hey I'm Jason here's my authorization that says I should have read access to this data can you can you please let me in right and no matter if I'm coming from a Python process or I'm coming from a Spark job or I'm coming from a Trino query like that same like uh exchange has to has to happen so I think that's where we're going to move to if we're going to be successful in this in this architecture um and it's definitely a challenge and it's definitely a shift but that'll get us to at least at least security and then we can audit it right if all that security is happening at that storage layer we can go get back to getting a nice clean audit then there's the secondary concerns about what about GDPR and those kinds of things when customers are like hey uh forget that I you know you have data on me um and that's yet another suite of capabilities that we
need to build out in this in this architecture uh it's it's things like having really good lineage right where where do all these customer records go how many different tables did they show up in how do I actually effectively find all that data and then how do I remove it efficiently uh when I have these columnar file formats so I'm I'm the data's all organized to do bulk analytics and now I'm trying to do record level you know inserts and deletes and things of that nature difficult stuff to do um and Iceberg and and all the other table formats have have added capabilities to do like record level operations um but doing all that efficiently at scale in an enterprise type of environment and keeping track and and being able to prove to some auditor or regulator that you've complied I think we're still early stages of doing that kind of stuff and how do we even think about this when you know we have data begetting new data so you know with feature stores and metric stores and all of these different types of things and you know people doing machine learning doing transformations and that type of stuff and having that stored in a variety of different different places I think these are great questions I think this is where you know the regulations are always going to be behind the technology I mean what does it mean if um you take all of my my search history and you turn that into some embedding and you use it in a model and then I tell you hey you need to forget about me okay you can drop the records in my search history maybe boo but you used all that data to build your model is that is that up for grabs you know in these large language models that have been built on you know the corpus of texts that have been what did these people give permission for that to be used in the model this is like uncharted territory so regulators are going to have a field day trying to trying to keep up I'm sure and then uh you know what can the what can the architecture and
platforms and technologies do to help make sure that we can still leverage our data and build these super valuable data products but do it in a way that meets the governance requirements in a fast-changing world so that's sort of like that thinking about that future proofing of your architecture you should think about some of those future problems and how you might solve them with whatever you're building today because those problems are definitely going to come your way at some point and and we've got to deal with this I think again it goes back to um being able to have really good lineage and understand how these data products are built how they're connected the connectivity across all these things um and then and then you know ways that we can effectively both do the governance and then audit it and make sure that we're meeting requirements yeah yeah I'm glad you mentioned um the existence of models that are trained on data as well large language models are one family right but I don't know whether this has actually occurred in any large-scale way but you know there's a whole fascinating area of research on extraction attacks of taking models and being able to extract training data um from from the models right which you know is pretty pretty out there it is and it's a new vector of attack and I think yeah we're at the we're at just the very early stages of the sophistication here but um we'll see time will tell but yeah thinking through these future states should at least like even if you can't predict exactly what's going to happen uh as good architects we should try and anticipate these future requirements and and you know give ourselves outs right our architecture at least technical architecture is a lot about is about having two-way doors and optionality right where you can give yourself an out don't don't design yourself into corners that you can't get out of or will be big migration efforts you know big tech debt holes that you've created for your
for yourself or your company down the line because you didn't think about possible outcomes for sure um Carol Willing has just noted in in the chat that you know of course these are these are hard problems from an open science research standpoint more more generally as well yeah I agree super super problems and um I think that means we'll all have jobs for the next 10 for little ears kind of figure them out um so yeah I'm excited for it um I think you know we'll find a way to find a balance between the the power and the value of these things and then hopefully balance that with doing it responsibly I hope absolutely um this really leads really nicely into the conversation I was hoping to have around data governance um and and what you all are up to at Tabular as well but um why is data governance important why do we need to think about it and and what type of tooling exists around it yeah I think uh all it's important for all the reasons that we've just talked about and beyond just like hey we need to comply with the government rules and we need to stay out of out of lawsuits which is you know a really big risk for for enterprises like we've all seen the headlines um you know there's there's big dollars at play for for violating uh regulations so you need to be able to comply with that stuff and it's only going to get harder and more complicated as we go forward um but but beyond that it's just even just in order to make sure that you can get value out of your data products and that you trust I think trust is a big part of it not only trust of your your consumers whose data you're collecting but internally uh your users being able to trust the data that they're using right so governance is a lot about trust yeah if I'm using a feature from a feature store to you know deliver value or try to say something about ground truth about our business I want to make sure that I know exactly the the generative process for that that data right absolutely like that provenance
like you know this is a bit like in science so you need to be able to know like where did where did this this data come from is this reproducible these results right and in order to do that it's about trust and about being able to walk the chain back and say where does data come from what kind of transformations did it go through has it changed since the last time I looked at it yeah uh who who is capable of changing it right security and access things uh all that stuff it matters and it's all around this concept of trust trusting answers trusting models reproducibility right um that's a big that's a big part of it um Carol Willing has a great general question how as a company do you instill trust yes back to the human aspects we've talked more about humans than technology this is probably a good thing Hugo yeah I mean trust is still a is a human thing uh all the tools and technologies and infrastructure in the world is not gonna get get you trust so uh trust is built with data in the ways it's built across all humans it's um and we're getting into the philosophical stuff now and something I'm probably not best to speak to but you know my experience it's been um you know does does this thing do what it says it's going to do right there's the integrity piece both as humans and technology if you say it's going to do X does it actually do X reliably right or does it do X sometimes and Y other times that that doesn't help um is it available so you know availability is a big part of it right if this data set like can I actually use it when I need to or is it like oh well uh maybe on Mondays it's working but on Tuesdays the data the pipeline broke and now we can't use it anymore it's stale or any of these things right so yeah repeatability availability those are the trust factors and then ultimately you know still it's a human to human thing more than anything that I've seen at like Netflix uh you trusted data because you trusted the team that produced it more than
anything else yep um so now we know a lot about why data governance is important what type of tooling exists around it there's lots of different tooling and governance is such a big space right I think you've got um tools that all cover bits and pieces and then there's lots of overlap like most data like most data tools I think there's the thing it was designed to do and then there's the spaces that it bleeds into around the thing it was designed to do um so you know the the most basic piece is just uh do I have a catalog of all the data assets that exist in my organization right transactional streams analytic data sets uh features models right there's uh dashboards there's so many different types of resources I mean can I just get a listing of all those things do I know it all exists out there and so there's I call those you know they're data catalogs they're federated data catalogs and you know these are the tools like DataHub and and Amundsen etc um and then there is uh and hopefully I've been I'm gonna assume that those also come with some amount of of documentation about what they are so I can list them but I can also I can also get some information about what this thing is hmm the next set of sort of that is well where did this come from so this gets into like the lineage aspects I think that's the next sort of important piece of like okay now you have a listing of things kind of know what they are how are they related like what's the sort of the graph that connects all the all these data resources um so we call them call that lineage but you have some sort of graph that says like here's how all these things are connected uh and and that's cool um and then you know you just sort of keep layering on that the next thing would be all right well um what's responsible for actually making those connections there's SQL jobs or ETL things or models that are being trained right there's all kinds of different ways these things are connected uh what are what are
those things uh and where do I find you know the business logic behind th
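
The lineage idea from the discussion above, a graph that connects data assets and can be walked to find everywhere a customer's records may have ended up, can be illustrated with a small sketch. This is a toy model in plain Python and not the API of DataHub, Amundsen, or Iceberg; the asset names and the `downstream_assets` helper are hypothetical and exist only for illustration.

```python
from collections import deque

# Hypothetical lineage graph: each key is a data asset, each value lists the
# assets derived directly from it. All names are made up for illustration.
LINEAGE = {
    "raw.search_events": ["analytics.search_sessions", "features.user_embeddings"],
    "analytics.search_sessions": ["dashboards.search_kpis"],
    "features.user_embeddings": ["models.ranker_v1"],
    "dashboards.search_kpis": [],
    "models.ranker_v1": [],
}


def downstream_assets(graph, source):
    """Return every asset reachable from `source`, in breadth-first order."""
    seen = {source}
    order = []
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Everything that might need attention after a deletion request that
# touches the raw search events.
affected = downstream_assets(LINEAGE, "raw.search_events")
print(affected)
```

In this sketch a deletion request against the raw events would flag the derived table, the feature set, the dashboard, and the trained model, which is the "data begetting new data" problem raised in the chat. A real catalog would also record which job produced each edge.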

Original Description

Jason Reid is Head of Product and co-founder at Tabular, the company behind Apache Iceberg that powers big data at hundreds of companies. Previously, Jason led data engineering at Netflix as Director of Data Architecture. In this fireside chat, Jason joins Hugo Bowne-Anderson, Outerbounds’ Head of Developer Relations, to discuss how to think about and build a unified, enterprise-grade data platform for diverse workloads, including classic ETL and ML using open-source components. Given a cloud-based data platform, how can you ensure that it robustly serves data engineering needs, machine learning, AI, dashboards, exploratory data analysis, and many other modern use cases of data?

After attending, you’ll have an understanding of
- Why a toolchain of open-source tools may be a more appropriate choice for your data stack than a single proprietary platform;
- How to use the modern OSS toolset built around Apache Iceberg to access and transform data;
- How to connect your data to your machine learning pipelines and workflows;
- How to make sure you meet security and regulatory requirements;
- The importance of data governance and the tooling around it.

And much more! The fireside chat will be followed by an AMA with Jason and Hugo at slack.outerbounds.co.

00:00 Prelude
03:59 The fireside chat begins!
07:15 In which Jason introduces Iceberg, Tabular, and ..... himself
10:38 From data to business outcomes
13:11 The importance of a single source of truth for data
16:00 Best-of-breed vs all-in-one platforms for data tooling?
22:29 Pushback on OSS tools from decision makers?
27:04 What the OSS modern data stack looks like!
31:31 Why Iceberg? A high-performance format for huge analytic tables
37:58 Making sure your data and pipelines meet security and regulatory requirements
46:15 The unreasonable importance of data governance
54:45 Drilling down into the technical aspects of Iceberg
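
One point from the security discussion in the chat, enforcing access on the data itself rather than inside each compute engine and keeping a single audit trail, can be sketched as a toy model. This is only an illustration under stated assumptions and not how Tabular or any real storage system implements it; the `TableStore` class, its methods, and the principal and table names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TableStore:
    """Toy storage layer that owns both the grants and the audit trail."""

    grants: dict = field(default_factory=dict)    # (principal, table) -> set of actions
    audit_log: list = field(default_factory=list)  # one entry per access attempt

    def grant(self, principal, table, action):
        self.grants.setdefault((principal, table), set()).add(action)

    def read(self, principal, table, engine):
        allowed = "read" in self.grants.get((principal, table), set())
        # Every attempt is recorded, allowed or not, whatever engine asked.
        self.audit_log.append((principal, table, engine, "read", allowed))
        if not allowed:
            raise PermissionError(f"{principal} may not read {table}")
        return f"rows of {table}"


store = TableStore()
store.grant("jason", "sales.orders", "read")

# The same principal gets the same decision from any engine.
for engine in ("python", "spark", "trino"):
    store.read("jason", "sales.orders", engine)

try:
    store.read("mallory", "sales.orders", "spark")
except PermissionError:
    pass  # denied, but the attempt is still in the audit log
```

The design choice mirrors the laptop file-system analogy from the chat: the decision depends on who is asking and which table is requested, never on which application carries the request, so one log answers "who accessed what" across all engines.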


The video discusses the open-source modern data stack, focusing on Apache Iceberg, Metaflow, and other tools for data analytics and machine learning. Jason Reid shares his expertise on building a modern data stack, data governance, and data security.

Key Takeaways
  1. Build a modern data stack using open-source tools
  2. Implement data governance and security measures
  3. Design data architectures for scalability and reliability
  4. Use Apache Iceberg for data storage and management
  5. Utilize Metaflow for model training and versioning
💡 The open-source modern data stack offers flexibility, scalability, and reliability for data analytics and machine learning, with Apache Iceberg and Metaflow being key tools for building a modern data stack.

