Unity Catalog Implementation & Evolution at Edward Jones

Databricks · Advanced · Data Engineering · 12mo ago


Key Takeaways

The video covers Edward Jones' implementation and evolution of Databricks Unity Catalog on the Azure cloud: the challenges and limitations of the initial Cloud V1.x setup, the transition to Cloud V2.0, and the improvements planned for the future.

Full Transcript

Hello all, this is Dattatri. I'm part of the solution architecture team for data and analytics at Edward Jones. We'll be talking about the Unity Catalog implementation and its evolution at Edward Jones.

I want to start with this quote. Somebody stole this quote this morning, but anyway: "The great growling engine of change is technology." This is from Alvin Toffler, a futurist, writer, and businessman. The phrase captures the powerful, sometimes disruptive, but ultimately transformative role technology plays in shaping the modern world and the corporate world.

Edward Jones is a 100-plus-year-old Fortune 500 company. We have more than 20,000 financial advisors who run their own branches, with around 2.5 trillion in assets under care in North America. We have branches in, I would say, every county in the US and most of the counties in Canada. With technology, this whole transformation is happening; we are in the greatest transformation for generations to come. As part of the digital evolution, data and analytics is of course at the forefront of this transformation journey.

We'll be talking about Analytics Hub. It's the platform that serves as the central repository for the whole Edward Jones analytics and reporting system. We ingest, transform, and store the firm's data and third-party data from outside to enable the analytics practice to run their shows. In the analytics practice at Edward Jones we use Databricks, a unified platform that runs inside the Azure cloud.

Here is the cloud analytics journey I wanted to call out. Back in 2022 we had a big presence of Hadoop on-prem.
So we activated Databricks with the data lake for the Hadoop decommission. Toward the end of 2022 we slowly started onboarding other analytics use cases. Two years back, in 2023, a few of my peers and I came to this same conference. That's when Unity Catalog was starting, and we had some deep-dive discussions with the experts here. Then we went back home and slowly started analyzing how to come up with a design. Ultimately, with help from Databricks, the RSAs and all, we designed and deployed it in the early part of 2024.

The first thing we did: our wonderful engineering team has a data ingestion framework through which all sorts of data sources get into the data lake. We enhanced it to register whatever data assets come in. Whether it's flat files (not images per se, but in general whatever we ingest), we save it as tables, views, or volumes, or as functions for transformation purposes. We started registering all of those in UC. That was a big lift and shift to Unity Catalog, though we still had some presence of the Hive metastore, the predecessor to Unity Catalog.

Now, in the last 14 to 15 months, we have around 6,000 securable objects (three months back, when I wrote this deck, it was 5,000) plus more than a petabyte of data. Everything from 2022 to date we call Cloud V1.x, because we made some enhancements to the existing setup.
Now more and more data traction is happening. A lot of applications want to try their use cases on the cloud with Databricks and the data lake. So we need a major enhancement; a major scale-out needs to happen. That's what we are calling Cloud V2.0, with a major scale-out, and of course we are planning for full disaster recovery by year end.

This is the current setup. You can see the different data sources that we bring into our Analytics Hub; the middle layer, the data lake and the warehouse that you are seeing, is the Analytics Hub. It cohabits with a lot of components in the ecosystem. We have the data sources on the left and the different ingestion patterns in the data movement layer. We have the orchestration layer, data security, and data governance; most of them are enterprise components, used not only by Analytics Hub but by other components as well. On the right-hand side is the engagement layer.

The mission statement of the Analytics Hub is: the Analytics Hub provides a dynamic analytics environment powered by a rich data landscape, to build and maintain a data platform optimized for analytics with highly trusted and accessible data. To achieve this mission statement, the first piece is this cohabitation of our enterprise components with the Analytics Hub, and then of course the medallion architecture that we followed.
If you look at these different components, here is what happens. We have data classification and data cataloging, which is the enterprise one; you can see the catalog, Collibra, there. It scans and collects metadata from all the data services: the data lake, the warehouse, you name it. That metadata is harvested and synced into data security, which is Immuta, our data policy tool. The users and their attributes, from AD and so on, are also synced into Immuta. Immuta then generates a data policy and pushes it to all the data services, whether it's Databricks UC, the warehouse, Snowflake, or Salesforce (not Salesforce, I mean Starburst). The next time a user logs into any of these data services, say Databricks UC, the policy is imposed based on his or her AD attributes. They will see only what they are supposed to access; the masking and unmasking and all those things happen there. So that, in general, is how the data lake cohabits with the enterprise components.

This one is the medallion architecture and how we designed it. We have the bronze layer, where all the data lands, and we maintain it in its original format. You can use it for historical purposes, for example if you need reprocessing or something of that sort. Next is the silver layer. We have two catalogs. If there are enterprise-type data sets, some of the foundational data, we try to align them to the enterprise catalog. If there is a point solution, which is not enterprise-wide and is very specific to a solution, we align it to the point solution catalog.
Those follow our enterprise and business taxonomies at the schema level. Then the gold layer is more of the application catalog. It follows the portfolio and the product, that is, how the data consumption happens; we align based on that. There is also a utility layer, where we maintain the configuration for all the layers. We have a couple of business-related catalogs, but at a very small scale.

All the data assets we have are cataloged in Collibra, our enterprise data catalog. Any data asset anywhere in the data lake or warehouse has to be cataloged there. Right now we are doing that more as manual scanning.

For our data science friends: as usual, we may not have quality data in the lower environments. So we defined an infrastructure process where data scientists can build and train their models with production-quality data. Once they are good with their thesis, we follow the regular CI/CD to operationalize it. Currently we have early-stage integration with our other major tools, like the CRM (we have Salesforce and so on); we are at a very early stage there.

So, after these 14 to 15 months of Unity Catalog, do you think it's a perfect solution? No. We came across many challenges and limitations. I just want to call out some of those limitations and the lessons learned from Cloud V1.x.
The whole thing was built on a single storage account, and user provisioning happens only at the schema level. Some of the schemas are highly skewed. In the medallion I described, the gold layer on the right side is heavily crowded: a lot of business units have their own schemas there. We tried to fit all the use cases into the medallion, especially into the gold layer. Unfortunately each of them has its own SLAs, RTO, and RPO. As I said, we have a limited number of catalogs, and we have both managed and external tables and volumes, all stored in the same storage account right now. That becomes really difficult when we want to untangle it to do some sort of disaster recovery or anything like that.

With the lessons learned, we started planning how to come up with a better approach for the wider scale-out. So this is the move from V1.x to V2.0. There is of course a left and a right, the data sources and the data consumers, but I have put only the heart of it here. We are adding more UC catalogs alongside the current medallion catalogs. On the right-hand side, "BU-N catalog" means business unit catalogs 1 to N; we are expecting more and more of them. And we are isolating all these catalogs with separate storage layers. Earlier we had one storage account.
Now each of the catalogs, whether medallion or business unit, will have its own storage account, separated both horizontally and vertically. If you look at the second and third rows, the second is external and the third is managed; we are separating those out as well. If a business unit XYZ comes to us, we call it BYOSA: bring your own storage account. If they want everything managed, that's fine, they just need one. If they want both external and managed, they may need to bring two storage accounts. This helps us achieve our data mesh methodology: all our business units can treat data as a product and have more control over it. We are also working on a process so that anybody who wants a new catalog or schema can create it themselves in a self-service mode. Our platform team is working hard on that.

Along with that, we want to enable foreign catalogs for a wider range of data sources. Delta Sharing is also in the pipeline, and we have already started using some Iceberg tables. As we approach Cloud V2.0, by default everything will be Delta and Iceberg, especially from silver onwards. And of course we are planning to use Genie for intelligent responses.

All of this is really helping us untangle the managed versus external tables, and that helps with disaster recovery as well. So this is the disaster recovery that we are planning.
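As a rough illustration of the bring-your-own-storage-account layout described above, the following Python sketch generates the kind of Databricks SQL that would put each catalog on its own storage. It is not Edward Jones' actual provisioning code. The catalog names, storage account names, the `uc` container, and the `<catalog>_cred` storage credential are all hypothetical, and the sketch assumes the credentials and the external locations covering the managed paths already exist.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CatalogSpec:
    """One Unity Catalog catalog plus the storage account(s) it is isolated onto."""
    name: str                               # hypothetical catalog name
    managed_account: str                    # storage account for managed tables
    external_account: Optional[str] = None  # second account, only if external data is needed


def abfss_root(account: str, container: str = "uc") -> str:
    """Build an ADLS Gen2 URL; the container name 'uc' is an assumption."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/"


def catalog_ddl(spec: CatalogSpec) -> List[str]:
    """Return SQL statements that would create the catalog on its own storage."""
    stmts = [
        f"CREATE CATALOG IF NOT EXISTS {spec.name} "
        f"MANAGED LOCATION '{abfss_root(spec.managed_account)}{spec.name}'"
    ]
    if spec.external_account:
        # External tables and volumes sit behind a separate external location,
        # so managed and external data never share a storage account.
        stmts.append(
            f"CREATE EXTERNAL LOCATION IF NOT EXISTS {spec.name}_ext "
            f"URL '{abfss_root(spec.external_account)}{spec.name}' "
            f"WITH (STORAGE CREDENTIAL {spec.name}_cred)"
        )
    return stmts


if __name__ == "__main__":
    # A business unit that brings two storage accounts (managed + external).
    for stmt in catalog_ddl(CatalogSpec("bu1", "bu1managed", "bu1external")):
        print(stmt)
```

A business unit that wants only managed tables would pass a single account and get just the `CREATE CATALOG` statement; one that also needs external data brings a second account and gets the extra external location.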
On the left-hand side there is the data source and on the right the consumer, and you see three bunches of circles there. On the right side you can see the legend: a circle means the compute and the storage for each layer. We also have the Azure-managed replication and the Databricks-managed replication. On top is the primary region; the bottom is the secondary, or failover, region.

On DNS: Databricks is probably going to announce this, maybe tomorrow. They are coming up with their own metastore-level replication for disaster recovery, and that's what we are planning to use. We are going to get our own DNS. That DNS, or stable URL, hides whether it's the primary or the secondary region. For any push from an upstream system or consumption from a downstream system, we will make sure they use the DNS provided by Databricks. For any pull into the data lake, say the source is Kafka, we may have to rely on Kafka's DNS or stable URL, so that if they switch over to their secondary region, we are not hampered. So that's about the DNS.

For the raw or bronze layer, we have mostly external tables and external volumes. For everything non-Delta we are going to rely on Azure managed services, either GZRS or GRS depending on the use case. For the rest, silver, gold, and any of the business units, we are planning to use Databricks-managed replication: a combination of Delta Sharing and deep clone. Hopefully next year even that will be taken care of by Databricks. We are hoping. Brett, I'm counting on you.
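The split described above (storage-level redundancy for bronze, Delta Sharing plus deep clone for the other layers) can be sketched in Python as a small planning helper. This is only an illustration under stated assumptions: the layer names, the `primary_share` catalog (standing in for a Delta Sharing catalog mounted in the secondary region), and the `silver_dr` target catalog are hypothetical, and the talk does not describe the actual job design.

```python
from typing import Dict, List

# Layer -> replication approach, following the talk:
# "azure" means storage account redundancy (GRS or GZRS) handles the copy,
# "databricks" means Delta Sharing plus deep clone.
REPLICATION_BY_LAYER: Dict[str, str] = {
    "bronze": "azure",
    "silver": "databricks",
    "gold": "databricks",
    "business_unit": "databricks",
}


def deep_clone_sql(source_catalog: str, target_catalog: str, schema: str, table: str) -> str:
    """SQL to run in the secondary region to copy one Delta table."""
    return (
        f"CREATE OR REPLACE TABLE {target_catalog}.{schema}.{table} "
        f"DEEP CLONE {source_catalog}.{schema}.{table}"
    )


def plan(layer: str, tables: List[str], shared_catalog: str, local_catalog: str) -> List[str]:
    """Return the clone statements for a layer; tables are given as 'schema.table'."""
    if REPLICATION_BY_LAYER[layer] == "azure":
        return []  # nothing to run: the storage account's redundancy covers this layer
    return [
        deep_clone_sql(shared_catalog, local_catalog, *name.split("."))
        for name in tables
    ]


if __name__ == "__main__":
    for stmt in plan("silver", ["clients.accounts"], "primary_share", "silver_dr"):
        print(stmt)
```

Keeping the layer-to-strategy mapping in one table makes it easy to hand a layer over to a platform-managed replication feature later without touching the callers.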
So, the key takeaways I want to call out: on the left-hand side are the limitations we had in Cloud V1.x, and next to them how we are overcoming each one with the new add-ons.

The medallion architecture and the business units were all under a single storage account. Those are now separated.

We had fewer catalogs and crowded securable objects in 1.x. We are taking a more balanced approach: we are carving a lot of things out of the gold layer into their own business unit catalogs.

The managed and external tables used to be in the same storage account. In the future we are going to separate managed and external, as bring your own storage account.

Right now we have limited integration with the enterprise suites, the major core components of the firm. We are working on wider adoption and data sharing through foreign catalogs, Delta Sharing, and so on.

Currently we are not optimal for disaster recovery, but we are planning for disaster recovery by the end of this year.

And the whole new 2.0 platform will be driven by Terraform scripts and Databricks Asset Bundles. We are doing that to some extent already, but the platform and engineering teams are coming up with a plan to make it more do-it-yourself. That's the plan.

Finally, I'm here representing a much bigger team: the platform, engineering, and solution architecture teams, our wonderful technology partners, and the amazing Databricks team. Our account executive Brett is here, and David is missing somewhere. It is because of their help that we are here at this stage.

To continue my quote from the beginning (again, Jamie stole my quote in the beginning, but it's okay, and I guess he repeated even this one to some extent, in different words), this is my own thought: as technology evolves, so do our opportunities. Change is constant, and the key to succeeding in this fast-changing world is learning to adapt as quickly as things evolve.

I appreciate all your feedback, surveys, anything. If you get one, just say something so that it helps me get better next time. Thank you all.

Original Description

This presentation outlines the evolution of Databricks and its integration with cloud analytics at Edward Jones. It focuses on the transition from Cloud V1.x to Cloud V2.0, which highlights the challenges faced with initial setup, Unity Catalog implementation and the improvements planned for the future particularly in terms of Data Cataloging, Architecture and Disaster Recovery.

Highlights:
Cloud Analytics Journey
Current Setup (Cloud V1.x)
  Utilizes Medallion architecture customized to Edward Jones need.
  Challenges & limitations identified with integration, limited catalogs, Disaster Recovery etc.
Cloud V2.0 Enhancements
  Modifications in storage and compute in Medallion layers
  Next level integration with enterprise suites
  Disaster Recovery readiness
Future outlook

Talk By: Dattatri Rao, Technical Architect, Edward Jones

Here's more to explore:
Unified and open governance for data and AI: https://www.databricks.com/product/unity-catalog
See all the product announcements from Data + AI Summit: https://www.databricks.com/events/dataaisummit-2025-announcements

Connect with us:
Website: https://databricks.com
Twitter: https://twitter.com/databricks
LinkedIn: https://www.linkedin.com/company/databricks
Instagram: https://www.instagram.com/databricksinc
Facebook: https://www.facebook.com/databricksinc

The video teaches viewers about the implementation and evolution of Databricks' Unity Catalog at Edward Jones, covering topics such as digital transformation, data governance, and data security. Viewers will learn about the challenges faced during the transition from Cloud V1.x to Cloud V2.0 and the improvements planned for the future. This knowledge is crucial for organizations undergoing digital transformation and seeking to improve their data management and analytics capabilities.

Key Takeaways

Design and deploy a Unity Catalog
Transition from Cloud V1.x to Cloud V2.0
Implement data governance and security measures
Use Collibra for data classification and cataloging and Immuta for data policies
Utilize Delta Sharing and deep clone for replication
Separate managed and external tables with bring your own storage account

💡 The implementation of Unity Catalog and the transition to Cloud V2.0 require careful planning and execution to ensure data security, governance, and quality.
