Data Analytics Deep Dives - Dataplex Explore

Google Cloud · Intermediate ·🔄 Data Engineering ·2y ago

Key Takeaways

Demonstrates Dataplex Explore for executing Spark SQL against BigQuery tables and external data sources

Full Transcript

hello everyone today we'll take a look at dataplex Explorer for executing some spark SQL against bigquery internal tables external tables and Hive tables we also will take a look at the notebook experience using dedic Plex Explorer requires a data proc metastore to revision and attach to your data lake so I've gone ahead and automated this process as well as created your your default environment so let's jump in and see what we need to do the first thing you need to do is run the dag dataplex with hmx deploy this stands for Hive metastore service or the dataproc metastore and this will provision your data Lake the dataproc metastore and attach the medicine metastore to the taxi data Lake there's also a destroy one this runs every 15 minutes and after eight hours it will remove your metastore and your Lakes to save on costs so mine says disable for now but typically this will be enabled and it runs and it will deprovision your environment automatically so we had to head over to dataplex and we'll go to manage and we'll see our data Lakes so you'll have a series of data lakes here and you can click on Taxi data Lake and click on details and you should see the metastore service up and running and it has been attached to this Lake in the environments tab I've also provisioned the default environment and set it as fast startup so those two are already done by the uh by the dag and then the teardown dag will destroy or remove these resources so the next piece is the script so let me just jump to the cloud storage bucket so the SQL script for this demo is in the code bucket so the code bucket in dataplex dataplex Explorer and then we have a SQL script and you can just click this link and this will go ahead and bring open that script and you can copy this code so once you've copied the code head over to the dataplex Explorer I currently have this open since I have everything run and inside the Explorer make sure you pick taxi data Lake and then you'll have a new SQL script and I've simply pasted it in here so this is that same code we just saw so let's take a look at the code and run through the examples so the code is going to execute big bigquery directly from spark SQL so these are going to be bigquery internal tables next we'll do the data Lake tables or external tables we'll create some Hive tables and insert data and then we'll go ahead and write a query that joins bigquery internal tables to The Hive tables to the bigquery external tables so when you run these queries you can select them and you copy it and here I've just put it in query one and you would paste it here so we're going to go ahead and run this I have the results below but we'll kick it off and then we'll jump over to bigquery and look at what it's doing so I'm going to open a new tab here and here let's take a look at bigquery so you can see what tables are occurring in this example so when you run uh and create your data Lake you'll see these additional data sets being created they have the random extension that matches your project there and these are scanned and automatically discovered by dataplex so for instance here we have our process tables and these are going to be external tables on your data Lake you can see their external tables but we're right now we're querying tables that are internal to bigquery so let me click here and it's running and let's take a look at the data So within the data I had to initialize a session so it's going to take an extra minute or two the session shuts down after 10 minutes so again you can also configure that so we are looking at the curated Zone and inside the curated Zone you'll see these process tables and they match what's in bigquery so we see all of our process tables and then within here you'll see our bigquery tables so these tables are bigquery tables and we'll head over to bigquery just so you can see what that looks like and these will be our bigquery tables taxi trips so these tables are part of dataplex and then we have our Discover tables that are also part of dataplex and you can tell the difference because it will have the name of the Zone with the random extension so looking here at the data exploration side you have your curated zone that includes your external tables and your internal tables and these are discovered by automatically by dataplex and then we have our raw Zone that also has um discovered tables so this query executed and you can see it ran and brought back our results so this was just a pure bigquery internal table query and then the second query here is going to do the same exact thing we could execute this and kick it off but what this is going to do it's going to go ahead and do the process so these are going to be the external tables that we'll go ahead and run so we don't have to initialize the session but it will have to go ahead and kick off that spark job the next piece here is grading tables so we can create tables at basically Hive tables inside our warehouse here if you look at the fault you will see we have this table this is called the taxi driver table and we can open up this table and I'll open up in a second and see where it's stored and everything so this is stored inside the hive warehouse and under taxi driver and this could be any GCS bucket so I went ahead and created those two tables so I have the two tables taxi and tax review and then I'm gonna went ahead and inserted data so this is just part of the script I separated it into separate tabs so it's easy to follow so I went ahead and inserted data into the first table and then into the second table and then I was able to select the data so you can do a select star here run this and this will go ahead and kick off that spark job to select start from that table and let's take a look at the other table we have our select star from tax review so this table are Hive based tables and then we have our final query here that you can execute and play around with would be the bigquery internal tables join to our Hive table which is then joined to our taxi lake table so it does get results and I'll kick this off as well and we'll let this go ahead and run so while this is running you can save these scripts and you can also do script sharing so if you wanted to share this script with other people you can go ahead and share this to other people in your organization and they can have access to them and they will show up here in the spark SQL scripts so you have security around the various scripts so when we look at the hive table let's see if we can open up the high table where those things are executing so inside the details tab you'll see the full path of this table since I didn't specify a uh when I did a crate table here I just use a relative path I didn't specify the full GCS path so it puts it inside the metastore bucket so typically I would put this in my own bucket so I know exactly where that table is so this script came back so you can see here we've joined between all three different environments bigquery internal bigquery external and our high table the other piece I like to jump into is data catalog so let me see if I can get back here and scroll all the way back good and data catalog we can see where these uh Hive tables are stored this opened up a new tab and it will load up the texture review and you can see even though we've created a hive table it is inside our data catalog and it's shown as dataproc metastore and then we can go ahead and see that schema and we kind of clearly see schema our lineage as well so that is uh available inside the data catalog in addition we can schedule this so I can go ahead and click create a schedule let's do a smaller script here so we can go ahead and create a schedule and we can give it a name we'll just call it test for now to schedule and repeat daily or you can do custom so you can go ahead and do a custom and do a crown expression and then you can pick your output folder name so in this case I would typically pick a folder over here and have it called dataplex Explorer output and our script format we can then export this as CSV Json or parquet so it gives you that flexibility I will and then your service account so typically I have a dataproc service account already provisioned and I would use this service account I will make one note that in order to run this it expects you to have the default Network set up since this demo doesn't have the default Network set up that option currently isn't available so I would be scheduling this with the rest API the rest API would let you run this and inside of the parameters here you can do some execution spec pick the service account here we can also look at the other pieces I think it's under uh uh spark we could pick our our envir our python script and then we can also pick our project and uh under one of these we'll have our infrastructure spec and that would let us run the environment and pick that so forget where it's at right now I'm checking um but let's see it's under execution spec it should be under one of these where it shows the actual environment and we can pick it uh let me check here here we go it's under notebook and under infrastructure spec I can pick the VPC Network and go ahead and pick the subnet so it is buried a little bit under there this would be the curl command so it's very useful to go ahead and copy this and then you can run this locally typically I replace this with a gcloud to get my current token but then you could schedule this if you have a different network until that option is available in the user interface so we also have the ability to do notebooks so I'm going to go ahead and open up a new tab and I'll just leave that stuff open and we'll look at the notebook experience so it doesn't load up those bar pieces but we will pick notebooks and here's a notebook that I brought in um so this is currently a existing notebook to bring in the demo notebook you would select import and the three dots up here and here we can find our notebook and that's going to be in the code folder and we'll pick dataplex dataplex Explorer and we can pick our notebook and I'll just call this demo for now and we can import The Notebook so I will go ahead and bring the notebook and again the ability to schedule a notebook is here you can click this you can share it and you can also schedule it so same scheduling user interface it just has an output directory so I'll click on that if I create a schedule it will be that exact same interface as the output name for your results but let's go ahead and open the notebook and give it a quick run through first time I don't know if when you run this it might authenticate I think I've already authenticated so let's go ahead and see if it will pop up there it goes so while this opens you're able to run a pi python code in within the notebook and also the bigquery keyword uh Magic keyword has been installed so for instance if I want to execute this cell I can go ahead and run it with the bigquery magic keyword and this will send this query over to bigquery and we'll get our results directly back in our notebook so it's very handy so if you're exploring bigquery tables you can go ahead and use the magic keyword you can also put the data into a data frame so if I execute this it's going to run a query and it's going to put the results into a data frame I'll go ahead and click display that data frame and then we can bring in a plot so we can go ahead and show a chart and that should bring bring in that library and then we will execute and show a graph of our results from our data frame if I give that a second to run and we have our results so jumping back to the notebook here you can share them you can change your environments so you might have more than one environment so that's available and you can go ahead and explore those options so I think that covers everything I wanted to show during the demo so go ahead and give dataplex Explorer a try and as always thanks for watching bye

Original Description

Provides an overview of Dataplex Explore for executing some Spark SQL against BigQuery internal tables, external tables and Hive tables. The demo also shows how you can use a notebook along with scheduling and sharing your artifacts. Everything is provisioned via an Airflow DAG using Terraform to setup the data lakes, Dataproc Metastore and the Dataplex environment so you can create your own queries. All code is on GitHub: https://goo.gle/dagd
Watch on YouTube ↗ (saves to browser)
Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Google Cloud · Google Cloud · 0 of 60

← Previous Next →
1 Top 3 ways organizations are adjusting their cloud strategies to prepare for economic uncertainty
Top 3 ways organizations are adjusting their cloud strategies to prepare for economic uncertainty
Google Cloud
2 Google Cloud Retail Search and Browse Console deep dive
Google Cloud Retail Search and Browse Console deep dive
Google Cloud
3 Google Cloud Backup and DR - How to mount, clone or restore a VMware VM
Google Cloud Backup and DR - How to mount, clone or restore a VMware VM
Google Cloud
4 Google Cloud Backup and DR - VMware vSphere Backup Overview
Google Cloud Backup and DR - VMware vSphere Backup Overview
Google Cloud
5 Google Cloud Backup and DR - Creating backup Plans for VMware VM backups
Google Cloud Backup and DR - Creating backup Plans for VMware VM backups
Google Cloud
6 Google Cloud Backup and DR - Compute Engine Instance Backups and Sole Tenant Nodes
Google Cloud Backup and DR - Compute Engine Instance Backups and Sole Tenant Nodes
Google Cloud
7 Google Cloud Backup and DR - Managing Service Accounts
Google Cloud Backup and DR - Managing Service Accounts
Google Cloud
8 Let’s solve for what’s next
Let’s solve for what’s next
Google Cloud
9 Google Cloud Executive Briefing Center | Cloud Space | Silicon Valley
Google Cloud Executive Briefing Center | Cloud Space | Silicon Valley
Google Cloud
10 Tinyclues with Google Cloud offers CRM Intelligence to maximize conversions
Tinyclues with Google Cloud offers CRM Intelligence to maximize conversions
Google Cloud
11 Aible partners with Google Cloud helping customers build predictive models within minutes
Aible partners with Google Cloud helping customers build predictive models within minutes
Google Cloud
12 TELUS streamlines big data ingestion with help from Google Cloud and Accenture
TELUS streamlines big data ingestion with help from Google Cloud and Accenture
Google Cloud
13 Getting started with Apigee API Management
Getting started with Apigee API Management
Google Cloud
14 Google Cloud Retail Search
Google Cloud Retail Search
Google Cloud
15 Building your first API proxy with Apigee
Building your first API proxy with Apigee
Google Cloud
16 Brands and agencies develop dynamic video ads with Connected-Stories NEXT and Google Cloud
Brands and agencies develop dynamic video ads with Connected-Stories NEXT and Google Cloud
Google Cloud
17 Redefining the transportation industry
Redefining the transportation industry
Google Cloud
18 Google Cloud Project Katalyst
Google Cloud Project Katalyst
Google Cloud
19 Israel's Family Court: Creating more compelling experiences for its citizens
Israel's Family Court: Creating more compelling experiences for its citizens
Google Cloud
20 Tausight partners with Google Cloud to help healthcare industry protect PHI activity & take action
Tausight partners with Google Cloud to help healthcare industry protect PHI activity & take action
Google Cloud
21 Google Cloud Retail Browse
Google Cloud Retail Browse
Google Cloud
22 Verifying API keys and debugging your API proxy flow
Verifying API keys and debugging your API proxy flow
Google Cloud
23 Getting started with Apigee API Management
Getting started with Apigee API Management
Google Cloud
24 Adding policies to your APIs
Adding policies to your APIs
Google Cloud
25 Google Cloud Backup and DR - Configuring Google Cloud VMware Engine to work with Backup and DR
Google Cloud Backup and DR - Configuring Google Cloud VMware Engine to work with Backup and DR
Google Cloud
26 Topaz Subsea Cable
Topaz Subsea Cable
Google Cloud
27 Episode 29: Building a culture of data literacy with Latin America’s biggest ecommerce platform
Episode 29: Building a culture of data literacy with Latin America’s biggest ecommerce platform
Google Cloud
28 Weshalb Datananalysten die Sparringspartner von Produktmanagern sein sollten
Weshalb Datananalysten die Sparringspartner von Produktmanagern sein sollten
Google Cloud
29 Warum und wie METRO eine Machine Learning-Pipeline implementiert hat
Warum und wie METRO eine Machine Learning-Pipeline implementiert hat
Google Cloud
30 Wie nutzt METRO Data Science, um geschäftliche Herausforderungen zu meistern?
Wie nutzt METRO Data Science, um geschäftliche Herausforderungen zu meistern?
Google Cloud
31 Google Cloud in Qatar. Let's get solving.
Google Cloud in Qatar. Let's get solving.
Google Cloud
32 Google Cloud for Qatar
Google Cloud for Qatar
Google Cloud
33 Doha has a new Google Cloud region
Doha has a new Google Cloud region
Google Cloud
34 The new Google Cloud region in Qatar
The new Google Cloud region in Qatar
Google Cloud
35 Build, tune, and deploy foundation models with Vertex AI
Build, tune, and deploy foundation models with Vertex AI
Google Cloud
36 Generative AI on Google Cloud
Generative AI on Google Cloud
Google Cloud
37 Who will be coming to Google Cloud Day Tel Aviv? #Shorts
Who will be coming to Google Cloud Day Tel Aviv? #Shorts
Google Cloud
38 Protect your organization at the edge
Protect your organization at the edge
Google Cloud
39 Google Cloud Backup and DR Alert Notifications setup
Google Cloud Backup and DR Alert Notifications setup
Google Cloud
40 Build, tune, and deploy foundation models with Generative AI Support in Vertex AI
Build, tune, and deploy foundation models with Generative AI Support in Vertex AI
Google Cloud
41 Where the Internet Lives: Data center on the prairie
Where the Internet Lives: Data center on the prairie
Google Cloud
42 Which developer program are you joining?
Which developer program are you joining?
Google Cloud
43 Lufthansa Group baut intelligente Systeme zur Vereinfachung des Flugbetriebs
Lufthansa Group baut intelligente Systeme zur Vereinfachung des Flugbetriebs
Google Cloud
44 How ASML revived Moore's Law and remade chipmaking
How ASML revived Moore's Law and remade chipmaking
Google Cloud
45 CMO of Unity celebrates Women's History Month
CMO of Unity celebrates Women's History Month
Google Cloud
46 Vint Cerf on Google Cloud Digital Leader
Vint Cerf on Google Cloud Digital Leader
Google Cloud
47 Mobile World Congress 2023
Mobile World Congress 2023
Google Cloud
48 Topaz - Canada
Topaz - Canada
Google Cloud
49 Google Data Cloud & AI Summit 2023: Reveal opportunities to transform your business
Google Data Cloud & AI Summit 2023: Reveal opportunities to transform your business
Google Cloud
50 Building a conversational bot with Google Cloud Gen App Builder
Building a conversational bot with Google Cloud Gen App Builder
Google Cloud
51 Elisa Polystar and Google Cloud partner to bring the power of analytics and automation to CSPs
Elisa Polystar and Google Cloud partner to bring the power of analytics and automation to CSPs
Google Cloud
52 Network modernization - how can CSPs start now?
Network modernization - how can CSPs start now?
Google Cloud
53 How Semios uses imported and remote models for inference with BigQuery ML
How Semios uses imported and remote models for inference with BigQuery ML
Google Cloud
54 Deliver your AI solutions up to 100 times faster with Google Cloud partner, Snorkel AI
Deliver your AI solutions up to 100 times faster with Google Cloud partner, Snorkel AI
Google Cloud
55 Capture consumer perspectives for CPG using NLP and analytics with Harmonya and Google Cloud
Capture consumer perspectives for CPG using NLP and analytics with Harmonya and Google Cloud
Google Cloud
56 Delivering Cloud-Native Network Transformation
Delivering Cloud-Native Network Transformation
Google Cloud
57 Proactively detect & investigate anomalies & data quality issues in BigQuery with Telmai
Proactively detect & investigate anomalies & data quality issues in BigQuery with Telmai
Google Cloud
58 Introducing AlloyDB Omni
Introducing AlloyDB Omni
Google Cloud
59 Episode 30: How Auto Trader transitioned to the cloud to analyze tricky customer data
Episode 30: How Auto Trader transitioned to the cloud to analyze tricky customer data
Google Cloud
60 MongoDB Atlas on Google Cloud
MongoDB Atlas on Google Cloud
Google Cloud

Related Reads

📰
What Can We Do When Memory Becomes the New Bottleneck in Data Engineering?
Learn how to overcome memory bottlenecks in data engineering using Pandas chunking, Dask, and Polars, and why it matters for processing large datasets
Towards Data Science
📰
Migrate from Ponder to Envio HyperIndex
Learn to migrate your indexer from Ponder to Envio HyperIndex to scale your data management
Dev.to · Envio
📰
Data Backfilling with Apache Airflow: Architectures and Implementations for Historical Data Processing
Learn how to implement data backfilling with Apache Airflow for historical data processing and improve your data pipeline's accuracy and reliability
Dev.to · Wangila russell
📰
Building a Production-Style Weather Analytics Pipeline from Scratch: ETL, ELT, Star Schema, and…
Learn to build a production-ready weather analytics pipeline from scratch using Python, DuckDB, and Apache tools, and understand the importance of ETL, ELT, and Star Schema in data engineering
Medium · Python
Up next
A Moment Frozen in Time | Arnav Iyengar | TEDxJenks Youth
TEDx Talks
Watch →