Data Analytics Deep Dives - Dataplex Explore

Google Cloud · Intermediate ·🔄 Data Engineering ·2y ago

Key Takeaways

Demonstrates Dataplex Explore for executing Spark SQL against BigQuery tables and external data sources

Full Transcript

hello everyone today we'll take a look at dataplex Explorer for executing some spark SQL against bigquery internal tables external tables and Hive tables we also will take a look at the notebook experience using dedic Plex Explorer requires a data proc metastore to revision and attach to your data lake so I've gone ahead and automated this process as well as created your your default environment so let's jump in and see what we need to do the first thing you need to do is run the dag dataplex with hmx deploy this stands for Hive metastore service or the dataproc metastore and this will provision your data Lake the dataproc metastore and attach the medicine metastore to the taxi data Lake there's also a destroy one this runs every 15 minutes and after eight hours it will remove your metastore and your Lakes to save on costs so mine says disable for now but typically this will be enabled and it runs and it will deprovision your environment automatically so we had to head over to dataplex and we'll go to manage and we'll see our data Lakes so you'll have a series of data lakes here and you can click on Taxi data Lake and click on details and you should see the metastore service up and running and it has been attached to this Lake in the environments tab I've also provisioned the default environment and set it as fast startup so those two are already done by the uh by the dag and then the teardown dag will destroy or remove these resources so the next piece is the script so let me just jump to the cloud storage bucket so the SQL script for this demo is in the code bucket so the code bucket in dataplex dataplex Explorer and then we have a SQL script and you can just click this link and this will go ahead and bring open that script and you can copy this code so once you've copied the code head over to the dataplex Explorer I currently have this open since I have everything run and inside the Explorer make sure you pick taxi data Lake and then you'll have a new SQL script and I've simply pasted it in here so this is that same code we just saw so let's take a look at the code and run through the examples so the code is going to execute big bigquery directly from spark SQL so these are going to be bigquery internal tables next we'll do the data Lake tables or external tables we'll create some Hive tables and insert data and then we'll go ahead and write a query that joins bigquery internal tables to The Hive tables to the bigquery external tables so when you run these queries you can select them and you copy it and here I've just put it in query one and you would paste it here so we're going to go ahead and run this I have the results below but we'll kick it off and then we'll jump over to bigquery and look at what it's doing so I'm going to open a new tab here and here let's take a look at bigquery so you can see what tables are occurring in this example so when you run uh and create your data Lake you'll see these additional data sets being created they have the random extension that matches your project there and these are scanned and automatically discovered by dataplex so for instance here we have our process tables and these are going to be external tables on your data Lake you can see their external tables but we're right now we're querying tables that are internal to bigquery so let me click here and it's running and let's take a look at the data So within the data I had to initialize a session so it's going to take an extra minute or two the session shuts down after 10 minutes so again you can also configure that so we are looking at the curated Zone and inside the curated Zone you'll see these process tables and they match what's in bigquery so we see all of our process tables and then within here you'll see our bigquery tables so these tables are bigquery tables and we'll head over to bigquery just so you can see what that looks like and these will be our bigquery tables taxi trips so these tables are part of dataplex and then we have our Discover tables that are also part of dataplex and you can tell the difference because it will have the name of the Zone with the random extension so looking here at the data exploration side you have your curated zone that includes your external tables and your internal tables and these are discovered by automatically by dataplex and then we have our raw Zone that also has um discovered tables so this query executed and you can see it ran and brought back our results so this was just a pure bigquery internal table query and then the second query here is going to do the same exact thing we could execute this and kick it off but what this is going to do it's going to go ahead and do the process so these are going to be the external tables that we'll go ahead and run so we don't have to initialize the session but it will have to go ahead and kick off that spark job the next piece here is grading tables so we can create tables at basically Hive tables inside our warehouse here if you look at the fault you will see we have this table this is called the taxi driver table and we can open up this table and I'll open up in a second and see where it's stored and everything so this is stored inside the hive warehouse and under taxi driver and this could be any GCS bucket so I went ahead and created those two tables so I have the two tables taxi and tax review and then I'm gonna went ahead and inserted data so this is just part of the script I separated it into separate tabs so it's easy to follow so I went ahead and inserted data into the first table and then into the second table and then I was able to select the data so you can do a select star here run this and this will go ahead and kick off that spark job to select start from that table and let's take a look at the other table we have our select star from tax review so this table are Hive based tables and then we have our final query here that you can execute and play around with would be the bigquery internal tables join to our Hive table which is then joined to our taxi lake table so it does get results and I'll kick this off as well and we'll let this go ahead and run so while this is running you can save these scripts and you can also do script sharing so if you wanted to share this script with other people you can go ahead and share this to other people in your organization and they can have access to them and they will show up here in the spark SQL scripts so you have security around the various scripts so when we look at the hive table let's see if we can open up the high table where those things are executing so inside the details tab you'll see the full path of this table since I didn't specify a uh when I did a crate table here I just use a relative path I didn't specify the full GCS path so it puts it inside the metastore bucket so typically I would put this in my own bucket so I know exactly where that table is so this script came back so you can see here we've joined between all three different environments bigquery internal bigquery external and our high table the other piece I like to jump into is data catalog so let me see if I can get back here and scroll all the way back good and data catalog we can see where these uh Hive tables are stored this opened up a new tab and it will load up the texture review and you can see even though we've created a hive table it is inside our data catalog and it's shown as dataproc metastore and then we can go ahead and see that schema and we kind of clearly see schema our lineage as well so that is uh available inside the data catalog in addition we can schedule this so I can go ahead and click create a schedule let's do a smaller script here so we can go ahead and create a schedule and we can give it a name we'll just call it test for now to schedule and repeat daily or you can do custom so you can go ahead and do a custom and do a crown expression and then you can pick your output folder name so in this case I would typically pick a folder over here and have it called dataplex Explorer output and our script format we can then export this as CSV Json or parquet so it gives you that flexibility I will and then your service account so typically I have a dataproc service account already provisioned and I would use this service account I will make one note that in order to run this it expects you to have the default Network set up since this demo doesn't have the default Network set up that option currently isn't available so I would be scheduling this with the rest API the rest API would let you run this and inside of the parameters here you can do some execution spec pick the service account here we can also look at the other pieces I think it's under uh uh spark we could pick our our envir our python script and then we can also pick our project and uh under one of these we'll have our infrastructure spec and that would let us run the environment and pick that so forget where it's at right now I'm checking um but let's see it's under execution spec it should be under one of these where it shows the actual environment and we can pick it uh let me check here here we go it's under notebook and under infrastructure spec I can pick the VPC Network and go ahead and pick the subnet so it is buried a little bit under there this would be the curl command so it's very useful to go ahead and copy this and then you can run this locally typically I replace this with a gcloud to get my current token but then you could schedule this if you have a different network until that option is available in the user interface so we also have the ability to do notebooks so I'm going to go ahead and open up a new tab and I'll just leave that stuff open and we'll look at the notebook experience so it doesn't load up those bar pieces but we will pick notebooks and here's a notebook that I brought in um so this is currently a existing notebook to bring in the demo notebook you would select import and the three dots up here and here we can find our notebook and that's going to be in the code folder and we'll pick dataplex dataplex Explorer and we can pick our notebook and I'll just call this demo for now and we can import The Notebook so I will go ahead and bring the notebook and again the ability to schedule a notebook is here you can click this you can share it and you can also schedule it so same scheduling user interface it just has an output directory so I'll click on that if I create a schedule it will be that exact same interface as the output name for your results but let's go ahead and open the notebook and give it a quick run through first time I don't know if when you run this it might authenticate I think I've already authenticated so let's go ahead and see if it will pop up there it goes so while this opens you're able to run a pi python code in within the notebook and also the bigquery keyword uh Magic keyword has been installed so for instance if I want to execute this cell I can go ahead and run it with the bigquery magic keyword and this will send this query over to bigquery and we'll get our results directly back in our notebook so it's very handy so if you're exploring bigquery tables you can go ahead and use the magic keyword you can also put the data into a data frame so if I execute this it's going to run a query and it's going to put the results into a data frame I'll go ahead and click display that data frame and then we can bring in a plot so we can go ahead and show a chart and that should bring bring in that library and then we will execute and show a graph of our results from our data frame if I give that a second to run and we have our results so jumping back to the notebook here you can share them you can change your environments so you might have more than one environment so that's available and you can go ahead and explore those options so I think that covers everything I wanted to show during the demo so go ahead and give dataplex Explorer a try and as always thanks for watching bye

Original Description

Provides an overview of Dataplex Explore for executing some Spark SQL against BigQuery internal tables, external tables and Hive tables. The demo also shows how you can use a notebook along with scheduling and sharing your artifacts. Everything is provisioned via an Airflow DAG using Terraform to setup the data lakes, Dataproc Metastore and the Dataplex environment so you can create your own queries. All code is on GitHub: https://goo.gle/dagd

Watch on YouTube ↗ (saves to browser)

Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Google Cloud · Google Cloud · 0 of 60

← Previous Next →

Top 3 ways organizations are adjusting their cloud strategies to prepare for economic uncertainty

Top 3 ways organizations are adjusting their cloud strategies to prepare for economic uncertainty

Google Cloud Retail Search and Browse Console deep dive

Google Cloud Retail Search and Browse Console deep dive

Google Cloud Backup and DR - How to mount, clone or restore a VMware VM

Google Cloud Backup and DR - How to mount, clone or restore a VMware VM

Google Cloud Backup and DR - VMware vSphere Backup Overview

Google Cloud Backup and DR - VMware vSphere Backup Overview

Google Cloud Backup and DR - Creating backup Plans for VMware VM backups

Google Cloud Backup and DR - Creating backup Plans for VMware VM backups

Google Cloud Backup and DR - Compute Engine Instance Backups and Sole Tenant Nodes

Google Cloud Backup and DR - Compute Engine Instance Backups and Sole Tenant Nodes

Google Cloud Backup and DR - Managing Service Accounts

Google Cloud Backup and DR - Managing Service Accounts

Let’s solve for what’s next

Let’s solve for what’s next

Google Cloud Executive Briefing Center | Cloud Space | Silicon Valley

Google Cloud Executive Briefing Center | Cloud Space | Silicon Valley

Tinyclues with Google Cloud offers CRM Intelligence to maximize conversions

Tinyclues with Google Cloud offers CRM Intelligence to maximize conversions

Aible partners with Google Cloud helping customers build predictive models within minutes

Aible partners with Google Cloud helping customers build predictive models within minutes

TELUS streamlines big data ingestion with help from Google Cloud and Accenture

TELUS streamlines big data ingestion with help from Google Cloud and Accenture

Getting started with Apigee API Management

Getting started with Apigee API Management

Google Cloud Retail Search

Google Cloud Retail Search

Building your first API proxy with Apigee

Building your first API proxy with Apigee

Brands and agencies develop dynamic video ads with Connected-Stories NEXT and Google Cloud

Brands and agencies develop dynamic video ads with Connected-Stories NEXT and Google Cloud

Redefining the transportation industry

Redefining the transportation industry

Google Cloud Project Katalyst

Google Cloud Project Katalyst

Israel's Family Court: Creating more compelling experiences for its citizens

Israel's Family Court: Creating more compelling experiences for its citizens

Tausight partners with Google Cloud to help healthcare industry protect PHI activity & take action

Tausight partners with Google Cloud to help healthcare industry protect PHI activity & take action

Google Cloud Retail Browse

Google Cloud Retail Browse

Verifying API keys and debugging your API proxy flow

Verifying API keys and debugging your API proxy flow

Getting started with Apigee API Management

Getting started with Apigee API Management

Adding policies to your APIs

Adding policies to your APIs

Google Cloud Backup and DR - Configuring Google Cloud VMware Engine to work with Backup and DR

Google Cloud Backup and DR - Configuring Google Cloud VMware Engine to work with Backup and DR

Topaz Subsea Cable

Topaz Subsea Cable

Episode 29: Building a culture of data literacy with Latin America’s biggest ecommerce platform

Episode 29: Building a culture of data literacy with Latin America’s biggest ecommerce platform

Weshalb Datananalysten die Sparringspartner von Produktmanagern sein sollten

Weshalb Datananalysten die Sparringspartner von Produktmanagern sein sollten

Warum und wie METRO eine Machine Learning-Pipeline implementiert hat

Warum und wie METRO eine Machine Learning-Pipeline implementiert hat

Wie nutzt METRO Data Science, um geschäftliche Herausforderungen zu meistern?

Wie nutzt METRO Data Science, um geschäftliche Herausforderungen zu meistern?

Google Cloud in Qatar. Let's get solving.

Google Cloud in Qatar. Let's get solving.

Google Cloud for Qatar

Google Cloud for Qatar

Doha has a new Google Cloud region

Doha has a new Google Cloud region

The new Google Cloud region in Qatar

The new Google Cloud region in Qatar

Build, tune, and deploy foundation models with Vertex AI

Build, tune, and deploy foundation models with Vertex AI

Generative AI on Google Cloud

Generative AI on Google Cloud

Who will be coming to Google Cloud Day Tel Aviv? #Shorts

Who will be coming to Google Cloud Day Tel Aviv? #Shorts

Protect your organization at the edge

Protect your organization at the edge

Google Cloud Backup and DR Alert Notifications setup

Google Cloud Backup and DR Alert Notifications setup

Build, tune, and deploy foundation models with Generative AI Support in Vertex AI

Build, tune, and deploy foundation models with Generative AI Support in Vertex AI

Where the Internet Lives: Data center on the prairie

Where the Internet Lives: Data center on the prairie

Which developer program are you joining?

Which developer program are you joining?

Lufthansa Group baut intelligente Systeme zur Vereinfachung des Flugbetriebs

Lufthansa Group baut intelligente Systeme zur Vereinfachung des Flugbetriebs

How ASML revived Moore's Law and remade chipmaking

How ASML revived Moore's Law and remade chipmaking

CMO of Unity celebrates Women's History Month

CMO of Unity celebrates Women's History Month

Vint Cerf on Google Cloud Digital Leader

Vint Cerf on Google Cloud Digital Leader

Mobile World Congress 2023

Mobile World Congress 2023

Google Data Cloud & AI Summit 2023: Reveal opportunities to transform your business

Google Data Cloud & AI Summit 2023: Reveal opportunities to transform your business

Building a conversational bot with Google Cloud Gen App Builder

Building a conversational bot with Google Cloud Gen App Builder

Elisa Polystar and Google Cloud partner to bring the power of analytics and automation to CSPs

Elisa Polystar and Google Cloud partner to bring the power of analytics and automation to CSPs

Network modernization - how can CSPs start now?

Network modernization - how can CSPs start now?

How Semios uses imported and remote models for inference with BigQuery ML

How Semios uses imported and remote models for inference with BigQuery ML

Deliver your AI solutions up to 100 times faster with Google Cloud partner, Snorkel AI

Deliver your AI solutions up to 100 times faster with Google Cloud partner, Snorkel AI

Capture consumer perspectives for CPG using NLP and analytics with Harmonya and Google Cloud

Capture consumer perspectives for CPG using NLP and analytics with Harmonya and Google Cloud

Delivering Cloud-Native Network Transformation

Delivering Cloud-Native Network Transformation

Proactively detect & investigate anomalies & data quality issues in BigQuery with Telmai

Proactively detect & investigate anomalies & data quality issues in BigQuery with Telmai

Introducing AlloyDB Omni

Introducing AlloyDB Omni

Episode 30: How Auto Trader transitioned to the cloud to analyze tricky customer data

Episode 30: How Auto Trader transitioned to the cloud to analyze tricky customer data

MongoDB Atlas on Google Cloud

MongoDB Atlas on Google Cloud

Related Reads

What Can We Do When Memory Becomes the New Bottleneck in Data Engineering?

Learn how to overcome memory bottlenecks in data engineering using Pandas chunking, Dask, and Polars, and why it matters for processing large datasets

Towards Data Science

Migrate from Ponder to Envio HyperIndex

Learn to migrate your indexer from Ponder to Envio HyperIndex to scale your data management

Dev.to · Envio

Data Backfilling with Apache Airflow: Architectures and Implementations for Historical Data Processing

Learn how to implement data backfilling with Apache Airflow for historical data processing and improve your data pipeline's accuracy and reliability

Dev.to · Wangila russell

Building a Production-Style Weather Analytics Pipeline from Scratch: ETL, ELT, Star Schema, and…

Learn to build a production-ready weather analytics pipeline from scratch using Python, DuckDB, and Apache tools, and understand the importance of ETL, ELT, and Star Schema in data engineering

Medium · Python

A Moment Frozen in Time | Arnav Iyengar | TEDxJenks Youth