Data Engineering with Databricks #dataengineering #databricks

Analytics Vidhya · Beginner · Data Engineering · 3y ago

Skills: ML Pipelines 80% · Data Literacy 70% · SQL Analytics 60%

Key Takeaways

Databricks is a cloud-based collaborative data science, data engineering, and data analytics platform that combines the best of data warehouses and data lakes into a lake house architecture, supporting big data, batch processing, and stream processing.

Full Transcript

We believe that these sessions are going to be a great source of enrichment and value addition for community members. Now on to our session today, the topic of which is data engineering with Databricks. Databricks is a one cloud platform for massive-scale data engineering and collaborative data science, and an open and unified platform for data engineering, machine learning, and analytics. I hope you are excited to attend this DataHour with us.

We are recording the session, and the recording will be available on our YouTube channel; the link will be given to you in the chat section. Please use the Q&A section for asking any questions you might have during the session, and we will do our best to answer them as the DataHour progresses or towards the end. We will be sharing a feedback form towards the end; you are requested to kindly fill that up before leaving the session. Thank you. Now on to our speaker. In this session of DataHour we have with us a speaker with an inclination to learn new tools and technologies, who has always worked with organizations that strive for mutual growth and co-elevation. She is currently working as a data engineer and data scientist at me coaching services.

All right, so welcome everyone to this session. As the moderator from Analytics Vidhya introduced me, my name is Sumama Fatima and I am currently working as a data engineer, and today we are going to talk about data engineering with Databricks. So I'm going to start by sharing my screen.

Okay, so before we start deep diving into what Databricks actually is, I'm going to give you an introduction to some of the popular data analytics terms. We are going to take a look at what big data is, recap the difference between data warehouses and data lakes, and then understand the problem that Databricks solved. Later on in this session there's going to be a hands-on overview of the Databricks platform, so you're going to get an idea of the different services that are available in
the data science and data engineering workspace of Databricks.

So, starting off: what is big data? We know that there are three main characteristics that define big data. The first one is volume: large volumes of data, that is, terabytes of data, are considered big data. The second is variety: this huge volume of data is generated in a wide variety of formats, so the data can be structured, unstructured, or semi-structured. A good example of structured data is your relational database. If we talk about unstructured data, then your images and your videos are part of your unstructured data. Semi-structured data is data that has some sort of structure in it but needs to be modified a little bit before it can have a tabular, schema-like structure, so your JSON files and your CSV files are part of semi-structured data. Lastly, the third characteristic of big data is velocity. So data that is generated with high velocity, high volume, and wide variety is considered big data.

When we talk about the velocity with which big data is generated and processed, there are two main things that we talk about: the first is batch processing, or batching, and the other is stream processing. In batch processing, data is processed with high latency, a high delay time, and usually large amounts of data are processed in batches. A good example: there is a company that wants to estimate how much sales it has made by the end of the month, so it gathers all the data, and at the end of the month the data is processed and the total sales are calculated. This is an example of batch processing. Stream processing deals with data that is generated in milliseconds and also needs to be processed right away, so the data is generated in real time and you need the data analytics in real time as well. An example of this scenario is Netflix. We all know that
when we are browsing Netflix, whatever movie or season we are watching, or whatever movie or season we add to our list, Netflix has to update the data right away, so the data is processed in real time. So these are the two types of data processing: batch processing and stream processing.

The next important thing is understanding the difference between a data warehouse and a data lake. Traditionally, data warehouses are equipped for working with structured, processed data, whereas data lakes can deal with all three types of data: they can store structured, semi-structured, and unstructured data. Data warehouses have schema-on-write, which means that they have an enforced schema: when you create a database you specify the schema beforehand, and the schema is always enforced whenever you write something to a table. You cannot violate the schema when you're writing to a data warehouse, whereas in a data lake the schema is on read. A data warehouse is expensive for large data volumes, so when we have very huge volumes of data, and even when we have streaming data, data warehouses give poor performance, whereas data lakes are designed for low-cost storage. Since data warehouses have a fixed schema, they are less agile and have a fixed configuration, whereas data lakes are highly agile and can be configured and reconfigured as needed. Data warehouses generally hold smaller volumes of data, and the data stored in them is generally cleaned, so it is suitable for business intelligence and reporting. The data in a data lake can be structured, semi-structured, or unstructured, which is why machine learning tasks and stream analytics models prefer a data lake.

Now, before Databricks was developed, the problem with data warehouses and data lakes was that data
warehouses were being used for business intelligence, and data warehouses dealt only with structured data, so you could only store your clean data in data warehouses. But if you needed your data for machine learning tasks, then you needed a data lake. So the flow that was being followed was this: there's a data lake that has all of your data, structured, semi-structured, and unstructured. You perform some sort of ETL operation on it, doing data preparation, validation, and transformation, and you store that data in a data warehouse. That data warehouse is then used for business intelligence and reports, and for your machine learning and data science tasks you still use your data lake.

The disadvantage of this model is that multiple copies of the data exist: there is one copy in your data lake and another copy in your data warehouse. One big disadvantage of this is the risk that different people are working with different versions of the data. Let's say your data engineer has got access to recent data, a newer version of the data, but that data is still in raw format and your data engineer is preparing it, while your data warehouse still has the old version, so your business intelligence team is still dealing with an older version. This created a problem: there was less collaboration between your data engineers, your business intelligence team, your data analysts, and your machine learning engineers. So Databricks proposed a platform that unified this.

Now let's take a look at what Databricks is. Databricks is a cloud-based collaborative data science, data engineering, and data analytics platform that combines the best of data warehouses and data lakes into a lakehouse architecture. So Databricks is a single platform on which all your data teams can work. You don't need different people working on different services: your data scientists, your data engineers,
your data analysts can all work on this one single platform.

The second part that defines Databricks is that it combines the best of data warehouses and data lakes into a lakehouse architecture. If you go back to the previous slide, we saw that for business intelligence and reports data warehouses were preferred, and for machine learning tasks data lakes were preferred. Databricks combines these two into a single architecture, so that one architecture can satisfy the needs of your business intelligence as well as your machine learning tasks. Databricks is hosted on all three popular cloud platforms: it is available on AWS, on Microsoft Azure, and on Google Cloud Platform.

The lakehouse architecture is actually built on top of a regular data lake, but what the lakehouse architecture does is bring reliability to the data lake. What does bringing reliability to a data lake mean? Like I explained before, a data lake can store all kinds of data: structured, semi-structured, and unstructured. One problem with this was that, because a data lake stores all kinds of data, even raw data is stored in the data lake, so there was an increase of data in the data lake that had no business value, and the data lake turned into a data swamp. More and more raw data was getting added to data lakes, and that data was not in a reliable format; it was not in a format on which you could perform your business intelligence or machine learning tasks. So it led to the development of data swamps, and because data swamps were not adding any business value, this became a problem. Databricks solves this problem as well. Databricks adds reliability to your open data lake, and the way Databricks does this is by using Delta Lake. Delta Lake is not a storage format; rather, Delta Lake is a series of rules that Databricks has defined, and it's sort of an additional formatting
on top of your open data lake to make your data more reliable.

So what is the Databricks Lakehouse Platform? The Databricks Lakehouse Platform combines the ACID transactions and data governance of data warehouses with the flexibility and cost efficiency of data lakes to enable business intelligence and machine learning on all data. Your data warehouses had structured data. The way Databricks works is that you have a data lake that contains structured, semi-structured, and unstructured data, and on top of that data lake you have a Delta Lake layer, which is the layer that brings reliability to your data lake. This Delta Lake layer has the capability of performing ACID transactions. If anybody here is familiar with data warehouses, they know that data warehouses have the capability of performing ACID transactions, but data lakes do not. ACID transactions are the reason why data in data warehouses was reliable, so by bringing that capability to data lakes, the Databricks Lakehouse Platform ensures that your data lake is also reliable.

So what are the components of the Databricks Lakehouse Platform? The first component is Delta Lake, which, as I explained before, is not a storage format but a set of rules and additional formatting that adds reliability to your data lake. Delta Lake is capable of bringing ACID transactions to the data lake. What are ACID transactions? ACID stands for atomicity, consistency, isolation, and durability.

Atomicity means that all your transactions either succeed or fail completely. What does this mean? For example, we have a table in a data warehouse and we are performing some sort of update on that table. Let's say that our update operation fails halfway. The question here is: will some of the rows get updated in the table, or will all rows be lost? What will happen in this case? Now the behavior that data warehouses generally provided, and which the
lakehouse platform also performs, is that transactions succeed or fail completely: either all rows are updated or no row is updated. This ensures the data is always reliable.

The second thing is consistency. Consistency guarantees relate to how a given state of the data is observed by simultaneous operations. Again, let's say that someone is performing a write operation on your table and another person is performing a read operation on it. The question is: what data will be viewed by the person who is reading the table, fetching rows from the table? Consistency ensures that whatever is the latest version of the table is always fetched, and both of these steps, the read and the write, are performed independently of each other without conflicting with each other.

The next is isolation, which refers to how simultaneous operations potentially conflict with one another. This is another state that data warehouses often get into, and that is the state of deadlock. Suppose that two different people are querying the same table. If the queries conflict with one another, it can lead to a state of deadlock, where your data warehouse does not know which query to execute first. Databricks solves this problem by isolating all queries: all queries are isolated from each other and executed serially, which ensures that there is no conflict of one operation with another.

And the last is durability, meaning that all committed changes are permanent. Once a transaction has completed, that change becomes a permanent part of your table.

The next thing that Delta Lake has is data versioning. In Delta Lake, all versions of your data are maintained by separate log files. There are two things that are saved when we create a Delta table in Databricks: the first is the data files and the second is the log files. Log files are the ones that keep track of what changes were made in the data and what
changes are actually committed, and through these log files we can roll back to any previous version of our tables.

The next thing that is part of the Lakehouse Platform is Unity Catalog. Unity Catalog lets you control the visibility of your data: you can control which persons in your organization have access to which data. So Unity Catalog provides a way of doing data governance and also a safe way of sharing data.

So let's take a look at the key advantages of the Databricks Lakehouse Platform. The first is transaction support, to ensure multiple parties can concurrently read or write data. As I explained previously in the ACID transactions part, all read and write operations are performed independently of each other, and when someone is reading from the table, whatever is the latest version of the table is fetched.

Then, in the lakehouse platform there is schema enforcement to ensure data integrity. Although the lakehouse platform is built on top of an open data lake, and as we saw before a data lake traditionally does not have schema-on-write, the Databricks Lakehouse Platform does have schema enforcement. This ensures that whatever data we are adding to our lakehouse has integrity.

Then there are data governance and auditing mechanisms: through Unity Catalog you can perform data governance.

Then the Databricks Lakehouse Platform has support for business intelligence tools. You don't need to transform and clean your data and push it into a data warehouse before you can perform business intelligence on it; rather, you can directly connect your business intelligence tools to your Databricks lakehouse. Power BI and Tableau have direct integration support with your Databricks tables.

Then storage and compute are decoupled. This means that you can scale up either of these without affecting the other, so you can easily add more users or you can easily
add more storage, and this is actually very cost effective.

Then the data that is stored in the lakehouse is stored in open and standard storage formats, and it has support for all types of data: you can store your structured, unstructured, and semi-structured data. So you can perform your machine learning tasks, stream analytics, business intelligence, all of that on one platform. And lastly, it also has support for end-to-end streaming, so you can process your data and report on it in real time.

So based on all of this, you can now see that all the problems that we had before with data warehouses and data lakes are now catered for by the Databricks Lakehouse Platform. With this one platform you can create your Power BI dashboards or your Tableau analyses, and you can use this one platform for your machine learning tasks as well as your data engineering tasks. All your data teams are working on one platform, which also means that you don't have to create multiple copies of your data. You don't have to have a separate copy in your data lake and then a clean copy in your data warehouse; rather, you have a single source of truth for your data in one platform.

So this is a slide that summarizes the points that I discussed before. Data warehouses were traditionally used for business intelligence and reports, and if we had a data lake with structured, semi-structured, and unstructured data, then you had to perform ETL first and dump your clean data into a data warehouse before you could perform business intelligence or create reports on that data. But with the Databricks lakehouse you have all your data in your data lake, and on top of that is the metadata and governance layer, which is the Delta layer. With one source of data you can perform all of your tasks: all of your business intelligence, reporting, data science, and machine learning tasks can be performed.
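
The reliability ideas described in this part of the talk (all-or-nothing writes, schema enforcement, and a log of versions that can be read back) can be illustrated with a small toy in plain Python. This is only a sketch under stated assumptions: the class `ToyVersionedTable`, its methods, and the `product`/`qty` columns are hypothetical names invented for illustration, and the code is not how Delta Lake is actually implemented.

```python
import copy


class ToyVersionedTable:
    """Toy in-memory table with a list of committed versions (illustrative only)."""

    def __init__(self, schema):
        self.schema = schema        # column name -> expected Python type
        self.versions = [[]]        # version 0 is the empty table

    def _validate(self, row):
        # Schema enforcement: reject rows with wrong columns or wrong types.
        if set(row) != set(self.schema):
            raise ValueError(f"columns {sorted(row)} do not match the schema")
        for column, expected in self.schema.items():
            if not isinstance(row[column], expected):
                raise ValueError(f"{column!r} must be {expected.__name__}")

    def append(self, rows):
        # Atomicity: validate every row first, then commit one new version.
        # If any row is invalid, nothing from the batch is written.
        for row in rows:
            self._validate(row)
        new_version = copy.deepcopy(self.versions[-1]) + copy.deepcopy(rows)
        self.versions.append(new_version)
        return len(self.versions) - 1   # version number of this commit

    def read(self, version=None):
        # Readers get the latest committed version unless they ask for an older one.
        snapshot = self.versions[-1] if version is None else self.versions[version]
        return copy.deepcopy(snapshot)


table = ToyVersionedTable({"product": str, "qty": int})
table.append([{"product": "cola", "qty": 10}])

try:
    # The second row violates the schema, so the whole batch is rejected.
    table.append([{"product": "cola", "qty": 5}, {"product": "cola", "qty": "many"}])
except ValueError as error:
    print("write rejected:", error)

print(table.read())            # latest committed version
print(table.read(version=0))   # an earlier version, read back from the history
```

Keeping every committed version in a list is what makes the rollback idea easy to see here; a real system records changes in log files rather than copying the whole table.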
Additionally, Databricks also has these other open source platforms available: Delta Lake, MLflow, and Apache Spark are all open source platforms that are available on Databricks. Apache Spark is a compute engine that is much faster than traditional compute engines, so your computations are much faster when you use Apache Spark. Spark is available standalone as well, because it is an open source platform, but Databricks utilizes Spark in all its operations. MLflow is a platform that helps in automating your entire machine learning life cycle. Again, this is also an open source platform, and it's available without Databricks as well, but Databricks has integrated all these open source functionalities into its one single platform.

Now let's understand the medallion, or multi-hop, architecture. This is the architecture that Databricks proposes; it is the way in which Databricks proposes customers should store their data in the lakehouse platform. In this multi-hop architecture there are three layers. The first layer is called the bronze layer, which is the raw data layer: you ingest your raw data, whether it is coming in batches or it is streaming data, into a single layer, the bronze layer. Then you have your silver layer on top of that, which is a filtered and cleaned version of the data. And lastly you have the gold layer. The gold layer is a layer that is catered to a very specific use case, and it has been prepared after all the aggregations and all the data cleaning, so queries on the gold layer are much faster. There is very low latency if you're querying from the gold layer, and this is the layer on which you can directly build your business intelligence dashboards and your machine learning workflows.

Now we're going to log into Databricks and see the different services that Databricks provides, live. For this session I'm using the Azure Databricks environment. As you can see, there are three personas
available here: the first one is Data Science & Engineering, then we have Machine Learning, and then we have SQL. So there are three different personas, and on the back end they're all using the same Hive metastore. The advantage of this is that whatever tables you create in your Data Science & Engineering workspace can be queried in your SQL or your Machine Learning workspace, so again the single source of truth is available for your different data teams.

In this session we are going to focus on the Data Science & Engineering workspace. Within this you can see we have a Workspace tab, a Repos tab, then a Recents tab where your five most recent files are located, a Data tab, a Compute tab, and a Workflows tab.

We are going to start with Compute. With this Compute tab you can create clusters. Clusters are just a collection of nodes: one node can be considered an individual computer, and a collection of multiple nodes is a cluster. Each node has its own CPU cores, and since Databricks helps us execute our tasks in this cluster mode, we have a much faster speed of executing our tasks. To create a cluster, you click on this Create Cluster button and you can give it any name, so I can call it demo one. You can decide whether it should be a single-node cluster or a multi-node cluster. A single-node cluster will have only a single node, one driver with four cores, whereas if you select multi-node then you will have a single driver and you can select the number of worker machines. You can use this access mode to grant access to users, so you can limit access here as well, and not every user on this platform will have access to this particular cluster. Here I'm only giving access to myself. Then you can select a Databricks runtime version; these runtime versions all correspond to different Spark and Scala versions. Then from here you can select the worker type, that is, how much memory and
number of cores each worker machine should have. Then in the advanced options you can enable extra facilities. You can enable the logging facility: if you check this, all of your cluster logs will be saved. Cluster logs tell us when the cluster started, and if the cluster fails for some reason, the cluster logs will tell us the reason for the failure, why the cluster crashed. I already have a cluster created for the purpose of this demo, and I'm using a single-node cluster for it.

Now this was an all-purpose cluster, but we can have job clusters as well. Job clusters are clusters that are specific to workflows. Workflows are basically automated jobs: you can configure automated jobs to run your Databricks notebooks on a schedule, or you can manually trigger your workflows. If we go to our Workflows tab, we can see that we have Jobs, Job runs, and Delta Live Tables. Like I said before, you can configure these jobs to run your Databricks notebooks at a scheduled time, or you can manually trigger them as well. You can give the job a name and select what the job is for, so I can select Notebook from Workspace, navigate to my workspace, and select the path of the notebook that I want to trigger. Let's say that I want to trigger this one notebook: I can easily do that just by selecting this path. From here I can either use an existing all-purpose cluster or I can create a new cluster, a job cluster.

The advantage of separating job clusters from all-purpose clusters is this: all-purpose clusters can be shared across all these resources, across different workspaces and repos. So if multiple notebooks are connected to the same all-purpose cluster and there is a lot of load on the cluster, there is a risk that the cluster can crash. That is why we have the functionality of
creating a separate job cluster. This job cluster will only get triggered when we start this job run, and it will automatically terminate once the job finishes executing. Because of this, the job cluster is only related to this particular task, this particular notebook. It's independent of the other notebooks that are present in the Databricks workspace, so it won't crash because of overloading issues.

Now let's navigate to our workspace. Our workspace is the place where all of our code files, all of our notebooks, are available. You can organize your workspace by creating folders in it: you can create a new notebook or a new folder, and then organize your different notebooks under that folder. I'm going to start by opening this notebook, ETL using Python. Here we can see the default language that has been set for this notebook. Databricks allows you to work in four languages: Python, SQL, Scala, or R. Here I have selected Python as the default language.

To execute the commands in your notebook you have to connect it to a cluster. I have this demo cluster present, I have it connected to my notebook, and the cluster has already started. If a cluster is not attached, you have to attach one before executing the cells in the notebook, and you have to make sure that your cluster is started. Only when your cluster is in the running state can you execute the cells in your notebook.

I can also add this notebook to Databricks Workflows through the Schedule option. If I click on the Schedule option, I can create a job on this notebook and set the trigger as I wish, so it can either be a manual trigger or a scheduled one, and I can again attach it to a new job cluster, or I can
use an existing cluster, and I can also add alerts, so I can decide which person should be notified when this job starts, succeeds, or fails. Here I can also add some specific parameters: some key-value pairs can be given here, and these can be referenced within my notebook as well. This can be anything. For example, let's say you have one notebook that you want to run for many different countries. Say we have data of multiple countries coming in, from Pakistan and from India, and it's the same type of data; again we can use the example of the sales data of a company. Let's say we have a single ETL written that is going to handle the data for Pakistan and the data for India, but before running the notebook we need to specify for which country we are running it. We can configure this key within our notebook, and at run time we can just specify that, let's say, country is Pakistan, and this key is passed into our notebook so that the ETL only executes for Pakistan.

Databricks notebooks have magic commands. Magic commands allow us to render Markdown in a cell, and Markdown is a way to make your notebooks more readable. Here you can see that by using the magic command %md I'm able to add multiple headings in my notebook, which makes it easier to navigate my entire notebook. If I wanted to jump down, I can easily jump down to this stage, so it is an easier way of navigating my notebook. You can also add bold text, italics, ordered lists, and unordered lists, you can add embedded links, you can add tables, and you can also add images like I have here; this syntax is similar to the syntax that is used in HTML to add images.

Okay, then the steps that I have performed in this particular notebook are as follows. First of all, I have mounted my Azure Blob Storage account with my Databricks workspace.
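
As an aside on the job parameters described above (a key such as country supplied at run time so that one ETL handles one country per run), here is a minimal sketch in plain Python. The function `run_etl`, the `params` dictionary, and the sample rows are hypothetical and only stand in for the notebook; in a Databricks notebook the value would typically be read with `dbutils.widgets.get("country")` instead of from a dictionary.

```python
def run_etl(rows, params):
    """Process only the rows for the country named in the job parameters."""
    # Hypothetical parameter key; in a Databricks notebook this value would
    # typically come from dbutils.widgets.get("country").
    country = params["country"]
    return [row for row in rows if row["country"] == country]


# Made-up sample rows standing in for the incoming sales data.
sales_rows = [
    {"country": "Pakistan", "sales": 120.0},
    {"country": "India", "sales": 300.0},
    {"country": "Pakistan", "sales": 80.0},
]

# The same ETL code runs for whichever country is passed in at run time.
pakistan_rows = run_etl(sales_rows, {"country": "Pakistan"})
print(pakistan_rows)
```

Passing the country as a parameter keeps a single copy of the ETL logic, rather than one notebook per country.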
Azure Blob Storage is Azure's data lake. I have a sample file prepared that is uploaded to Azure Blob Storage, and I want to ingest it first into my Databricks workspace and perform some transformations on it. So the first step is connecting with Azure Blob Storage, and for that I have to mount the blob storage with my Databricks workspace. First of all I'm going to import the libraries, and then I'm going to run this command to mount my blob storage. In this command I have specified the source, the path to my blob: my blob storage account name is Mama and the container name within the blob is demo. Then I've specified a mount point, and I've specified a key that allows me to integrate the blob storage with my Databricks, and now I can easily read data from my blob.

Here I'm using PySpark to interact with Spark, the compute engine that I talked about before, and I'm using Spark to read my file from my blob. This is the way that the output is displayed in Databricks. As you can see, Databricks shows a very schema-like structure: all the data is shown in a tabular structure, and it's only showing the first 1,000 rows, but I can re-execute this command to show all possible results. I can also view this data in a plot, so I can create a bar plot on top of this data, I can use controls like zoom in and zoom out, and I can go to the plot options to choose which keys and values I want to be shown here. I can go back to my tabular view, and I can also download this in the form of a CSV file.

I've also uploaded a file to the built-in Databricks File System. To view the Databricks File System, you go to the Data tab and select DBFS, Databricks File System, from here. So I have a file uploaded in
the FileStore, the sales_data file. To connect to the Databricks File System, this is the command that we run, ls on DBFS, and then we can execute this command to read our file. Now I can perform some transformations on it, and finally I'm going to group this data and save it as a Delta table.

What I've done in this particular ETL is basically this. This is just sample data: it has the price and the quantity sold for Pepsi, and it has data coming in at country, category, brand, and sub-brand level. We can see that we have the quantity sold for Pepsi 250 ml and we also have the quantity sold for Pepsi 500 ml, and I want to know the total sales that I've made in Pakistan for the whole brand, Pepsi. So the first step is calculating the sales value. First of all I'm going to calculate it at sub-brand level, which is very easy: I just subtract the discount from the actual price and then multiply it by the quantity sold to get the value of sales. Then I'm going to group this data at brand level to get the total sales value for the Pepsi brand for each date for which I'm getting this data. And lastly I have saved this data as a Delta table.

When saving the data to a Delta table we have two options. Here I've selected mode overwrite, but we can also select mode append. This is useful when, let's say, we have a future date available and we want to add the data from the new date into this Delta table; then we will use mode append.

Querying Delta tables is similar to how you write SQL queries on your regular tables; it's just the same syntax, so I'm going to select from sales. One thing you will notice is that I've added %sql in this cell. This is because the notebook's default language is set to Python, so to run another language within this
Python notebook I need to specify a magic command. I have specified %sql to make sure that this cell understands that I want to execute an SQL query. If I change the notebook language from Python to SQL, I can no longer run the Python cells without the %python sign. If I remove this %python sign, it gives me an error, as you can see, because it has not understood that this is a Python statement; it thinks that it is still an SQL statement. So to make sure that the cell understands this is Python code, I add the magic command %python, and now it executes fine.

There is another magic command available in Databricks notebooks, and that is the %run command. %run is useful for executing another notebook from within a notebook. I have another notebook here by the name test, and it has only one variable specified, name, which has the value test file. If I try calling print(name), it gives me the error that name is not defined. But if I run this %run command, it executes that entire notebook. Here it only has a single command, but %run executes the entire notebook, so every variable, every table, every view that is initialized in that notebook will be available in this notebook. If I run this again, it now understands the name variable and prints test file.

The next command tells you all the mount points. As we saw at the start of this notebook, we mounted our blob storage account to our Databricks workspace. If I now want to unmount my storage account, I execute this command; the unmount command will unmount the storage account from the Databricks workspace. Okay, so now the blob is unmounted, and we are going to try reading from the blob again, and now it's going to
give an error. As you can see, it has given the error: path does not exist.

So this was ETL using Python on Databricks. Now we're going to take a look at ETL using SQL, which is very similar to the SQL syntax that you use traditionally. Here I'm going to create a sample table. This command will of course give me an error if the table already exists; to avoid that I should execute the query as CREATE TABLE IF NOT EXISTS. Then I'm going to insert some data into this table, again with the same syntax that you traditionally use in SQL. I can also update existing records, and I can see that the existing records are now updated. I can delete records.

And this is an interesting command: MERGE. Databricks allows you to use MERGE, which can perform inserts, deletes, and updates simultaneously. I'm creating a new table here, and for each record I'm specifying the type of operation that should be performed: this is a row that should be inserted, this row should be deleted, this should be inserted, and this should be updated. In the merge query, first of all I specify a join condition, so I'm joining on ID. Then I specify that when the ID is matched and the type is update, the record has to be updated; if the type is delete, the record has to be deleted; and if the type is insert, the record has to be inserted. Now if we query the table again, we can see that the new records have been added here. We can also drop our table, just like in standard SQL.

Another thing that Databricks provides us is version control. If we are talking about developing ETLs, it's very important to have a version control system enabled to be able to build a CI/CD pipeline, so that you can productionize your code easily. Databricks provides Git integration as well.
First of all, you can view the version history of each notebook. So whatever you did, at any point in time you can see an older version of the notebook and you can restore that older version as well. Another thing that Databricks allows is syncing your individual notebooks, as well as repositories through the Repos functionality, with GitHub. And not just GitHub: you can sync with Azure DevOps and the other Git providers that are available.

To add Git functionality within Databricks, you go into your user settings, then into Git integration, and then you select a Git provider. I have already integrated with GitHub, but you can see there is a bunch of Git providers available here and I can select any of them. Here I'm just going to show you how to integrate with GitHub. I enter the Git provider username, and now I have to enter a token here, a personal access token that is generated from GitHub. Generating this token is very easy: you go to your GitHub account, go to Settings, scroll down to Developer settings, and then Personal access tokens. There you can generate a new token. You can specify the expiration date for this token, you can specify a name, and you can choose which functionality this token should have access to, say read, write, whatever. Then you copy this token and paste it here in your user settings. When you paste and save, Git is integrated with your Databricks workspace.

The next step is connecting your workspace notebooks with a GitHub repository. My notebooks are already connected with GitHub, but I'm going to show you the steps. Again I go into the version history, and here you can see the words Git: Synced. I am going to unlink and link again to show you how I did this. So when
the notebook is not synced, it shows Git: Not linked. To link it, you go here, select Link, and then add the link to your repository. This is my repository, so I'm going to copy this and add the link to my repository. Then I'm going to specify the branch. Initially it says master here, and this is an error on the Databricks side: it's not going to link with master. This is because GitHub now names the default branch of new repositories main instead of master, so if you try to link with the master branch it gives an error. You have to select the main branch from here, and now if you save, it's going to sync. Once it is synced, you can choose to make a commit to your branch. If I click on Save now, I can commit to Git and add a description, let's say first commit, and now this notebook is synced with my GitHub repository.

There is another way of integrating GitHub repositories in Databricks, and that is by using the Repos functionality. If we go to Repos, I'm going to click on Add Repo, and here I'm again going to give the link of my repository, select the Git provider, add the repository name, and submit. Okay, so here it is running into an error. I'm going to try again by creating a new repository, so I created a new repository and have given it the name test repo.

While this happens, I'm going to show you one more thing, and that is the last thing here: the Data tab. This is the tab that allows you to view all the tables that you have created, as well as DBFS, the Databricks File System. If I go back to the ETL using Python, if you remember, the table that I created at the end was this sales table. We can see that this table is stored in the default database. You can create your own databases as well, and you can add tables to those specific databases. If I go to this sales table, it shows me the schema, the data types, as well as some sample data
of this table. Basically, a Delta table is not actually a table: Delta tables are saved as Parquet files in the backend, but the Delta Lake feature that Databricks has enforces a schema on those Parquet files. So this is the format in which you are able to view them, and because of the schema enforcement, whenever we write to this table it checks the schema first and ensures that schema integrity is maintained. In that way it adds data reliability.

Another thing that Databricks helps us with is governing permissions for all of these. One way to do that is, as I talked about, creating a Unity Catalog. I have a Unity Catalog already added, and I can configure and add permissions via this Unity Catalog. Once you enable it, you go back into your Databricks workspace and you can set table access controls and cluster permissions, and make sure that your data is not viewed by users that you don't want to see it and that people don't have access to use all the clusters available. You can also limit the number of people who can query from a specific blob storage account. Like we saw, we've mounted a blob in our Databricks workspace, but we don't want everyone to have visibility of the data that is present in that blob storage, so we can limit that as well from here.

Another way of doing this, if you don't want to enable Unity Catalog, is to go into your admin console, go to your workspace settings, and select which access controls you want to enable. So there is table access control, which you can enable or disable, then job visibility control and cluster visibility control. All of these controls are enabled right now in my system. There is also the DBFS file browser: if I disable it, I will not be able to see DBFS in this tab (I have to refresh it), and once I enable it again I will be able to view the Databricks File System. So by using this admin console you can limit some permissions. So
this repository is ready now, and I'm going to show you how we can add a repository in a folder. Again, this is giving an error, so I'll have to take a look at this later on, but generally this is the way to do it and you can easily add any repository here. Generally we don't run into errors with this.

So there are two ways of integrating with Git: you can integrate with either the workspace option or the Repos option. Both of these options allow you to sync your Databricks notebooks with GitHub and commit your changes to your GitHub repository, and you can also get code that is saved in a particular commit in your GitHub repo and import it into your notebooks. Again, if you go into the version history, it shows me all of the things that I've done on this notebook. What was the last commit that I made? It's showing me this one commit that I made, and all of the changes were before that commit. At any point in time I can get the data from this commit, I can restore any version, and whatever has been committed to GitHub can be restored as well. An additional functionality is that if you don't want to have this backlog of extra revisions, you can just clear these revisions.

You can also see what activity has been done on your notebook. In this recent activity tab it is showing me that I have run a command at this point; whatever changes I've made, it shows you a history, and it also shows you the users who made those changes. So this is a way of viewing changes within your notebook.

The last thing from this workspace that I'm going to show you is how to export notebooks and folders. One way is that you go into File, then Export, and you export the source file, and this is going to export the entire file with
the extension of whatever language was selected here. If I selected Python here, it will export with a Python extension, and you can see now it got exported with a Python extension. I can also export an entire folder, and it's going to get exported in the form of a zip file. In a similar way, I can also import folders or notebooks into my Databricks workspace.

So this was an overview of Databricks. There are additional functionalities available in Databricks. For example, Databricks has Delta Live Tables, which help you in dealing with streaming data. Databricks also has a functionality by the name of Auto Loader that detects when a new file has come into a particular folder and automatically processes that file. Let's say you want to process some data as soon as it is available and you don't know when that file will get uploaded. One way of doing it is that someone manually checks the folder to see when the new file has become available and then starts processing it, but with Auto Loader, as soon as the file becomes available, it is processed.

At the end here I have mentioned two ways of trying Databricks for free. One is the Databricks Community Edition. This edition is free for everyone and you can practice here, but it has two limitations: one is that you cannot access the Repos functionality in the Community Edition, and the other is that you cannot create workflows. It basically lets you create a cluster and then play around with workspace notebooks. The other option, if you want to explore all the services available in Databricks, is to use the Databricks free trial. Databricks has a 14-day free trial, and you can get this free trial on any of the three clouds: Azure, AWS, or Google Cloud Platform. In particular, if you choose Azure, Microsoft Azure gives you $200 of free credit in the first month, so
you can use the 14-day free trial and then you can upgrade your Databricks and use it for another 15 or 16 days, as long as you still have those two hundred dollars available in your account.

I also have these notebooks added, so all the steps that I showed you on how to create and manage clusters, how to get started with Databricks notebooks, and how to create workflows are listed in these notebooks. And that's it from my side.

There are some questions in the Q&A section, if you could take them, please. Yes, okay.

Someone has asked the question: what is the process of ETL? ETL stands for extract, transform, and load. When you want to process any data, the data is first of all available in a raw format, so the first thing we have to do is ingest it, that is, get the raw data into our workspace. Whether that raw data is available in, let's say, Azure's data lake or Amazon S3, we have to make sure that we extract the data and load it into our system. That first step, the data ingestion or extraction step, is the E step.

The next step is transformation. The data is available in a raw format, and we have to make sure that we clean it. For example, let's say we have the same type of data coming in from multiple sources: one vendor is providing us data in the form of Excel files, another vendor is providing data in the form of a data warehouse, another is providing it in the form of Parquet files. So we have different sources giving us the same kind of data, but it can be for different categories or different countries; say Pakistan's data is coming in from a data warehouse and India's data is coming in as Parquet files. Now we have to consolidate that data in one format, clean it, perform some sort of mapping operations on it, and perform some transformations. This is the second step in an ETL process, and the
last step is load. Load is when you are done with the transformations, you're done with data cleaning, you have created whatever your requirements called for, you have done your feature engineering; the next step is loading the data so that it can be used for your BI, business intelligence, purposes or your machine learning. So this is the process of ETL.

Then someone else has asked: what's the difference between ETL and ELT? The difference is very simple. In ETL you extract the data, then you transform, and then you load. In ELT you first extract and load the data into your system, so the raw data is extracted and loaded and is now present there, and you do the transformation whenever it is needed. This is the difference between ETL and ELT.

Then the difference between a data lake and a data lakehouse. Again, as I explained at the start, a data lake is a storage format that holds all kinds of data: structured data, unstructured data, semi-structured data. But the data present in a data lake does not have any reliability guarantee because there is no schema enforcement, so data can keep on getting added to a data lake until it becomes a data swamp. A data lakehouse is not a storage format; it is a reliability layer on top of a data lake. You still have a data lake and it has all the structured, semi-structured, and unstructured data, but in a data lakehouse, when you create Delta tables with Delta Lake, it needs a schema for your data, so there is schema enforcement. Second, there are ACID transactions. ACID transactions were the power of data warehouses, and a data lakehouse has that power too: all your transactions are either committed or failed, and data is consistent across all the people who are reading it. Reads and writes are done concurrently and all the write operations are
serialized. So this added reliability and governance on top of a data lake is what transfo
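
The sales walkthrough and the extract, transform, load answer above can be sketched in plain Python. This is a minimal illustration, not the notebook from the session: the sample rows, the column names (`price`, `discount`, `quantity`), and the helpers `add_sales`, `total_sales_by`, and `load` are all assumptions, and an in-memory list stands in for the Delta table.

```python
from collections import defaultdict

# Extract: hypothetical sample rows standing in for the sales_data file.
raw_rows = [
    {"country": "Pakistan", "brand": "Pepsi", "sub_brand": "Pepsi 250 ml",
     "price": 50.0, "discount": 5.0, "quantity": 10},
    {"country": "Pakistan", "brand": "Pepsi", "sub_brand": "Pepsi 500 ml",
     "price": 90.0, "discount": 10.0, "quantity": 4},
]


def add_sales(row):
    """Transform: sales value = (price - discount) * quantity sold."""
    enriched = dict(row)
    enriched["sales"] = (row["price"] - row["discount"]) * row["quantity"]
    return enriched


def total_sales_by(rows, keys):
    """Transform: sum the sales value for each combination of `keys`."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[key] for key in keys)] += row["sales"]
    return [dict(zip(keys, group), sales=total) for group, total in totals.items()]


def load(table, rows, mode="overwrite"):
    """Load: replace the table contents (overwrite) or add to them (append)."""
    if mode == "overwrite":
        return list(rows)
    if mode == "append":
        return list(table) + list(rows)
    raise ValueError(f"unsupported mode: {mode}")


sub_brand_sales = [add_sales(row) for row in raw_rows]
brand_sales = total_sales_by(sub_brand_sales, ("country", "brand"))
sales_table = load([], brand_sales, mode="overwrite")
print(sales_table)
# [{'country': 'Pakistan', 'brand': 'Pepsi', 'sales': 770.0}]
```

In a Databricks notebook the same steps would more likely be written with PySpark: `withColumn` for the sales value, `groupBy(...).agg(...)` for the brand total, and `df.write.format("delta").mode("overwrite").saveAsTable("sales")` for the load, switching to `mode("append")` to add a later batch.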

Original Description

Databricks is a cloud-based collaborative data science, data engineering, and data analytics platform that combines the best of data warehouses and data lakes into a lake house architecture. In this DataHour, Umamah will introduce the set of fundamental concepts you need to understand in order to use the Databricks Data Science & Engineering workspace effectively.


Databricks is a cloud-based platform for data engineering, data science, and data analytics that combines the best of data warehouses and data lakes into a lake house architecture. This video introduces the fundamental concepts of Databricks and its features, including Delta Lake, Apache Spark, and Unity Catalog.

Key Takeaways

Create a cluster in Databricks
Select a Databricks runtime version
Enable logging facility
Create a single node cluster
Create a job cluster
Mount Azure blob storage account to Databricks workspace
Import libraries
Run command to mount blob storage
Use Spark to read data from blob storage
Perform transformations on data in Databricks workspace
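
The mount steps listed above come down to three values: a source URL, a mount point, and an access key. The sketch below only builds those arguments, so it runs anywhere. The helper name `build_mount_args`, the mount point `/mnt/demo`, and the key placeholder are assumptions; the account name `mama` and container `demo` are the ones mentioned in the talk.

```python
from typing import Dict


def build_mount_args(account: str, container: str, mount_point: str,
                     access_key: str) -> Dict[str, object]:
    """Build keyword arguments for mounting an Azure Blob Storage container."""
    host = f"{account}.blob.core.windows.net"
    return {
        # wasbs://<container>@<storage-account>.blob.core.windows.net
        "source": f"wasbs://{container}@{host}",
        "mount_point": mount_point,
        # Spark config key that carries the storage account access key.
        "extra_configs": {f"fs.azure.account.key.{host}": access_key},
    }


# Placeholder key: never hard-code a real key in a notebook.
mount_args = build_mount_args("mama", "demo", "/mnt/demo", "<storage-account-key>")
print(mount_args["source"])
# wasbs://demo@mama.blob.core.windows.net
```

Inside a Databricks notebook, where `dbutils` and `spark` are predefined, these arguments would typically be passed as `dbutils.fs.mount(**mount_args)`, after which a file can be read with something like `spark.read.csv("/mnt/demo/<file>.csv", header=True)`. `dbutils.fs.mounts()` lists the mount points and `dbutils.fs.unmount("/mnt/demo")` removes the mount. Reading the key from a secret scope with `dbutils.secrets.get` is a common alternative to pasting it into the notebook.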

💡 Databricks combines the best of data warehouses and data lakes into a lake house architecture, providing a scalable and cost-effective solution for data engineering, data science, and data analytics.
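
One warehouse-style guarantee the session attributes to the lakehouse is schema enforcement: a write is checked against the table schema first and is either committed or rejected. The toy model below illustrates that idea in plain Python. It is not how Delta Lake is implemented, and `SALES_SCHEMA`, `validate_row`, and `append_rows` are hypothetical names.

```python
# Hypothetical schema for the sales table: column name -> Python type.
SALES_SCHEMA = {"country": str, "brand": str, "sales": float}


class SchemaMismatchError(ValueError):
    """Raised when a row does not match the table schema."""


def validate_row(row, schema):
    """Check that a row has exactly the schema's columns with matching types."""
    if set(row) != set(schema):
        raise SchemaMismatchError(
            f"expected columns {sorted(schema)}, got {sorted(row)}")
    for column, expected_type in schema.items():
        if not isinstance(row[column], expected_type):
            raise SchemaMismatchError(
                f"column {column!r} expects {expected_type.__name__}, "
                f"got {type(row[column]).__name__}")


def append_rows(table, rows, schema):
    """Validate every row before writing, so a bad row rejects the whole batch."""
    rows = list(rows)
    for row in rows:
        validate_row(row, schema)
    table.extend(rows)


table = []
append_rows(table, [{"country": "Pakistan", "brand": "Pepsi", "sales": 770.0}],
            SALES_SCHEMA)

bad_batch = [
    {"country": "India", "brand": "Pepsi", "sales": 120.0},
    {"country": "India", "brand": "Pepsi", "sales": "n/a"},  # wrong type
]
try:
    append_rows(table, bad_batch, SALES_SCHEMA)
except SchemaMismatchError as error:
    print("write rejected:", error)

print(len(table))
# 1 (the rejected batch left the table unchanged)
```

Validating the whole batch before touching the table is what makes the write all-or-nothing here: the first row of the bad batch is valid on its own, yet it is not written either.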
