AI-Assisted SQL Joins

DataCamp · Beginner ·📊 Data Analytics & Business Intelligence ·5mo ago

Key Takeaways

The video 'AI-Assisted SQL Joins' by DataCamp demonstrates the use of AI-assisted SQL joins in data analytics, specifically using DataCamp's Data Lab and ChatGPT to generate and refine SQL queries.

Full Transcript

[music] >> Hello everyone and welcome to today's session. My name is Reese and I'll be your host and moderator for today. We're going to get started at the top of the hour in about 4 minutes or so. We're just waiting so everyone has a chance to join. However, thank you for joining early. If you haven't done so already, make sure that you register for this session using the QR code that's on screen. There's also a link to the registration page in the video description and you can also head over to datacamp.com/webinars where you'll find this session as well as all of our future sessions [music] as well. We've got uh sessions well, very frequently in February and then we've got a few sessions locked in for March. But yeah, keep an eye out on that page for lots of new stuff appearing very very soon. Uh this is the second part of a series that we're running this week called SQL Rookie to Rockstar. Yesterday we did AI assisted uh SQL for beginners. So check that out. It's live on our code alongs page if you haven't checked it out already. If you registered for today's sorry, yesterday's session, it will be in your email inbox >> [music] >> and there's also a link to it in the resources as well. So yeah, if you want to catch up with yesterday's session, the link to do so is in the resources [music] document that is linked in the chat and also in the video description as well. If you have any questions at any point throughout the session today, let us know in the chat. We're going to be covering your questions for the last 10 minutes of the session. So make sure that [music] you stick around for that. That's about it for me for the moment. I'll be back to repeat these messages for any new joiners [music] shortly. But in the meantime, check out the notebook if you want to code along with us today. We're going to be [music] using DataCamp Data Lab and you do not need a premium account to code along with us. You can do everything on a free account. So yeah, check that link out in the chat. I'll be posting it again very shortly. And yeah, you can make your own copy of the notebook we're going to be working in live today. Hello everyone and thank you for joining today's session. My name is Reese and I'll be your host today. We're going to get started very shortly in just under 2 minutes time. >> [music] >> Until then, make sure you're registered for today's session. There is a QR code >> [music] >> on screen. There's also links to do so in the chat and you can also head over to datacamp.com/webinars. [music] We've got another session tomorrow that's very interesting. So be sure to check that out. Although [music] I'll give it a bit of a plug uh later on. Uh if you joined late for today's session, I just seen your your comment Dre hobbies on YouTube. If you joined late, that's fine. Make sure you register because we can send you the recording as well as the resources [music] as well. So yeah, make sure that you register for today's today's session. And if you want to code along with us live, check out the link that I've put in the chat. >> [music] >> I put a message in saying code along with us in Data Lab and follow that link and you'll be able to create a copy of the notebook that we're going to be using today. If this is your If this is your first time in [music] a DataCamp code along, welcome. We're going to keep this nice and hands-on for today's session. And [music] yeah, we'll be starting very shortly. So one final time, welcome [music] everyone. If you haven't done done so already, please register for this session. If you'd like to code along with us today and yeah, start using the AI assistant that we've got in Data Lab, then please do check out the resources and the link to code along with us [music] in the chat. I'll be posting them again very shortly. If you have any questions at any point throughout the session, let us know in the chat as well. We're going to be answering your questions for the last 10 minutes of the session today. >> [music] >> And if you didn't catch yesterday's session on AI assisted SQL for beginners, there is a link to watch the recording in the resources [music] document as well. Uh thank you Tom. Just seen your your comment about I hope the AI servers are all warmed up. Yes. In theory, we've got everything ready. We shouldn't have any hitches today. So [music] yeah, let's all keep our fingers crossed regardless. Cool. And with that, I think I will appear on screen and we're ready to go. Okay. Hello everyone and welcome to today's session. As I mentioned, I'm Reese. I'm covering for Richie on hosting duties at the moment. Although Richie will be back next week to help you out with everything as he usually does as the host of these sessions. So uh yeah, let's get into it. So hello DataCamp data champs. This is the second part of a two-part code along series on AI assisted SQL and part of the SQL Rookie to Rockstar series as you can see by the little logo that we've got in the corner. Now, you don't need to have joined yesterday's code along to make sense of today. But you can check out the recording. It's linked in the resources. It's definitely useful if you are starting from the beginning. Today's session we have a guest and that is Sarah Billing. She is a senior data analyst at DataCamp. I'll bring you on now, Sarah. Nice to have you. Hi everyone. Uh so Sarah is a long-time DataCamper. Previously, she worked on the curriculum team where she contri- contributed to over 20 courses across SQL, Power BI, Tableau, Python, and Azure. And these days, Sarah manages DataCamp's data pipelines so everyone has access to the right data. So yeah, without any more further ado, I'll hand it over to you, Sarah. Thanks Reese and thanks for having me. Hi everyone. Very excited to be here today talking you through this AI assisted SQL joins webinar. Um before we start, I quickly want to say like it's a topic that I am really excited about. Although I've learned SQL pre-AI, there's like plenty of other um languages, technologies that I am learning like using using different tools like ChatGPT, Gemini, or whatever there is out there. So I'm very happy to talk you through this and yeah, I think we can get started. If you were at yesterday's session with John, you noticed there was a data set which was about rescues by the London Fire Brigade rescuing animals specifically. And we're continuing with that data set or a version of the data set in this notebook. So some of this might feel familiar. But the main goal is focusing on joints. Um joints um something you often need when you're working with data because often data isn't in one data source and need to bring stuff together. So how do we do that? How can we use SQL to do that? Let's have a look. What I've done for you today is I've prepared two tables. The first one which is called animal rescue summary is a summarized version of the table that you were working with yesterday. So you'll notice it has an area code and area which is basically a London borough. And at that level, we've aggregated some metrics. There is a number of incidents which is the number of animal rescues, the number of units deployed which are the number of fire trucks that were there. There was the total service duration hours. So how many hours the rescuers were going for and there's an average hourly cost. So everything aggregated at the area level. And then our second table that we'll use to enrich the data that we have in animal rescue summary is a census data table. Both of these tables are covering data from 20 11. In the census data, you'll notice there's UK boroughs and then for each of these there's demographic data. So about the number of residents for different ages as well. I quickly wanted to mention yesterday you worked with a CSV file. So in Data Lab, there's different options if you want to explore the data using SQL. You can add just in your files the data that you're working with. I haven't done that here. I've used a different approach just to show you that different stuff is possible. So there's a bunch of databases here that are public and you can connect with. What I've done for today's webinar, I've put the data in the projects data database. So you'll notice the tables that we're working with here. But if you want to create your own database data source connection, you can do that as well. In that case, your data will be won't be public. Your data source will be private to you. So imagine you want to set up a Postgres SQL base just like me today, this is the way to do it. And there's amazing documentation here written by our Data Lab team. So if you're interested in doing that, that's the way to set up a database connection in Data Lab. Okay, I think we're ready to dive in. Let's not wait too much longer. >> [snorts] >> Um similar to yesterday, I'll I'll cover this again. So, um I might repeat what you've heard already, but I think it's very important and worth repeating. Uh when you are working with uh an AI, there's a a workflow that we've put here uh that you can follow um that will help you get to the like the best results and writing the best SQL code. Um the way to go through it is basically before you have a question, think about uh the data that you have and what you want to achieve. So, uh before trying to answer a question, uh getting to a solution, uh look at the tables that you have. In this case, we have a database. Is there a schema um that these tables are in? Uh get an understanding. What are you looking at? What's the data? Now we know it's animal rescue data and census data that we're working with. Uh just like take it back and think about like logically, this is what we're dealing with. Um then after that, uh write uh a natural language prompt. And that's what we'll be doing. That's the flow that we'll be using here in the rest of the course. Um think about the question. Very often, [snorts] you can take a version of that and put that in a natural language prompt. The next step will uh let the AI generate the code. And then uh we want to understand the code. So, um I'll be helping you with that throughout this code along. Um the AI will generate some code. Let's have a look. Let's see if we can understand. This is an iterative process. So, as you do this more and more, it will become more and more clear and you can ask the AI questions about this as well. Uh the fourth step is um checking for errors or hallucinations. If you run a cell, you'll notice very quickly if there's errors. Uh if you notice there's like column names that aren't in your table uh or the AI's hallucinating in other ways, those are things you can catch yourself easily. Then we can run the code um and uh verify the results and then basically iterate. So, um working with AI is is typically an iterative process. Uh often times you you'll need more than one prompt to be able to get to the uh results. So, um yeah. This is the framework that you can use and that I use as well when I'm using an AI to to code. So, let's get started. Uh as a first task, um let's have a look at our our data. So, I mentioned we have two tables. Let's have a look how these tables work, uh what they look like. Let's have a glimpse. So, the first instruction is uh look at the first five rows of our rescue data. What I'll do um and what you can do as well in Data Lab, there's two ways to access the AI that's built in. So, either you click this button here or here on the right, there's an AI button. And this will open this prompting window, which will allow you to start um generating SQL code. If you're not using Data Lab or you're running SQL anywhere else, um you could just as easily uh use that tool. Like being ChatGPT, Gemini, or even like uh in VS it's all uh working in a similar vein. So, let's as a first step have a look. So, what would happen if I would put show me the rescue data. Here we go. So, the AI doesn't know what table to select from and it's basically going for rescue, which is a good attempt, I guess, but it shows us the first thing that we'll need to do when we are working with AI, which is give it context. So, depending on how the AI set up, it could be that it already has some context on the tables or the schema uh that you're working with. But if it doesn't, you'll notice pretty quickly and you know um how to solve that. Uh what you can do is be very specific and pass the table name. So, animal rescue summary, that's the table name we'll pass. Animal rescues summary. And if we accept this and run, and let's have a look. And there we go, it starts running. There we go. So, now this has queried um our table. It looked at all rows. So, in this case, there's 33. So, if my instruction is um look at the first five rows, then if I am more specific, it will update the SQL code. There we go. To show me only a subset of the data. There we go. So, um what SQL did is it did a select star, which is selecting every column that there is from our animal rescue summary table and showing only the first five rows. This is in a random order. So, we'll see this is our data set. Now that we know uh and I did put a typo here, so AI is okay with those. Um if we would do the same, I can just copy the code. I don't have to type it all again. Let me do the same here and let me do this for my other data set. Here we go. I can accept and run. And just like that, uh we're looking uh at our first two tables. So, as you can see, uh there's some demographic data here for the different areas uh in the UK that we have. And it seems like there's a region field here as well, which is a higher classification of each of the areas. Cool. So, uh that's us exploring the data. Then let's dive into joining. As I mentioned, uh joining, a very important thing when you are trying to uncover insights from data, because often times your information and data will be spread across different tables. Uh from looking at our data, I think it's pretty clear that the common column here would be area code and area. So, that would be a first hint for us to know like this is the the link between the two tables, the way that we can join. Uh so, that's something for us to remember. Um SQL has different types of joins. Uh I'll brief highlight um what these are in this webinar. So, uh the first one and the one coincidentally that the AI picks uh by default most often, I've noticed at least, is an inner join. So, an inner join um is basically um or the type of join uh will define how uh your rows are filtered. So, in this case, if you do an inner join, it's going to look at the rows that are in, let's say, in this case, the left table and the rows that are also present in the right table. So, you can see here this pink one, it finds in the right table and in the left table. So, that data is going to go to your result. And then this yellow four, L4, um is going to be in both tables as well, which means it goes to the end result. There's more rows in both tables, but because they aren't matched, they can't be found in both tables, they won't go to your end result. So, the thing to remember, inner join uh only looks at rows that are in both tables. Um as I mentioned, uh AI prompting tip, I put some tips here uh based on stuff that I've learned in the past. Um it typically defaults to an inner join, uh which is also a join. So, if you see join, just remember uh it has the functionality of an inner join. It's the same thing. So, let's give it a go. Let's ask the AI um to combine the data in both tables. So, what would happen if I would say combine the data from animal rescues and census data. So, let's just put it like this and see how well the AI does with this. Okay. Let me accept and let's run that. Let's see if that runs. An error. So, what did the AI do? Uh let's have a look. Did it get the AI okay. So, what happened here is I specified the name of the table as animal rescues data, where it's actually animal rescues summary. So, uh it wrongly put this table here, which means um it doesn't uh recognize it. Rather than me updating uh the prompt, let me actually go back and show you. There's this fix and explain button here, and usually the AI is pretty good at figuring out its own mistake. So, let's see if it There we go. Pretty cool. So, it realized that the wrong table name was there. So, if I accept now, and we run this one. There we go, that's running. So, what SQL code have we generated? Uh we're selecting a bunch of columns. Um it actually specified uh for the census data, it specifically sets which columns it's selecting. For the animal rescue data, it's selecting all the columns from there. It's an interesting choice. Um but I guess that works. Um and then we're joining, which is as I mentioned the same as an inner join. Um with the census data on area code. So, when you're using joins, it's always uh important to specify the column that you're joining on. So, in our case, we knew area code um or area, both of them I guess uh are are similar between the two tables and we can join on that. Area code or just in general when you're uh joining stuff, uh codes are typically safer to join on than areas because it could be for example that um in one table, this is all lower case and the other table it's all upper case. Um area I mean then and then you wouldn't find a match. So, typically codes uh or IDs are uh the the way to go when you're joining. So, in this case, the AI figured it out perfectly uh all by itself um with the context that it has about our schema. If it doesn't, what you can always do in your prompt, put um specifically the name of the column that you want to uh join on. So, if you would have put something like this, um that would work as well. In this case, it updated it uh to use the star now. Um so, yeah. It gets the same result. You'll notice since it's non-deterministic, the AI sometimes yeah, returns different results if you're not specific enough with your ask. Um the last thing I want to mention here is uh as you noticed, there's these small um aliases is what they they're called in SQL uh behind the tables. Um So, the AI in this case put or named the animal rescue data AR and the census data CD. Uh so that it then can easily refer to uh which table it wants to get the area code field from later on. So, um best practices or uh what what what I like to do as well is make these as descriptive as possible, which makes your SQL code um nice to read. Um so, if you want to be more specific about your aliases, you can pass it in your prompt as well and you can tell the AI this is what my table should be named. Uh on the other hand, you could uh rename these yourself. Um so, there we go, our first join. So, you can see now we have the area code and area with the summary data from the animal rescues. And then here after that, we have again area code and area this time from the census data table together with all the demographic uh data there. So, that's pretty cool. We've combined um the two tables. Um But there's a lot of that we're looking at at this point, so it probably makes sense to like limit the selection a bit. Um and also to give some clear aliases just to to follow the the best practice uh guidelines. So, what we'll do is exactly that in the next step. Um so, one cool thing you can do then is if you copy your SQL code, then we can iterate on that. You don't have to start from scratch. You can say, "Hey AI, here's the context from what I've been doing so far. Iterate on this." So, now I don't have to tell the entire thing of combine the animal rescue data with the census data, but I can just tell it to make a certain set of changes. So, in this case, I'll tell it alias the um animal [clears throat] Let's try to be a bit more vague and don't use the entire table name just because it's there already. So, it should be able to deduct that. Alias the animal rescues data as uh let's use the rescues areas and the census data as let's say just census. And let's run that. There we go. Uh the problem is it up Well, it updated um No, it didn't actually. It aliased uh it aliased it in a wrong way. So, it aliased the columns. So, something like this uh wouldn't work. So, we would have to be specific in telling it um that we wanted to alias the tables because what it did here is it aliased the fields, which is are two ways to use an alias or describe an alias in um in SQL. So, let's see if this one works. There we go. Okay. Did it perfectly now. So, we're aliasing our tables as rescues and census and as you can see in the on statement, uh the new aliases are used. And then we said as well, we want to select only uh certain type number of fields. So, I want to select the area from the rescues table, which is the same as the incidents table. Rescue or an incident are used interchangeably. And the uh number of incidents >> [snorts] >> and Yeah, there's a typo there, but I'll leave it uh because it should be able to figure it out. Number of residents. Let's see if I'm too optimistic about what the AI can do. There we go. That seems correct. Yep. Nice. So, this is a lot easier to read, right? Now we only have the three tables and we can start interpreting what we see in front of us. So, we've selected the area, uh the name of the area, which is uh easy to read now. We have the number of incidents, which is the number of rescues for each of these areas, and the number of residents uh as we know it from our census data uh all in one neat place. Pretty cool. Um let's refine that further because what would be cool now is if we could actually order them, which are um the areas that have the most incidents, for example. So, what I'll do is once again, I'll copy my SQL code. Don't have to start from scratch. I can iterate on what I have here. I could just as well, by the way, um just iterate here in the prompt, but I'm putting new um or I'm doing it in new cells just to keep um like uh the previous output available as well. So, here in the AI, I can now tell it to order by the most incidents. And it will now do it in descending order as well, which is pretty cool. Or it makes more sense, I guess. Uh since we were trying to answer the question about which one had the most incidents. And then we can see Southwark. I have no idea if I'm pronouncing that correctly. Sorry, people from the UK if I'm not. Um but uh that's the one that had the most incidents in uh 2011. Um nice. Pretty neat. Uh that was our inner join. So, I mentioned before, we have different types of joins. Inner join matches the data if it's present in both tables. So, what will have happened now is uh since in the rescue data, we only had London uh data and in the census data, we had the entire UK, uh it's only looking at um London areas. But what if What if we want to get the demographic data as we have it from the census in uh as we have from the census data and enrich it with incidents and the animal rescues if it's available, if it's a London area, but otherwise um just leave it leave it blank. There is a way to do that. Um as you might uh expect. And that is a left join. So, quickly walking through the difference between an inner join and a left join. So, inner join, it needs to be present or the rows need to be present. The uh the row that or specifically, I should have mentioned that before, I guess. The the row that or the field that you're joining on needs to be present on both sides. Um in this type of join, a left join, we'll need to specify the left table first. So, that's the first one in is going to be the first one in our statement. And we'll take all the rows from that table. And then we look in the right table, so the table we're going to join on. Uh if there's a match with the left table, then we'll add the information. Otherwise, we won't. So, walking through this visualization, the pink one, ID one, that's present in both both tables, which means we get the value from the left table, which is L1, together with the value of the right table, which is R1. Then, let's look at ID number two. The value L2, we'll add this one here. But there's no match for this one. There's no blue two here. Which means the value will be null. So, uh we can ensure by doing this that we take all the data from the left table and this is a good use case if you don't want to accidentally leave some rows out, but you know for sure the left table's what we want. So, if we would apply that to our um example here, um then uh we would put our left table uh the overall table, the table with most of the areas, which is our census data, and I will join the right table, which is the rescue data on only in cases if there is a match. And then we'll ensure that we keep all the demographic data from all UK boroughs. Uh the AI prompting tip is here. Here is sorry since inner join is typically the default, you need to use some language to specify um that you want to use a left join. Uh you can use words like all records from or everything in table A. And typically the AI is able to to figure it out by using those words. So, let's give it a go. So, the prompt will become um combine everything of or let's say all rows from um in this case census data with uh the animal rescues summary on area code. I'll be specific and mention that here. Let's see if that will trigger a left join. Nice. Okay, there it is our left join. It by default selected a bunch of columns again, which I'll leave as is and let's just run this code. Okay, so once again, we have combined the data from our census data, which now is first since that is our table here, the first one that we want everything from with the [snorts] data from the animal rescues table. So, you can see there's a bunch of empty values here. If I scroll, so some of them will have data. So, if I would go back to the area There we go. So, these are three London areas, boroughs, and those will have data whereas the other ones don't. Imagine I think at least that was my next question. We want to order this and make it a bit more clear um to interpret. Let's do exactly that. So, once again, let me copy what I did before. There we go. And then let's select the area. Actually, I can copy this as well. Here we go. Now, let's see. What if I just copy my entire instruction? That I guess is a big time saver if you have your data ask written down somewhere. Okay, what it's doing is it is selecting a bunch of of columns that I specified and then it's adding a where statement. So, let me accept and run. Nice. So, we're now looking only at the columns that were specified. This bit has stayed the same as what we asked it as how it was when we asked to do the combination before. And then it's added a where clause, which is filtering. So, in this case, it's looking where is the area code null. So, all of the data that's going to be surfaced now will be areas that are not in London. So, in other parts of the UK. Imagine you only want the London ones again. You could ask the AI, but some things are if you know the syntax um a bit quicker just to just yourself. Here you go. And then we have the at the other way around. So, these are all our areas in London with the two metrics combined. So, that's pretty neat. Maybe Reese, before I move on, I think we're more or less at the halfway point. Are there any questions already for me to answer or should I just continue? Yes, we do have a couple questions. Let me see if there's any that are relevant for right now. Oh, actually there there was a good one that Zadoff asked. So, um he said could anybody clarify when to use joins and when to use subqueries. I thought that was that was quite a relevant one. Yeah. No, it's it's a it's a good question. They each have their own use cases. What we're what you're doing with subqueries I'll say personally, it's not my favorite because it's not the like the most readable if you're you're nesting stuff, but subqueries or CTEs, which is a different way to do it, is where you write a query and then use that query like for example, this entire select from where statement. Let's lose the joins and then use that data to continue in other places with the data. So, depending on the use case, it could achieve the same, but the way to think about it for me at least is joins is really combining two data sources and with subqueries, you can continue like with different pieces of the data as you as you go on. It's yeah, much more diverse I'd say like what you can do with subqueries or CTEs than than what you can do with joins. That makes sense. Got it. Yeah. No, that's perfect. And then I think we had a data lab specific question from from Tom. So, Tom says in data sources we can export, can we download the post grace database or do they stay in DC? As far as I'm aware, they stay in DC, right? Or Yeah, you could if you wanted the tables, you could do a select star from the table and then there's this button here that you can press, which would allow you to download it as a CSV. Uh but in terms of like the entire database and go to set it up, I yeah, I don't think that's that's possible. Got it. Okay, yeah, I think we'll we'll head back in and we'll we'll head to the rest of the questions at the end of the session. Yep, that makes sense. >> Perfect. Cool. Let's get going and there's two more tasks I think before we reach the end. Yes, so we have our data, we've joined it. And as you've noticed so far, we're just selecting, but true value or the power of combining data sources is being able to calculate metrics on top of this. So, rather than just looking at number of residents and number of incidents side by side, we could calculate some metrics. For example, the number of incidents per resident to get some ratios or metrics like that. And that's exactly what we're going to be doing now. Uh we are going to calculate some metrics. In this case, it would be um cool to see how much money is spent per on rescues per resident. So, there is a a column in our rescue data, which is called average cost. Let me go up so we can quickly have a look at where we explored the data. So, there's an average hourly cost and there's also the number of hours, the total number of hours by region. So, if we would multiply these two, we would have the total cost per region in 2011. And if we would then combine that data with the number of residents that we know from the census data, we would be able to calculate the cost. So, let's do exactly that. The prompting tip I have for you here is um AI isn't always it depends on the metric, but it isn't always the best if you stay vague and say something like calculate the cost. Um but if you're more specific and telling it exactly like I was walking you through how you could do this calculation. If you tell it exactly how to do the calculation, then it will be much better. But let me try and do the vague way and let's see let's see what the AI comes up with. So, um let me rather than having to do the the joining again, let me start from here. So, I'll just select my joint data. There we go. Which is everything. Uh and let's then go from there. So, in this data now we have everything. We'll notice actually, it makes sense to keep the um the filter here. So, let me undo that. There we go. Let's only look at places where we have the data in both sides. What would be the other way to do that? Now that we're talking about SQL, there's the other type of join. So, if rather than this, I would do inner join and I know now we're not letting the AI generate it, but just for I guess demonstration purposes, um you could do an inner join and then we're back um at where we were during our first section. Uh so, sorry, back to what we were doing. So, what if I tell it calculate the cost per or let's let's say really vague. Let's say the average cost. I would expect that it wouldn't be able to figure it out from here. No. What it's doing is it's just looking at the animal rescue data and then averaging the cost there, which I guess makes sense because I didn't mention anything about residents. So, let's say we try and do something a little bit better and we say the average cost per resident. Let's see what it does with that. So now it's taking the average hourly cost and dividing it by It's doing something with uh the residents, which is uh about handling null values. I don't know why it added this specifically, but I guess it thought it needed to do that. Uh probably because division. It suspects there might be division issues. But anyway, so average hourly cost dividing by number of residents also wouldn't be our metric, so let's be even more specific. So let's say the average cost per resident, the cost is calculated by uh multiplying the um average hourly Also, I won't use the exact table name uh field names, and let's see if it can figure it out by the um total service uh hours. Let's try this. Okay. Now we're getting somewhere that starts to look uh correct. Let me accept and run this, and let's have a look. Here we go. So what it's done now, it's selected a bunch of columns, uh and then uh I guess each of the components that would make up the metric as well, which are these three fields. Then it's calculated the total cost, which is the hourly cost times the service duration, and then um our final metric, which is the one we're interested in, would be uh us dividing that number by the number of residents. So let's have a look uh for each of the regions. We have the units supplied, duration, average hourly cost, our total cost, and then the average cost per resident. Nice. Um if uh yeah, let's clean that up a bit. So while it's I'll leave this Yeah, I'll leave this, but let's just look at the area. So let's delete all the fields we don't need. And then just do it like this. I could have asked the AI to do it, but um sometimes uh it's common sense, I guess, like uh if you can make some syntax changes quickly yourself. Um there we go. So we have our metric. It would be cool to see now what which one or which of the areas has the highest average cost per resident. Uh and we're doing exactly that here. So what I'll do is I'll copy what I have. Uh and then let's ask the AI to clean that up. Um show the average cost per resident um with two decimals and order by it's in descending order. Something like this. And we're waiting. Okay, so it's added the round function, uh which correctly will round it to two decimals, and it's added this order by. So it's using the metric we just calculated and ordering in a descending order. That all seems to be correct. Nice. Okay, there we go. City of London. 14 cents per resident in 2011, uh which is three times more than the one below. Which does make me think what if we would put um our number of residents here. Um ba ba bum. Let me just copy this. Just trying to see what uh the breakdown is like. Why is this metric uh this high? Uh there we go. Number of residents is very small. I guess that makes sense. City of London uh probably has a lot of uh commercial um venues and stuff like that, and very little actual residents as they would appear in the census data. So even the smallest number uh of incidents or costs will will pull this number up. Okay. Nice. We know that now, I guess, in case anybody was wondering. Um Then let's do something a little trickier. Uh what if um instead of doing it by area, we'll do it by region. So remember in the census data, the regions the areas were grouped at a higher level as well, which is the region. Can we calculate this cost uh at that level as well? Uh let's the AI calculate Actually, I'll copy it over again. I guess I could start from scratch, but I'll I'll do this to give it the context on the calculation. Calculate >> [snorts] >> the average cost per resident by Let me just say by region and see what it does. Okay. Actually, it figured it out perfectly, because what's happening here is uh you're going to need a group by statement. Because before we were joining um the rows at the region level Sorry, at the area level. Now we need to go a level up, so we'll need to group it, and then we'll need to add um uh the aggregation functions to do this calculation as well. So let me accept and run. Okay. That looks good. So it correctly added the aggregation functions, which is uh sum uh everywhere. Yep. And we're ending up at a two cents average uh for London. So remember we only have London data, so in this case there's only one row. Um all the other regions we don't have the the incident or the the cost uh data for, so that's why only one row will show up. So we know here now we could compare it like we know the overall values. Uh two cents is the average, so there's everything from row 13 and above would be above uh the average there. >> [snorts] >> But yeah, pretty impressive of the AI that it could figure out this as well. Um let's move on to the last task. Uh and what I want to cover here uh is filtering. So there's two ways to filter. Um And they achieve slightly different things. There's uh a where statement, which I believe you used if you were at yesterday's session was used there as well. We added it earlier in this session where we were Yeah, I won't scroll up. Uh but when we were filtering for the the null rows. Um so that's one way of filtering, but that way you can only use before you do grouping. So as soon um as you start using a group by statement, you need to start thinking uh what type of filtering you want to use. So where filter, while you could still use it with the group by, it filters before you are grouping. Uh having, which is the other type of filter, filters after uh you've have have aggregated or grouped. So um using the example that you would see here, uh imagine you have some sales data for different regions. Um you can filter Or actually, let me let me use it with our data. It's probably a lot more uh intuitive than doing that um with a different data set. So if we would look at this example, we can't use a where statement um to filter on uh region Sorry, on area. So uh let's have a look. If I wanted to filter out specifically Sutton, Westminster, like any of these areas areas, I could do that with a where statement. Uh but imagine I want to filter based on uh grouped metrics, the group metric that I just created. In this case there's only one row, so it wouldn't make sense, but imagine I want to use this number. Um I would need to use a having statement. So the important thing to remember is um do you want to filter before aggregating or after aggregating? And it will become a bit more clear in the examples we'll go through in a minute. Um the AI prompting tip I wanted to mention here is um you can if Usually it's pretty good at filtering figuring out uh which type of filter it needs to use, but you can use some natural language to prompt it to use a certain clause. Um so individual rows, for example, is something that you can use to trigger where clause. Or final calculated results, final grouped results, final aggregated results are all words you can use to trigger a having clause. Um So let's do exactly that. What we want to do now is filter for UK boroughs with more than 80,000 residents between 0 and 14 years old. This is something we should be able to answer with just the census data. We don't have any animal rescue incident data that would need to be related to this. Let me just um copy that in. So, I'll say filter census data. There we go, for UK boroughs with >> [snorts] >> more than 80,000 residents between 0 and 14 years old. And let's have a look. So, for some reason it added this clause where area like borough, um which doesn't make sense in the context of the data that we have. So, I will just delete that. We'll just take all the areas. Then, what it's doing is saying where the number of residents is not null, which technically, I guess it's correct as well. But, once again, we know from our data it's pretty clean and there are no no null values in there. So, uh yeah, I'll leave it in, but you wouldn't really need to do this. Um but, there we go. Um what we're getting now is for this variable, and the AI figured out pretty neatly which one we need to use. Uh only give me values over 80,000. And you can see all nine rows are over 80,000. So, this case a where clause was what we needed. Let's have a look at the next questions. So, now we're going to aggregate and then filter, which uh if you remember case, we need to use a having clause. Let's um Yeah, let's just put it uh let's let's make it fake and and not tell it too much and see how well the AI does. So, uh find, I'll say uh re- regions in the data. Uh indeed, because we want to look at regions. Uh find regions in the census data with more than 100,000 residents over 65 years old. So, now what it will need to do is it will need to first um sum at the region level and only then apply its filtering, because if you're if it's going to do the uh filtering at the area level, it will be uh incorrect. There we go. So, we have our census data. It's grouping by region. It's putting the region here and then summing the population over 65, which uh not sure. Yeah, that's not entirely the correct field name. It needs to be age over 65, but we can um correct that. But, other than that, it's looking pretty good. So, let me accept. And let me change this to age over 65. So, these are things that AI can can get wrong from time to time if you don't specify. Um but then, but I believe this should run. Yep. There we go. So, now we should only have uh regions uh above 100,000, which seems to be correct. To make it a bit more clear, let's update our query and let's order it. Um so, let me first order. Order, let me just say appropriately. What I want is on a scenario. Let's just say let's make it clear. Let's say order in descending order. Uh yeah, there's only one metric. I think it will figure out that it needs to order uh those Yeah, exactly. Needs to order the numerical value that we have there. Nice. So, there you go. And then, uh as a last thing, uh imagine we want to exclude the city of London. We know it's like a commercial area. We don't want to count uh any residents uh that are in that particular area. Whatever reason would be um valuable to do this, um >> [clears throat] >> then uh we'll notice that in that case, it should trigger where clause, because we're filtering at the area level. So, let me just tell the AI filter out city of London. Yep. And it is perfectly using the where clause to do it. So, now you should notice that the London number uh should be slightly lower than what we had before, because we're filtering this out if there are at least any people over 65 there. Um and that [clears throat] is a beautiful combination of group by and where sorry, of having and where uh all in the same SQL query. That was it for today. I just wanted to mention uh or reiterate, I guess, when you are using AI, um first step deconstruct to construct. What I've been doing throughout this whole uh lesson is asking it to generate some SQL code and then iterating on the code. So, I'm just asking it to do bits at a time. Don't put a five-sentence prompt and then expect it to do everything at once. Um it it can in some cases do that, but uh I've noticed um iterating is the way to go. Be the editor. Think about what you're looking at. So, if you see field names are off, it's hallucinating um yeah, column names as it says here. Be aware that that it can get that stuff wrong. So, have a look at your SQL code and change it if you spot errors, which brings me to the next one, the error loop. Um we use it once, I guess, in this code along. But, if the syntax is wrong, it's generating an error. AI is pretty good at figuring it out its own um problems or errors. So, um just feed it back in and it can help you. And then, the last thing, context is king. So, um always tell uh the AI the context of what you're trying to achieve, what your tables, your columns are, and the better the context, uh the better the code will be. And that was it for me today. Any more questions, Reese? Perfect. Very slick. Um yes, yeah, we've got a few more questions. And for everyone else, if you have thought of a question but you haven't asked it yet, now is the time. We're going to be running through them. Uh but yeah, we've got some good ones from the audience. So, um where to start? That's the that's the real question. Uh so, we've got a couple on choosing when to use specific joins and things like that, but I won't start with those. Um Actually, uh before I get to the questions, I actually highlighted two comments that um that I wanted to read to you from the audience. Uh so, we had this one from uh JD on YouTube. He said, uh "I've been struggling with the more advanced SQL courses, and these sessions are really helping me go over the hump." So, yeah, glad you're getting some value out of it, and thank you, Sarah. Um and second follow-up from from JD as well. I like how Sarah uh starts out with the vague prompts and gets more specific. You can see how the AI generates better uh results the more specific you get. Is that something that you you baked into this when you were when you were planning the session? Yeah, a little bit. I guess I could have gone with uh the full-fledged prompt from the start. But, just like in real life, you don't always know. Uh you learn, I guess, over time what what the things are to pay attention to. But, it's the process as it will look like for me in in most cases. Yeah. >> [snorts] >> Cool. Um we've got a question from Rich. Hopefully, this one's uh pretty straightforward. But, he says, "How can we get a our results table to um left-justify results rather than having results on the right-hand side?" So, I guess on that table that you've got on on screen at the moment. Yeah, what how how can we fix that? Uh so, the question is left-justify >> the results. results? In that um total over 65 column, can we get those results left-aligned in that column? Is that simple? Uh you mean like actually putting this column on the left and the region column on the right? Or I think it just means getting the um the entries in the cells left-aligned to that column in the same way that the column header is is left-aligned. Ah, okay. Okay. Okay. Oh, sorry. Sorry. Got it. Um foof, uh I don't think Data Data Data Lab uh would uh allow that functionality. I don't think so. Maybe in different tools it would allow you to do the formatting. It could be. Uh but, I don't think it's possible. Okay. One one for a future code along, then. Um yeah, maybe one where we're not using Data Lab. Um cool. So, I'll move on. So, uh got another one from Tom. So, Tom says, "Are there ways to add context when outside um of using Data Comes AI when using AI in your own work or databases?" So, yeah, what when when we're adding context to the AI AI, what are we doing there? Yeah, honestly, it's it's very dependent on the tool uh that you're using. But, there are ways. So, uh if if you're very simply and like uh roughly, let's just say you're using a tool like ChatGPT, you could start off your conversation just by putting the um the context of the the tables and the columns, maybe a data snippet even, uh and and just starting from there. Uh because it will remember uh what happened in that conversation, and like the SQL code it generates will have the context of what is happening there. Um if you would set up uh your own uh in Data Lab your own database connection uh with just the tables you need uh in the schema that you're looking at, uh the AI is also programmed to specifically look at uh the schema of the the data source connection you have here. In In this case, uh you noticed it was pretty good in most cases to um like uh use the correct column names, but there's a bunch of data in this project's database, which is why I think it's it struggles to do it like exactly right, but um that's the way like some tools uh like DataLab um are set up that way that it looks if you connect to the database schema at exactly what is in there. Um and there's other ways like if you're uh yeah, if I'm trying to think there's um custom solutions that you could provide like the context of the tables that you're working with to and just say like in all of the questions I ask you uh AI um use use this as context. So, uh for sure a lot of of possibilities there. Cool. Um we're kind of light on questions. So, yeah, if anyone's got any questions, we've got 5 minutes left and we can get through to those. Um we had a a comment earlier on in the session. So, um A Freelancer on YouTube said, uh you're working with data frames. I thought all Panda data frames were for Python and databases and databases with tables are for SQL. In a real environment, could you run SQL queries on Panda data frames? Is that a possibility? Yep. So, um you noticed correctly it says here data frame and the reason this is a DataLab functionality. I think other notebook tools do this in a very similar way, but what's happening here is uh while all of this like is um doesn't have any data frame or Python related to it yet, the output of this cell is going to go into a data frame. So, Q3 it's called. So, what I could do is I could create a new cell, a Python cell here, and I could work with my uh Q13 data frame and do any type of Pandas manipulation um on this. So, yeah, it's it's really way to start writing some SQL and then be able to use the output of that in Python to do much more complicated depending on what you want to do tasks. So, um yeah, that's what's happening here and lots of notebook tools will have similar functionality. Cool. Um we had a couple comments from people saying they weren't sure when to use a specific join. And there was also another comment from Sudha saying uh the difference between a cross join and a a union all. I know this one can be a little bit conf

Original Description

Code Along with Us in DataLab! https://bit.ly/4qF0KL2 Resources (including link to code along notebook): https://bit.ly/3ZIHdNP Register for this session to get the recording and resources sent to you! https://www.datacamp.com/webinars/ai-assisted-sql-joins Sara Billen, a Senior Data Analyst at DataCamp, will introduce the core principles behind database joins and show you how to write SQL join queries using AI assistance. You’ll learn how to prompt effectively when generating data analysis code and apply your skills to a real-world analytics scenario.
Watch on YouTube ↗ (saves to browser)
Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from DataCamp · DataCamp · 0 of 60

← Previous Next →
1 SQL Server Tutorial: Date manipulation
SQL Server Tutorial: Date manipulation
DataCamp
2 R Tutorial: Intermediate Interactive Data Visualization with plotly in R
R Tutorial: Intermediate Interactive Data Visualization with plotly in R
DataCamp
3 R Tutorial: Adding aesthetics to represent a variable
R Tutorial: Adding aesthetics to represent a variable
DataCamp
4 R Tutorial: Moving Beyond Simple Interactivity
R Tutorial: Moving Beyond Simple Interactivity
DataCamp
5 Python Tutorial: Why use ML for marketing? Strategies and use cases
Python Tutorial: Why use ML for marketing? Strategies and use cases
DataCamp
6 Python Tutorial: Preparation for modeling
Python Tutorial: Preparation for modeling
DataCamp
7 Python Tutorial: Machine Learning modeling steps
Python Tutorial: Machine Learning modeling steps
DataCamp
8 R Tutorial: The prior model
R Tutorial: The prior model
DataCamp
9 R Tutorial: Data & the likelihood
R Tutorial: Data & the likelihood
DataCamp
10 R Tutorial: The posterior model
R Tutorial: The posterior model
DataCamp
11 R Tutorial: An Introduction to plotly
R Tutorial: An Introduction to plotly
DataCamp
12 R Tutorial: Plotting a single variable
R Tutorial: Plotting a single variable
DataCamp
13 R Tutorial: Bivariate graphics
R Tutorial: Bivariate graphics
DataCamp
14 Python Tutorial: Customer Segmentation in Python
Python Tutorial: Customer Segmentation in Python
DataCamp
15 Python Tutorial: Time cohorts
Python Tutorial: Time cohorts
DataCamp
16 Python Tutorial: Calculate cohort metrics
Python Tutorial: Calculate cohort metrics
DataCamp
17 Python Tutorial: Cohort analysis visualization
Python Tutorial: Cohort analysis visualization
DataCamp
18 R Tutorial: Building Dashboards with flexdashboard
R Tutorial: Building Dashboards with flexdashboard
DataCamp
19 R Tutorial: Anatomy of a flexdashboard
R Tutorial: Anatomy of a flexdashboard
DataCamp
20 R Tutorial: Layout basics
R Tutorial: Layout basics
DataCamp
21 R Tutorial: Advanced layouts
R Tutorial: Advanced layouts
DataCamp
22 Python Tutorial: Time Series Analysis in Python
Python Tutorial: Time Series Analysis in Python
DataCamp
23 Python Tutorial: Correlation of Two Time Series
Python Tutorial: Correlation of Two Time Series
DataCamp
24 Python Tutorial: Simple Linear Regressions
Python Tutorial: Simple Linear Regressions
DataCamp
25 Python Tutorial: Autocorrelation
Python Tutorial: Autocorrelation
DataCamp
26 R Tutorial: The gapminder dataset
R Tutorial: The gapminder dataset
DataCamp
27 R Tutorial: The filter verb
R Tutorial: The filter verb
DataCamp
28 R Tutorial: The arrange verb
R Tutorial: The arrange verb
DataCamp
29 R Tutorial: The mutate verb
R Tutorial: The mutate verb
DataCamp
30 R Tutorial: What is cluster analysis?
R Tutorial: What is cluster analysis?
DataCamp
31 R Tutorial: Distance between two observations
R Tutorial: Distance between two observations
DataCamp
32 R Tutorial: The importance of scale
R Tutorial: The importance of scale
DataCamp
33 R Tutorial: Measuring distance for categorical data
R Tutorial: Measuring distance for categorical data
DataCamp
34 Python Tutorial: Plotting multiple graphs
Python Tutorial: Plotting multiple graphs
DataCamp
35 Python Tutorial: Customizing axes
Python Tutorial: Customizing axes
DataCamp
36 Python Tutorial: Legends, annotations, & styles
Python Tutorial: Legends, annotations, & styles
DataCamp
37 Python Tutorial: Introduction to iterators
Python Tutorial: Introduction to iterators
DataCamp
38 Python Tutorial: Playing with iterators
Python Tutorial: Playing with iterators
DataCamp
39 Python Tutorial: Using iterators to load large files into memory
Python Tutorial: Using iterators to load large files into memory
DataCamp
40 SQL Tutorial: Introduction to Relational Databases in SQL
SQL Tutorial: Introduction to Relational Databases in SQL
DataCamp
41 SQL Tutorial: Tables: At the core of every database
SQL Tutorial: Tables: At the core of every database
DataCamp
42 SQL Tutorial: Update your database as the structure changes
SQL Tutorial: Update your database as the structure changes
DataCamp
43 Python Tutorial: Classification-Tree Learning
Python Tutorial: Classification-Tree Learning
DataCamp
44 Python Tutorial: Decision-Tree for Classification
Python Tutorial: Decision-Tree for Classification
DataCamp
45 Python Tutorial: Decision-Tree for Regression
Python Tutorial: Decision-Tree for Regression
DataCamp
46 Python Tutorial: Census Subject Tables
Python Tutorial: Census Subject Tables
DataCamp
47 Python Tutorial: Census Geography
Python Tutorial: Census Geography
DataCamp
48 Python Tutorial: Using the Census API
Python Tutorial: Using the Census API
DataCamp
49 R Tutorial: A/B Testing in R
R Tutorial: A/B Testing in R
DataCamp
50 R Tutorial: Baseline Conversion Rates
R Tutorial: Baseline Conversion Rates
DataCamp
51 R Tutorial: Designing an Experiment - Power Analysis
R Tutorial: Designing an Experiment - Power Analysis
DataCamp
52 R Tutorial: Introduction to qualitative data
R Tutorial: Introduction to qualitative data
DataCamp
53 R Tutorial: Understanding your qualitative variables
R Tutorial: Understanding your qualitative variables
DataCamp
54 R Tutorial: Making Better Plots
R Tutorial: Making Better Plots
DataCamp
55 SQL Tutorial: OLTP and OLAP
SQL Tutorial: OLTP and OLAP
DataCamp
56 SQL Tutorial: Storing data
SQL Tutorial: Storing data
DataCamp
57 SQL Tutorial: Database design
SQL Tutorial: Database design
DataCamp
58 Python Tutorial: Introduction to spaCy
Python Tutorial: Introduction to spaCy
DataCamp
59 Python Tutorial: Statistical Models
Python Tutorial: Statistical Models
DataCamp
60 Python Tutorial: Rule-based Matching
Python Tutorial: Rule-based Matching
DataCamp

This video teaches viewers how to use AI-assisted SQL joins in data analytics, specifically using DataCamp's Data Lab and ChatGPT to generate and refine SQL queries. Viewers will learn how to perform data analysis using SQL joins and how to use DataCamp's Data Lab for data analytics.

Key Takeaways
  1. Add data to files in Data Lab
  2. Connect to public databases in Data Lab
  3. Set up a Postgres SQL base in Data Lab for private data
  4. Use an iterative process when working with AI to code
  5. Click AI button to access prompting window
  6. Type 'show me the rescue data' to query table
  7. Pass table name to provide context
  8. Use inner join to link tables
  9. Run SQL code to display results
  10. Alias tables AR and CD
💡 Providing context is key to getting accurate results from AI-assisted SQL joins

Related Reads

Up next
How AI, MCP & Tableau Extensions Are Transforming Analytics
Salesforce Product Center
Watch →