Scraping Web Datasets for Data Science Projects
Key Takeaways
The video demonstrates how to obtain scraped web datasets for data science projects using Bright Data and analyze the data with Python, using ChatGPT to generate the code and NLTK and TextBlob for natural language processing and sentiment analysis.
Full Transcript
We are going to use a massive 1.3 million record data set that was collected automatically using Bright Data, and then we're going to use ChatGPT to generate code for us to analyze that massive data set. Now, in one of my last videos I showed you how you can automate machine learning using ChatGPT, but we all know that data scientists spend around 80 percent of their time on gathering, cleaning, and preparing data for modeling. So I wanted to figure out what we can do if we combine the powerful data collection features from Bright Data and then use ChatGPT to generate code to analyze the data, because this way we could potentially do a whole data science portfolio project in just a few hours.
I found that they have a very large data set with Amazon product reviews, and I am currently in the market for new headphones to use at my other office. I thought it would be interesting to see if we can leverage data science techniques to figure out the best headphones for me based on my requirements and preferences. So I reached out to Bright Data and asked if they could provide me with a data set of headphone product reviews, and they were happy to collaborate with me on this video and provided me with a massive data set of 1.3 million headphone reviews. So this is a data set with real reviews that were scraped from Amazon using Bright Data, and in total there are 308 different headsets within this data set, most of them having thousands of reviews. That is really exciting, and that is what we're going to dive into in this video.
So let's see if we can load this up into a data frame. It seems to go just fine, even though it's a pretty massive data set. We have a timestamp, product name, and rating, and we also have a review over here, so this is in text. Now, one thing that I noticed when I was looking at this data set is that most of the objects in there have quotes around them as well, so each value is essentially captured as a string with double quotes around it, and we basically want to get rid of that. So let's see if we can ask ChatGPT to write a function for us and see what it can do. Certainly, okay, here's a function, remove quotes. Oh look, I think it did the job. That is awesome, first job successfully completed.
So now that the data is in the correct format, we can actually start to play around with this. Again, the goal is to find the perfect headphones for me, and I basically have three criteria that I find important for my headphones. First of all, I want really good noise cancellation. Second, I want them to be comfortable to wear all day. And third, I want them to have really good audio quality. So let's see if we can ask ChatGPT to extract useful information from the review text column. Okay, let's see what it can come up with. All right, so it's giving us this very thorough description of what we can do, and it already started to mention something about NLP, so we use natural language processing.
Okay, so ChatGPT provided us with a five-step plan that we can follow to analyze the reviews. Let's now see if we can turn those steps into Python code. I've found that if you're using ChatGPT for coding, you just have to be really specific and make it obvious what it is that you want. Instead of just asking it to do all of this, we're first going to ask: can you provide a Python function for step one, to clean the text? All right, and here's the output. It seems like a really neat function that we can use to perform basic text cleaning using the NLTK library, also removing stop words.
Let's see if we can apply this function to the whole column with all of the 1.3 million records. Okay, this is actually taking quite some time. Okay, and that is working, but it took way too long, so it's probably not the best idea to try this out on the 1.3 million records. So let's first create a subset to validate our code, and we'll randomly take 10,000 records. Yeah, okay, so we got it. We convert to lowercase, we remove punctuation, and then we tokenize the text and remove stop words. So here we are preparing the text with this function so we can analyze it later. This is very common in natural language processing, but normally it takes a lot of time to engineer a function like this. Let's make it 100,000, and now this will take some time.
While we're waiting for this to finish, let me show you how you can use Bright Data as well to collect your data, because this is actually really cool. You can do this in two ways, which I will get into in a bit: you can use their data sets or their web scraper IDE. But how this works, and this is actually really cool: they use a very large proxy network around the world to target websites from different proxies from all over the world, because if you've ever tried to scrape websites with Python, for example, you will run into sites that are going to block you.
Now, like I said, you can use Bright Data in two ways: you can use the data sets that are already available, or you can use the web scraper IDE. If we go to the products over here, you basically have the option, so I can build a scraper over here and start from a template, so YouTube, or let's go with Amazon in this case. It will fire up an IDE that you can use to first create interaction code, so how you interact with the page, so scrolling, clicking, et cetera, and then the parser code, which is what data to parse from the website itself. Now, this is in JavaScript, and this is a really powerful tool, but I'm not good at JavaScript and I'm also not really interested in building my own scrapers.
What I am interested in is using the data marketplace, where you can browse the already existing data sets that Bright Data has already provided. So this is really cool. We have categories like business and e-commerce and social media, and this is data that is very valuable to certain companies. And here are the reviews that the data set we are working on right now is from. So you don't have to be a developer, you don't have to be good at web scraping, you can also just turn to the already existing data sets. At the end of this video I'll show you how you can get started with Bright Data as well, and you can actually get 25 dollars in credits that you can use on the platform for free, so stick around for that.
So I think by now we should be able to go. We have the 100,000 records over here, and now we want to identify keywords. Here it's actually quite funny, because it is suggesting another service to come up with keywords, but we're just going to ask ChatGPT. So we're going to create keywords: for all of the criteria that we want to check the products for, we're going to come up with keywords that we're going to use. All right, so let's store these as well, so just plug them in here. All right, looking good. So these are the keywords that we're going to use to identify whether a review says something about noise cancellation, audio quality, or comfort.
Now let's check out step 3: create a function that will only return rows that contain any of the keywords in the dictionary we've defined. So that's basically what we want to do. This is actually a pretty clean function over here, and if we run this, it also runs very fast. And look, we have a result of 15,000 records of the 100,000 that contain any of these keywords, so that's actually really good, we've got 15,000 already. So here we limit the data set to only contain reviews where people say something specifically about noise cancellation, comfort, or audio quality, because those are the features we're interested in.
Okay, I'm actually quite curious what it can come up with. All right, and we have another cool function using TextBlob, which we can use to perform sentiment analysis and calculate the polarity and the subjectivity of the whole review. So let's take this and come back over here. All right, so that went quite fast, and now we can look at the columns that we have over here, and we have a polarity and a subjectivity. This is another common natural language processing technique to calculate the sentiment and the polarity of a given text, where a higher sentiment score in this case means a more positive review and a higher subjectivity score means a more subjective review.
All right, so now that we can calculate the sentiment scores for the individual reviews, let's ask ChatGPT to provide us with the final code to bring together all these scores per product and then calculate the winner. So we can copy this code, let's just put it in here. It looks like it's calculating the polarity again, so kind of redundant, but let's just see if this works. Okay, that is completed, so let's see what we got over here. We have products and, wow, okay, so it seems like for all of the categories, noise cancellation, comfort, and audio quality, it is now giving us polarity scores, where higher is better. So now we want to check. For example, this one is ranking really high on comfort. Let's see if we can sum this and then filter it to basically bring the highest score to the top.
And that is when I stumbled upon a bug within this code, so again, always make sure to thoroughly check the code that is provided by ChatGPT. We're defining these scores over here, but then we're looping over the data frame, selecting a product, and then eventually updating this dictionary over here with the product and the scores. But since we're doing this in a loop, and there are duplicate products within this data frame because each product can have many reviews, we're basically overwriting the score constantly and ending up with a result that only contains the scores of the last review. So I tweaked some of the lines within this function to correctly calculate the scores and then eventually also divide by the total number of reviews for that product, again to keep it fair.
But now, since I've had a little more time to run this code, I was able to run it for the whole data frame of all the 1.3 million records. After applying the keyword filtering, we ended up with a data set with over 200,000 reviews containing any of these keywords. So we are now going to use this updated function to calculate the polarity scores for all the relevant reviews containing the keywords. All right, and this is the final result. We have a data frame with 289 products in there, and we have a score for noise cancellation, for comfort, and for audio quality. We can see the total reviews, we can see the brand, and then we also have the URL.
So now all that's left to do is basically sum these values over here to calculate the total score and then sort the data frame to find out what the best headphones are. All right, so now we have a sorted data frame with the highest score on top over here, and we can start to check out what the best headphones are based on this analysis. But one thing that I see over here is that the one on top only has eight reviews, so to make it a little fairer, let's give this a filter and make sure that we have at least 50 reviews in there. Now, something else that I noticed is that there are actually many earbuds within this data, and I'm looking for headphones, so let's create one more filter to get rid of the earbuds. So let's apply a string contains filter, look for earbuds, and then invert the result.
All right, and now for the grand reveal. Let's check which headphones, using this method, score the highest on the polarity score using the keywords that we've determined for our criteria. All right, so we got the URL, let's open it up, and the winner is the elec circuits headphones. All right, okay, we can even choose some fancy colors. Yeah, nice, I love it. It's foldable, portable, HD mic for clear conversations, share the fun together, HD stereo, beautiful sound. This could be you. Seems like he's enjoying it.
Now please take these results with a grain of salt, because we have literally copied all of the steps that were provided by ChatGPT with only small adjustments, and we can probably come up with better ways to analyze this data and make a more accurate analysis. All right, let's actually have a quick look at the runners-up. So let's see the second one: another kids' headphones, again also really good reviews. Number three, and another kids' headphones. For 15 bucks you can't really go wrong, I would say. So maybe it's just parents over here that are really happy that they can get an affordable headphone for their kids with decent quality.
So the results are in, and now of course this was just me initially playing with this data set and seeing what ChatGPT could come up with. But by asking some very basic prompts, we were able to come up with a five-step plan to analyze this data set, and then we created functions for each of the five steps that ChatGPT provided us. And now again, please note that we've only considered the polarity of the filtered reviews based on the keywords, so we haven't looked at price and we haven't looked at the total number of reviews. Apart from some minor tweaks and some bugs, this was actually a really cool approach that you can build out to make a really thorough sentiment analysis. So I'm actually really happy with the results, actually quite surprised.
So in this video I showed you how you can get data in minutes using Bright Data and then use ChatGPT to generate the code, and you can do this basically for any data that is on the web. So if you want to play around with Bright Data as well and start collecting your own data sets, you can use the first link in the description to create an account, and then once you book a demo you can get 25 dollars in credits. This is a great way to get a unique data set for a data science portfolio project. So instead of using the Titanic or any other data set that basically everyone is using and recruiters are just not impressed by anymore, you can use Bright Data to get your own data set and then use ChatGPT to create your portfolio project. But you don't have to tell anyone that you're using ChatGPT, so you can just create really cool projects.
And speaking of portfolio projects, if you want to see how I tackle an entire machine learning project from start to finish, which you can follow along with as well, then check out this video next, where we create a fitness tracker using Python.
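The code generated in the video is not shown in the transcript, so the following is only a rough sketch of the first steps it describes: stripping the wrapping quotes, cleaning the text (lowercase, remove punctuation, tokenize, remove stop words), and keeping only reviews that mention one of the criteria. It makes several assumptions: the video uses NLTK, whereas this sketch uses only the Python standard library with a tiny hand-picked stop word set; the keyword lists are illustrative, not the ones ChatGPT produced; and all function names are hypothetical.

```python
import string

# Assumption: a tiny hand-picked stop word set standing in for NLTK's English list.
STOP_WORDS = {"a", "an", "and", "are", "for", "i", "is", "it", "of", "the", "they", "to", "very"}

# Assumption: illustrative keywords; the transcript does not list the ones used in the video.
KEYWORDS = {
    "noise_cancellation": ["noise", "cancellation", "cancelling", "anc"],
    "comfort": ["comfortable", "comfort", "lightweight"],
    "audio_quality": ["sound", "audio", "bass"],
}


def strip_quotes(value):
    """Remove one pair of wrapping double quotes, if present."""
    if isinstance(value, str) and len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        return value[1:-1]
    return value


def clean_text(text):
    """Lowercase, drop ASCII punctuation, split on whitespace, remove stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [token for token in no_punct.split() if token not in STOP_WORDS]


def matched_criteria(tokens):
    """Return the criteria whose keywords appear among the tokens."""
    token_set = set(tokens)
    return [name for name, words in KEYWORDS.items() if token_set & set(words)]


# Made-up example reviews, wrapped in double quotes like the raw data described in the video.
reviews = [
    '"The noise cancellation is great, and they are very comfortable."',
    '"Arrived two days late."',
]

cleaned = [clean_text(strip_quotes(review)) for review in reviews]
# Keep only reviews that mention at least one criterion.
relevant = [(tokens, matched_criteria(tokens)) for tokens in cleaned if matched_criteria(tokens)]
print(relevant)
```

On the full 1.3 million rows, validating on a random sample first, as the video does, is a sensible way to catch mistakes before paying for the slow full run.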
Original Description
Book a demo at Bright Data and get your own web datasets. As a special offer, you'll receive $25 in credits to start your data journey. Get started today: https://get.brightdata.com/datalumina.
In this video, I am going to analyze a massive dataset of 1,300,000 headphone reviews with Python. The dataset was scraped using the Bright Data platform, and the code was generated by ChatGPT. This way, I was able to complete this data science portfolio project in a couple of hours. We'll be using various natural language processing (NLP) techniques to perform a sentiment analysis on the reviews to determine the winner!
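As a rough illustration of how a winner could be determined, the sketch below accumulates polarity per product instead of overwriting it on every review (the bug described in the transcript), divides by the number of reviews per product, and then applies the minimum-review and earbuds filters before sorting. It is not the code from the video: the function names are hypothetical, the polarity values are made up rather than computed with TextBlob, and plain dictionaries stand in for the data frame.

```python
from collections import defaultdict

CRITERIA = ("noise_cancellation", "comfort", "audio_quality")


def score_products(reviews):
    """Average review polarity per product and criterion.

    Each review is (product, matched_criteria, polarity). Scores are summed
    across all reviews of a product, then divided by that product's review count.
    """
    totals = defaultdict(lambda: dict.fromkeys(CRITERIA, 0.0))
    counts = defaultdict(int)
    for product, criteria, polarity in reviews:
        product_totals = totals[product]
        counts[product] += 1
        for criterion in criteria:
            product_totals[criterion] += polarity  # accumulate, do not overwrite
    return {
        product: {
            **{c: totals[product][c] / counts[product] for c in CRITERIA},
            "total_reviews": counts[product],
        }
        for product in totals
    }


def rank_products(scores, min_reviews=50, exclude_term="earbuds"):
    """Sum the criterion scores, drop small or excluded products, sort descending."""
    rows = []
    for product, s in scores.items():
        if s["total_reviews"] < min_reviews or exclude_term in product.lower():
            continue
        rows.append((product, sum(s[c] for c in CRITERIA)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up reviews; polarity would come from a sentiment library in practice.
reviews = [
    ("Over-ear headphones A", ["comfort"], 0.5),
    ("Over-ear headphones A", ["comfort", "audio_quality"], 0.25),
    ("Wireless earbuds B", ["comfort"], 1.0),
    ("Wireless earbuds B", ["comfort"], 1.0),
    ("Over-ear headphones C", ["noise_cancellation"], 1.0),
    ("Over-ear headphones D", ["audio_quality"], 1.0),
    ("Over-ear headphones D", ["audio_quality"], 0.5),
]

scores = score_products(reviews)
# min_reviews is lowered to 2 only because the example data is tiny.
ranking = rank_products(scores, min_reviews=2)
print(ranking)
```

With the video's threshold of 50 reviews, none of these toy products would qualify, which is why the example call lowers `min_reviews`.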
Playlist
Uploads from Dave Ebbelaar · Dave Ebbelaar · 31 of 60