Scraping Web Datasets for Data Science Projects

Dave Ebbelaar · Intermediate · 🛠️ AI Tools & Apps · 3y ago

Skills: Prompt Craft 90%, LLM Foundations 80%, Advanced Prompting 70%

Key Takeaways

The video demonstrates how to obtain scraped web datasets for data science projects using Bright Data and analyze the data with Python, using ChatGPT to generate the code and libraries such as nltk and TextBlob for natural language processing and sentiment analysis.

Full Transcript

We are going to use a massive 1.3-million-record dataset that was collected automatically using Bright Data, and then we're going to use ChatGPT to generate code for us to analyze that dataset. In one of my last videos I showed you how you can automate machine learning using ChatGPT, but we all know that data scientists spend around 80 percent of their time on gathering, cleaning, and preparing data for modeling. So I wanted to figure out what we can do if we combine the powerful data collection features from Bright Data with ChatGPT to generate the code to analyze the data, because this way we could potentially do a whole data science portfolio project in just a few hours.

I found that they have a very large dataset with Amazon product reviews. I am currently in the market for new headphones to use at my other office, and I thought it would be interesting to see if we can leverage data science techniques to figure out the best headphones for me based on my requirements and preferences. So I reached out to Bright Data and asked if they could provide me with a dataset of headphone product reviews. They were happy to collaborate with me on this video and provided me with a massive dataset of 1.3 million headphone reviews. This is a dataset with real reviews that were scraped from Amazon using Bright Data, and in total there are 308 different headsets within this dataset, most of them having thousands of reviews. That is really exciting, and that is what we're going to dive into in this video.

So let's see if we can load this up into a data frame. It seems to go just fine, even though it's a pretty massive dataset. We have a timestamp, product name, and rating, and we also have a review over here, so this is in text. One thing that I noticed when I was looking at this dataset is that most of the values in there have quotes around them as well, so each one is essentially captured as a string with extra double quotes, and we basically want to get rid of that. So let's see if we can ask ChatGPT to write a function for us, and let's see what it can do. Certainly, okay, here's a function, remove quotes. Oh look, I think it did the job. That is awesome: first job successfully completed.

Now that the data is in the correct format, we can actually start to play around with this. Again, the goal is to find the perfect headphones for me, and I basically have three criteria that I find important for my headphones. First of all, I want really good noise cancellation. Second, I want them to be comfortable to wear all day. And third, I want them to have really good audio quality. So let's see if we can ask ChatGPT to extract useful information from the review text column. Okay, let's see what it can come up with. All right, so it's giving us this very thorough description of what we can do, and it already started to mention something about NLP, so we use natural language processing.

Okay, so ChatGPT provided us with a five-step plan that we can follow to analyze the reviews. Let's now see if we can turn those steps into Python code. I've found that if you're using ChatGPT for coding, you just have to be really specific and make it obvious what it is that you want. Instead of just asking it to do all of this, we're first going to ask: can you provide a Python function for step one, to clean the text? All right, and here's the output. It seems like a really neat function that we can use to perform basic text cleaning using the nltk library, also removing stop words.

Let's see if we can apply this function to the whole column with all of the 1.3 million records. Okay, this is actually taking quite some time. That is working, but it took way too long, so it's probably not the best idea to try this out on the 1.3 million records. Let's first create a subset to validate our code, and we'll randomly take 10,000 records. Yeah, okay, so we got it: we convert to lowercase, we remove punctuation, and then we tokenize the text and remove stop words. So here we are preparing the text with this function so we can analyze it later. This is very common in natural language processing, but normally it takes a lot of time to engineer a function like this. Let's make it 100,000, and now this will take some time.

While we're waiting for this to finish, let me show you how you can use Bright Data as well to collect your data, because this is actually really cool. You can do this in two ways, which I will get into in a bit: you can use their datasets or their Web Scraper IDE. How this works is that they use a very large proxy network around the world to target websites from different proxies, because if you've ever tried to scrape websites with Python, for example, you will run into sites that are going to block you.

Like I said, you can use Bright Data in two ways: you can use the datasets that are already available, or you can use the Web Scraper IDE. If we go to the products over here, you basically have the option, so I can build a scraper over here and start from a template, so YouTube, or let's go with Amazon in this case. It will fire up an IDE that you can use to first create interaction code, so how you interact with the page (scrolling, clicking, etc.), and then the parser code, which is what data to parse from the website itself. Now this is in JavaScript, and this is a really powerful tool, but I'm not good at JavaScript and I'm also not really interested in building my own scrapers.

What I am interested in is using the data marketplace, where you can browse the already existing datasets that Bright Data has already provided. This is really cool: we have categories like business, e-commerce, and social media, and this is data that is very valuable to certain companies. And here are the reviews that the dataset we are working on right now is from. So you don't have to be a developer, you don't have to be good at web scraping; you can also just turn to the already existing datasets. At the end of this video I'll show you how you can get started with Bright Data as well, and you can actually get 25 dollars in credits that you can use on the platform for free, so stick around for that.

I think by now we should be able to go, so we have the 100,000 records over here. Now we want to identify keywords, and here it's actually quite funny, because it is suggesting another service to come up with keywords, but we're just going to ask ChatGPT. So for all of the criteria that we want to check the products for, we're going to come up with keywords that we're going to use. All right, so let's store these as well, so just plug them in here. All right, looking good. These are the keywords that we're going to use to identify whether a review says something about noise cancellation, audio quality, or comfort.

Now let's check out step 3: create a function that will only return rows that contain any of the keywords in the dictionary we've defined. That's basically what we want to do. This is actually a pretty clean function over here, and if we run it, it also runs very fast. And look, we have a result of 15,000 records out of the 100,000 that contain any of these keywords, so that's actually really good. So we got 15,000 already. Here we limit the dataset to only contain reviews where people say something specifically about noise cancellation, comfort, or audio quality, because those are the features we're interested in.

Okay, I'm actually quite curious what it can come up with. All right, we have another cool function using TextBlob, which we can use to perform sentiment analysis and calculate the polarity and the subjectivity of the whole review. So let's take this and come back over here. All right, so that went quite fast, and now we can look at the columns that we have over here, and we have a polarity and a subjectivity. This is another common natural language processing technique to calculate the sentiment and the polarity of a given text, where a higher polarity score in this case means a more positive review and a higher subjectivity score means a more subjective review.

Now that we can calculate the sentiment scores for the individual reviews, let's ask ChatGPT to provide us with the final code to bring together all these scores per product and then calculate the winner. So we can copy this code and just put it in here. It looks like it's calculating the polarity again, so that's kind of redundant, but let's just see if this works. Okay, that is completed, so let's see what we got over here. We have products, and wow, okay, so it seems like for all of the categories (noise cancellation, comfort, and audio quality) it is now giving us polarity scores, where higher is better. So now we want to check. For example, this one is ranking really high on comfort. Let's see if we can sum this and then sort it to bring the highest score to the top.

And that is when I stumbled upon a bug within this code, so again, always make sure to thoroughly check the code that is provided by ChatGPT. We're defining these scores over here, but then we're looping over the data frame, selecting a product, and eventually updating this dictionary with the product and the scores. But since we're doing this in a loop, and there are duplicate products within this data frame because each product can have many reviews, we're basically overwriting the score constantly and ending up with a result that only contains the scores of the last review. So I tweaked some of the lines within this function to correctly calculate the scores and then also divide by the total number of reviews for that product, again to keep it fair.

Now, since I've had a little more time to run this code, I was able to run it for the whole data frame of all 1.3 million records. After applying the keyword filtering, we ended up with a dataset with over 200,000 reviews containing any of these keywords. So we are now going to use this updated function to calculate the polarity scores for all the relevant reviews containing the keywords. All right, and this is the final result. We have a data frame with 289 products in there, and we have a score for noise cancellation, for comfort, and for audio quality. We can see the total reviews, we can see the brand, and then we also have the URL. So all that's left to do is sum these values to calculate the total score and then sort the data frame to find out what the best headphones are.

All right, so now we have sorted the data frame with the highest score on top, and we can start to check out what the best headphones are based on this analysis. But one thing that I see over here is that the one on top only has eight reviews, so to make it a little fairer, let's give this a filter and make sure that we have at least 50 reviews in there. Something else that I noticed is that there are actually many earbuds within this data, and I'm looking for headphones, so let's create one more filter to get rid of the earbuds. Let's apply a string-contains filter, look for earbuds, and then invert the result.

All right, and now for the grand reveal. Let's check which headphones score the highest on the polarity score using this method, with the keywords that we've determined for our criteria. All right, so we got the URL, let's open it up, and the winner is the elec circuits headphones. All right, okay, we can even choose some fancy colors. Yeah, nice, I love it. It's foldable, portable, HD mic for clear conversations, share the fun together, HD stereo, beautiful sound. This could be you; seems like he's enjoying it. Now please take these results with a little grain of salt, because we have literally copied all of the steps that were provided by ChatGPT with only small adjustments, and we can probably come up with better ways to analyze this data and make a more accurate analysis.

All right, let's actually have a quick look at the runners-up. Let's see the second one: another kids' headphones, again with really good reviews. Number three, and another kids' headphones. For 15 bucks you can't really go wrong, I would say. So maybe it's just parents over here that are really happy that they can get affordable headphones for their kids with decent quality.

So the results are in. Of course, this was just me initially playing with this dataset and seeing what ChatGPT could come up with, but by asking some very basic prompts we were able to come up with a five-step plan to analyze this dataset, and then we created functions for each of the five steps that ChatGPT provided. Again, please note that we've only considered the polarity of the filtered reviews based on the keywords, so we haven't looked at price and we haven't looked at the total number of reviews. Apart from some minor tweaks and some bugs, this was actually a really cool approach that you can build out to make a really thorough sentiment analysis. So I'm actually really happy with the results, and quite surprised.

In this video I showed you how you can get data in minutes using Bright Data and then use ChatGPT to generate the code, and you can do this for basically any data that is on the web. If you want to play around with Bright Data and start collecting your own datasets, you can use the first link in the description to create an account, and then once you book a demo you can get 25 dollars in credits. This is a great way to get a unique dataset for a data science portfolio project. So instead of using the Titanic or any other dataset that basically everyone is using and recruiters are just not impressed by anymore, you can use Bright Data to get your own dataset and then use ChatGPT to create your portfolio project. But you don't have to tell anyone that you're using GPT, so you can just create really cool projects.

And speaking of portfolio projects, if you want to see how I tackle an entire machine learning project from start to finish, which you can follow along with as well, then check out this video next, where we create a fitness tracker using Python.
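
The overwriting bug described in the transcript, and the fix of accumulating scores and dividing by the number of reviews, can be sketched roughly as follows. The code from the video is not shown in the transcript, so this is only an illustration under stated assumptions: it uses plain Python instead of pandas, the tuple layout (product, category, polarity), the sample values, and all function names are hypothetical, and each review is assumed to have already been matched to a category and given a polarity score.

```python
from collections import defaultdict

# Hypothetical sample data: (product, category, polarity) per matching review.
# In the video the polarity comes from TextBlob; these values are made up.
reviews = [
    ("Headphones A", "comfort", 0.8),
    ("Headphones A", "comfort", 0.2),
    ("Headphones A", "audio_quality", 0.5),
    ("Earbuds B", "comfort", 0.9),
    ("Headphones C", "noise_cancellation", -0.4),
    ("Headphones C", "noise_cancellation", 0.6),
]


def buggy_scores(rows):
    """Reproduces the bug: every review overwrites the previous entry."""
    scores = {}
    for product, category, polarity in rows:
        scores[product] = {category: polarity}  # overwrite, not accumulate
    return scores


def average_scores(rows):
    """Accumulates polarity per product and category, then averages it."""
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for product, category, polarity in rows:
        totals[product][category] += polarity
        counts[product][category] += 1
    return {
        product: {
            category: totals[product][category] / counts[product][category]
            for category in totals[product]
        }
        for product in totals
    }


def rank(rows, min_reviews=1, exclude="earbuds"):
    """Sums the category averages per product and sorts, highest first."""
    review_counts = defaultdict(int)
    for product, _, _ in rows:
        review_counts[product] += 1
    ranked = [
        (product, sum(categories.values()))
        for product, categories in average_scores(rows).items()
        if review_counts[product] >= min_reviews
        and exclude not in product.lower()
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


buggy = buggy_scores(reviews)    # keeps only the last review per product
fixed = average_scores(reviews)  # averages over all reviews per product
ranking = rank(reviews, min_reviews=2)
```

With the sample data, the buggy version retains only the last review seen for each product, while the averaged version uses all of them. The transcript applies a threshold of at least 50 reviews; the sketch uses 2 only because the sample is tiny.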

Original Description

Book a demo at Bright Data and get your own web datasets. As a special offer, you'll receive $25 in credits to start your data journey. Get started today: https://get.brightdata.com/datalumina. In this video, I am going to analyze a massive dataset of 1,300,000 headphone reviews with Python. The dataset was scraped using the Bright Data platform, and the code was generated by ChatGPT. This way, I was able to complete this data science portfolio project in a couple of hours. We'll be using various natural language processing (NLP) techniques to perform a sentiment analysis on the reviews to determine the winner!


This video teaches how to collect and analyze large web datasets using Bright Data and Python, with a focus on natural language processing and sentiment analysis. The instructor demonstrates how to use LLMs like ChatGPT to generate code and perform tasks like keyword extraction and sentiment analysis.
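
The keyword filtering step mentioned here could look roughly like the sketch below. The actual keywords and function from the video are not listed in the transcript, so the keyword dictionary, the sample reviews, and the function names are all hypothetical, and the sketch works on a plain list of strings rather than a pandas data frame. It matches whole words with regular expressions, which is one possible choice and not necessarily what the generated code did.

```python
import re

# Hypothetical keyword dictionary, one list per criterion from the video.
KEYWORDS = {
    "noise_cancellation": ["noise cancellation", "noise cancelling"],
    "comfort": ["comfortable", "comfort"],
    "audio_quality": ["sound quality", "audio quality", "bass"],
}


def compile_patterns(keywords):
    """Builds one case-insensitive, whole-word pattern per category."""
    return {
        category: re.compile(
            r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b",
            re.IGNORECASE,
        )
        for category, terms in keywords.items()
    }


PATTERNS = compile_patterns(KEYWORDS)


def matched_categories(text):
    """Returns the categories whose keywords appear in the review text."""
    return [category for category, pattern in PATTERNS.items() if pattern.search(text)]


def filter_reviews(reviews):
    """Keeps only the reviews that mention at least one keyword."""
    return [review for review in reviews if matched_categories(review)]


sample = [
    "Great bass and very comfortable to wear all day.",
    "Arrived late and the box was damaged.",
    "The noise cancelling works well on the train.",
]
kept = filter_reviews(sample)  # drops the review about shipping
```

Whole-word matching avoids false hits from short keywords inside longer words, at the cost of missing variants that are not listed explicitly.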

Key Takeaways

Collect web data using Bright Data
Load data into a pandas dataframe
Remove quotes from review text column
Extract useful information using NLP
Create a subset of records for validation
Prepare text data by converting to lowercase and removing punctuation
Use ChatGPT to generate keywords for identifying specific product features
Create a function to return rows containing generated keywords
Use TextBlob for sentiment analysis and polarity calculation
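
The quote removal, sampling, and text preparation steps in the list above can be sketched as follows. This is a dependency-free approximation, not the ChatGPT-generated code from the video: the video uses the nltk library for tokenizing and stop words, whereas this sketch swaps in a whitespace split and a tiny hand-picked stop word set, and every name in it (remove_quotes, sample_records, clean_text, STOP_WORDS) is hypothetical.

```python
import random
import string

# A tiny hand-picked stop word set, standing in for nltk's English list.
STOP_WORDS = {"a", "an", "and", "are", "is", "it", "the", "they", "this", "to", "very"}


def remove_quotes(value):
    """Strips surrounding double quotes from strings; other values pass through."""
    if isinstance(value, str):
        return value.strip('"')
    return value


def sample_records(records, n, seed=42):
    """Takes a reproducible random subset to validate code before a full run."""
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))


def clean_text(text):
    """Lowercases, removes punctuation, splits on whitespace, drops stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [token for token in tokens if token not in STOP_WORDS]


raw = '"The noise cancellation is GREAT, and they are very comfortable!"'
tokens = clean_text(remove_quotes(raw))
subset = sample_records(list(range(100)), 10)
```

Running the cleaning on a small random subset first, as the transcript does with 10,000 and then 100,000 records, makes it cheap to validate the function before applying it to the full dataset.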

💡 The video highlights the importance of using LLMs like ChatGPT to automate tasks like code generation and keyword extraction, and demonstrates how to apply these tools to real-world data analysis problems.
