Web Scraping with Node.js
Key Takeaways
Performs web scraping using Node.js and libraries like request-promise and cheerio
Full Transcript
Hi. In this tutorial I'll go over how to do web scraping using Node. I was inspired by an article I read, "Client-side web scraping with JavaScript using jQuery and Regex," and I wanted to continue it and make it even better. Instead of client-side web scraping, I decided to use Node so I don't have to deal with cross-origin issues. When you're using Node, you can get information from basically anywhere and it's not going to give you those errors.
In the article, what she does is get information from the freeCodeCamp website and find the number of challenges completed. I thought it would be interesting to combine that with web scraping on a different freeCodeCamp site. On the freeCodeCamp forum there's a users page, and it's basically a leaderboard: it ranks users by how many hearts they've received. So I thought it'd be interesting to compare how many hearts they've received with how many challenges they've passed on freeCodeCamp, just to see if there's any correlation between how far they've gotten into the curriculum and the number of likes they've received on the forum. You don't have to be in the curriculum to be on the forum, so there may not be any correlation, but I wanted to use some web scraping to find out.
The first thing we're going to do is go into Sublime Text and require some dependencies. We'll write const rp equals require, and the first one is called request-promise. This makes it easier to make Ajax requests to other websites. We also have to run npm install request-promise. While that's installing, I'll add another line: const cheerio equals require, and it's called cheerio. Cheerio makes it so you can use syntax similar to jQuery
from within Node, so it'll be easier to navigate the DOM and things like that when we're doing the web scraping. Let's get that installed: npm install cheerio. The last thing is const Table. This one is just supposed to make it easier to format our information. For this project we're not going to show anything in a web browser; it's just going to display the results right in the console, and cli-table makes that easier. So I'll run npm install cli-table and let that install.
Now we need to set up some variables. One of them is let users, which is just an empty array of the users.
Let's start our first web scrape. First we have to set our options; this is something for request-promise. The first option is the URL we're going to get the data from, and we still have to figure that out. If we go over to this list, we have to figure out how to get all this information.
Here's one of the problems: a lot of websites render the page with JavaScript. If I go to View Page Source and search for, say, "quincy," because that was one of the people on there, you can see it doesn't find that name anywhere, even though we can see the name Quincy right here on the page. It's not in the source because the page renders all of this with JavaScript after it loads. When you make Ajax requests, or requests to other websites, the JavaScript isn't going to run, so we can't just load this page from Node.js and expect to see all this content. What we're going to do instead is try to find out whether there's an API that supplies all this information. I'll press Command+Option+J (I guess I didn't need the J), but I'm just
trying to open the developer tools here, and I'm going to go to the Network tab. I'll refresh one more time, and you can see all the different calls the page makes to get its information. Let's go up to the top; I'm trying to figure out if there's a list of users. If I go to this directory items one, you can see directory items and a few other things. If I click the down arrow, look, this is the list of users. See the likes received: it starts with 88, 51, 34, and you can see over here on the page 88, 51, 34. If I expand one of these, you'll see user, and we can actually get the usernames here. So what we're going to do is use the request to get this information from the API.
Then you can see the username. I'm going to copy this username really quick. If we go to freecodecamp.org and I put the username on the end, we get the freeCodeCamp profile for that user. The user on the forum doesn't necessarily have the same username as on the freecodecamp.org website, but in many cases they do, so we can use that username to find the number of challenges passed. To find that number, we're going to have to count up every item on this list, but we'll get to that in a second.
Right now we just need to get this link, so I'm going to copy the link address. Now I'll go over here, and for url I'll put this in quotes, and I'll zoom out a little so we can see the whole link. Then I'm going to add json: true. That means the result gets parsed as JSON for us, which makes the data a little easier to deal with. Now we're going to call rp and pass in the options, where rp is request-promise. That's going to make an Ajax request and
return a promise. Since it returns a promise, we can add a .then, so once the promise is resolved, meaning once it's been able to get the information from that website, it's going to do something. Let me pop this down to the next line, and then we'll figure out what it's going to do. It's going to give us some information, and I'm going to put all of it in a variable called data, and we'll make this an arrow function.
Let's set up some variables: let promises, and let userData, which is going to be an array of the user data. Now we're going to do for let user of data.directory_items, so it's getting directory items right from this data. If you look at the page over here, you can see that the top level of the results is directory items, so we have to get into that. Then I'm going to do userData.push to put something onto that array, which is basically the data for the user. The name is user.user.username. I figured out it would be user.user.username because, if you look in this data, each item has a user, and the user has the username right there. I'm also going to set likes_received to user.likes_received. I'll add a semicolon, and that's all we're going to do in this for loop; we've pushed all the data onto the array.
Now that I'm putting this in, I just realized I don't actually need this array of promises. When I was first figuring out this code I tried a few different things, including one that used it; there was a time when I was going to create a promise for each user, but then I decided to go about it a different way. So I'm going to take that out.
Now, just for logging, so you can tell something's happening, I'm going to
log something. Instead of console.log, I'm going to use process.stdout.write and just put "loading". One thing this does differently from console.log is that it doesn't add a new line at the end, so you can put everything on the same line.
Now we're going to call another function, which I'll create in just a second: getChallengesCompletedAndPushToUserArray. It's a long function name, but it describes what's happening: it's going to get all the challenges completed and then push them to the users array, where the users array is right up here. I'll add some semicolons, and then I'm going to add a .catch, just to catch any errors that happen from this Ajax call. I'll write this little function here, and we'll just console.log the error if there is one.
Now I'm going to create that function. I'll copy the name so I don't have to type it again. So I'll put function, with userData still passed in, and let's define it. This is where it goes through each user and makes another request for each user on the list. There are a few ways to do this, but I want the requests to happen in order, so that the hearts stay in order like this; I want the data to be in the same order as it is on the page. When you make separate requests, they're asynchronous, so you don't know which result is going to come back first. Let me show you how I'm going to keep everything in order. You may be able to find a better way, but this is one of the first things I came up with. I'm going to do var i equals zero; this is basically a counter variable. Then I'm going to create a function called next, and we're going to keep calling this function; it's going to
use some recursion here. If i is less than userData.length, meaning we haven't gotten to the end of the list, we set up a new request. So options equals, and remember this is just like the one up here, where we have a url and whether we want to do anything with the JSON. I'll put url, and if we go back over here we can see that it's just freecodecamp.org, a slash, and then the username. So I'll copy that and paste it in, freecodecamp.org slash, and this is where I add a plus and userData[i].name, where i is the index we're looping through. Actually, I'm going to change this to use backticks; that's the best practice now.
I'm not going to do anything with JSON this time. Instead I'm going to use transform, with a function that takes the body. This is where we use the cheerio we brought in: cheerio.load(body). That lets us navigate the HTML kind of like we would with jQuery.
Now we make the actual request: rp, for request-promise, pass in the options, and then a .then function. I'm going to put the result into a variable that's just a dollar sign, inspired by jQuery. And so we know every time something's loaded, I'm going to do process.stdout.write, and then we'll know when a user is being loaded.
One thing I also want to check for: if we go back to this list of users, some of these don't have the same username on freeCodeCamp. If I copy this username and put it at the end of my URL, it says it couldn't find a page for that username. I want my results to show whether we're actually getting data from the freeCodeCamp user page or not. The way I figured out to do
that was just to check for a particular element. If I inspect this, you can see it's landing-heading, so I'm going to see if that exists. I'll do const fccAccount, which is going to end up being a boolean, equals, and this is where we look for that tag, h1.landing-heading, and then I put .length == 0. If the length is zero, the heading doesn't exist on the page, so this will be true if they do have an account and false if they don't have an account with that username.
Next is const challengesPassed, to figure out how many challenges they've passed. The code for this I got right from the article; I made it pretty similar to how they got the number of challenges there. First I want to see if they have an account, so this is going to be a ternary operator. If they do have an account, we can find out how many challenges they've passed: we look for tbody tr and take .length. If not, we set challengesPassed to the string "unknown". We don't know how many challenges they've passed; maybe they use a different username on freecodecamp.org.
The reason this works: if I go back to this page and inspect it, you can see we have a tr inside a tbody, so it's tbody tr. Each tr is its own row, so basically we're counting the rows, and the number of rows is the number of challenges the person has passed. When you're doing web scraping, you often have to go into the code and figure out exactly what you need to count to get the actual data. We could have just taken this number right here, but that number includes more than the challenges passed; they can also get a higher number by helping people in the chat room,
but I just want the number of challenges passed, so I want to count every row in this table. It also includes the projects: if I inspect these, you'll see they're tbody tr too, so it's going to include the projects they've passed as well. It includes everything on here except the headings. So counting every row in these tables gives you the number. Again, when you're web scraping you have to look at the code a lot to figure out the best way to do it. Luckily this page is not rendered with JavaScript; we get a page that has all of that in the HTML, which is why we're able to do this.
We got the results, but we want to push them to a table, so let's set up the table options, because we're going to make a table that appears in the console. I'm not going to put it there; I'll put it right down here: let table equals new Table, which is from the cli-table package we brought in. It needs specific options. We have head, which is what the headings of the table are going to be. The first column is going to be username, and the second column is going to be a heart. If I press Control+Command+Space, that brings up the emoji picker if you're on a Mac, so I'll put in a heart there, and then challenges. I use that emoji because the column is how many hearts they've received. I also have to set the column widths, so I'll put colWidths. I had already experimented with this and found that 15, 5, and 10 are good column widths.
Now let me go back over here: table.push, and I'm going to push on an array of this information. It's going to be userData[i].name, for whatever index we're on, then userData[i].likes_received, and challengesPassed. That's the information for the different columns, and then I'm going to
just increment i, and then I'm going to put return next. It's going to keep running this function: while i is less than userData.length, it does all this, getting the next user's information. Eventually it will run next and i will not be less than userData.length, so we're not going to go through this and run the return next. That's our base case. For the base case I'm going to put else printData, so it calls this other function that prints the data that's been collected.
Let me go down here. We also have to call next to begin with, so down here after the function next, I'm going to put return next to call that function and start things off.
Now we're getting close to the end. We're going to define our printData function, so I'll put function printData. I'm going to put a console.log. We've been printing to the same line with process.stdout.write, so we want to get onto a new line; we'll do a console.log just to move to the new line, and I'm going to use this check box here. Then I'll do another console.log and print table.toString(), which prints out our whole table in the console.
I probably have some kind of errors here, so I'm going to try running this and see if it works. I'll go over to my console and run node index.js. Let's see: unexpected identifier. What did I do wrong there? Oh, I just need a comma at the end of this line. Now let's see what happens: throw err, cannot find module request. Okay, let me just run npm install request; it looks like we just need the request module. Now let's try it again. It did something; you can see it has the word "loading" here, but then the program seemed to end right
away, so I had to figure out why. Oh, okay, I see what the problem is: I forgot to put the parentheses here, so it never actually called that function to get things started. Now let's try it. Okay, unhandled, it says cheerio is not defined. Oh, I spelled cheerio wrong here. I'll save that and go back to my console again. Okay, the dots mean it's actually making these calls, the requests to the website to get the data, so this does seem to be working. Let's just wait until this finishes.
And here's the table. Let me scroll up so you can see it: it loaded everything, and we've got username, the heart, and challenges. You can see that the emoji kind of messes up the spacing, but that's okay. Here are the hearts: 88, 51, 34, and if we go over to the forum page you can see the same numbers, 88, 51, 34. As for checking that the challenge counts are right, you can see Quincy has fewer, so take Quincy Larson: if you go over here and put that username in, the way to check is to go through and count every row. I'm not going to do that in this video; I've done it before to check, and it was right, but feel free to do it if you want to check. You could also go and count 291, but it's a lot easier to count the smaller ones just to make sure it's calculating correctly.
So now we have the number of hearts on the forum next to the number of challenges passed on the website, and we can see how they're related. When you actually look through this, it doesn't seem like there's a major correlation. This person has only four hearts but has passed 403 challenges. A lot of these people have passed a lot of challenges but don't have very many hearts. And the person with the fewest challenges passed is Quincy Larson, but he's in the number three or four spot for
number of hearts. There are actually quite a few unknowns here. It makes me wonder whether these people actually use freecodecamp.org or are just participating on the forum. Most likely they do use the main freeCodeCamp site and just have a different username, but maybe not.
So web scraping is a great way to combine information from two different sites to get the exact information you're looking for. I'm sure some of you watching have even better ideas of how to do this, so if you figure out a better way to do some of these things, put your idea in the comments on this video so everyone else can see it. I love to learn new ways to do things, and it'll be great to see what other people are doing with web scraping.
That's it. My name is Beau Carnes. Thanks for watching, don't forget to subscribe, and remember: use your code for good.
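The one-request-at-a-time pattern walked through in the transcript (a counter, a function called next that calls itself, and a base case once the list is exhausted) can be sketched roughly as below. This is an illustration under stated assumptions, not the code from the video: fetchChallenges is a hypothetical stand-in that fakes the per-user profile request with a timer, and collectInOrder, sampleUsers, and the fake fields are names invented for the example. It uses no third-party packages, so it runs with plain Node.

```javascript
// Hypothetical stand-in for the per-user profile request: resolves with a
// made-up challenge count after a delay that differs per user.
function fetchChallenges(user) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(user.fakeCount), user.fakeDelayMs);
  });
}

// Visit users one at a time: each request starts only after the previous
// one has resolved, so rows come out in the same order as userData.
function collectInOrder(userData, fetchOne) {
  const rows = [];
  let i = 0; // counter variable, as in the walkthrough

  function next() {
    if (i < userData.length) {
      const user = userData[i];
      return fetchOne(user).then((challengesPassed) => {
        rows.push([user.name, user.likes_received, challengesPassed]);
        i += 1;
        return next(); // recurse to handle the next user
      });
    }
    return Promise.resolve(rows); // base case: every user has been handled
  }

  return next(); // note the parentheses: without them nothing runs
}

// Sample data (made up for the example); the first user is the slowest.
const sampleUsers = [
  { name: 'userA', likes_received: 88, fakeCount: 120, fakeDelayMs: 30 },
  { name: 'userB', likes_received: 51, fakeCount: 403, fakeDelayMs: 5 },
  { name: 'userC', likes_received: 34, fakeCount: 'unknown', fakeDelayMs: 15 },
];

collectInOrder(sampleUsers, fetchChallenges).then((rows) => {
  rows.forEach((row) => console.log(row.join(' | ')));
});
```

Because each request waits for the one before it, the total time is the sum of all the delays. If running the requests in parallel is acceptable, Promise.all also returns its results in the order of the input array regardless of which request finishes first, which may be one of the "better ways" the transcript invites.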
Original Description
Learn how to do basic web scraping using Node.js in this tutorial. The request-promise and cheerio libraries are used.
💻 Github: https://github.com/beaucarnes/fcc-project-tutorials/blob/master/node-web-scraping/index.js
🔗 Article on client-side web scraping : https://medium.freecodecamp.org/client-side-web-scraping-with-javascript-using-jquery-and-regex-5b57a271cb86
🐦 Beau Carnes on Twitter: https://twitter.com/carnesbeau
---
Learn to code for free and get a developer job: https://www.freecodecamp.com
Read hundreds of articles on technology: https://medium.freecodecamp.com
And subscribe for new programming videos every day: https://youtube.com/subscription_center?add_user=freecodecamp
❤️ Support for this channel comes from our friends at Scrimba – the coding platform that's reinvented interactive learning: https://scrimba.com/freecodecamp