Requests-HTML - Checking out a new HTML parsing library for Python
Key Takeaways
Explores the Requests-HTML library for parsing HTML in Python
Full Transcript
What's going on everybody, and welcome to a new tutorial, slash some coverage of a new package called requests-html. It's basically a way for you to quickly and easily parse some HTML. It's written by the same person who wrote the Requests library, so my expectations are high, but this is a very new package. It's been on GitHub for about a month, maybe a little over, so I would expect some rough edges. But let's check it out.

First of all, to install it, let's scroll down to the bottom: you just pip install requests-html. You will need Python 3.6 or, I'm assuming, later, but right now the latest official release of Python is 3.6. So make sure you install it, and then let's go ahead and check it out.

To check it out, I'm going to be using the following web page, at least to start: pythonprogramming.net/parsememcparseface. Basically it's got some tags: a list, a table, an image. It's also inside of div tags, just so you can poke around. It's got a JavaScript test, so if the JavaScript loads it says "Look at you shinin!", and before it loads the default is "y u bad tho?", I think. Then we've got some pre tag information in here with the Zen of Python, we've got a link down here, and then we've got some goofy-looking characters that could throw you off. This is what we're going to start off by parsing.

Coming back over here, it looks pretty simple. I'm just going to copy and paste this and remove the interactive interpreter prompts, so it should be that simple. You do a session.get, and I'm going to pass in my own URL. Feel free to use something else if you want. I use this page because it's a page where I can control what happens to it. It seems like every time I do a tutorial on anybody else's pages or with anybody else's API, they get deprecated or they change, so this one should be a rock.
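The fetch step described here can be sketched roughly as follows. This is a minimal sketch, not the code typed in the video: fetch_html is a hypothetical helper name, and it only assumes that the session object behaves like requests-html's HTMLSession (a get() that returns a response carrying an html attribute).

```python
def fetch_html(session, url):
    """Fetch `url` with the given session and return the parsed HTML object.

    `session` is expected to behave like requests_html.HTMLSession: its
    get() returns a response with raise_for_status() and an .html attribute.
    (fetch_html itself is a hypothetical helper, not part of the library.)
    """
    response = session.get(url)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.html


# Usage (requires `pip install requests-html` and network access):
#   from requests_html import HTMLSession
#   html = fetch_html(HTMLSession(), "https://pythonprogramming.net/parsememcparseface/")
#   print(html.links)
```

Passing the session in, rather than creating it inside the function, keeps one session reusable across several requests.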
Now, that starts your session, we get some stuff, and then we can start actually interacting with it. Again, I don't really know everything about this package. Generally when I check out a package... I'm not sure if he has a documentation page. It doesn't look like he does. Oh, maybe he does; let's see what this looks like. It's pretty much the same thing, for HTML. This is interesting: pagination. That's more magical than I would have expected. That's incredible if it really is that simple to do pagination. Wow, that's not written up here. So that's interesting, to say the least. You make the request and then you can just iterate over it. That might work really well with Reddit; I'm curious to know if it works well on other websites. We'll have to test that, maybe at the very end.

You can use it without Requests, so if you had downloaded some HTML documents you could then parse them. Yeah, this has a lot more explanation. Okay, cool. That pagination is just icing on the cake. I would think you'd have to find the link to the next page and then go to the link. Sorry, that's just interesting.

Back to the simplified version of the docs, though. Once we do the get, we can just start referencing it, I'm pretty sure, as long as it's not JavaScript. So first of all let's print dir(r), just to see all the things we could do with it. Let me pull this up. Immediately on the response we could check the encoding, and there's apparent_encoding; we can close the request; there are cookies and encoding. I'm not sure what the difference between encoding and apparent_encoding is. We can check the headers and history, we can reference the HTML, and we can check if it was a redirect. That's interesting: it automatically follows redirects, but you can find out if you've
been redirected. That's interesting. There's json; I'm not sure, but I wonder whether, if it's some sort of API and the response is JSON, you could call .json() and get a JSON object. There's links, which I'm guessing is all the links, and one that applies if it's paginated, apparently, which is fascinating. It would also be cool if it wasn't just pagination: I wonder if you could make some sort of crawler that automatically goes to any link and just keeps linking around. For anybody that's trying to build a web crawler, that's kind of an annoying thing to have to build. Okay, cool. So we've got text, url, and then html.

Let's check out .html. I guess we'll print out the dir of .html as well. We've got text, raw_html, search (so we could search the HTML, I'm assuming), and links, to find all of the links. For example, let's print r.html.links and see what that looks like. It looks like it gives us a set. I guess it makes sense to give a set, just in case there are multiples of the same link. I'm guessing that's a set; we could check the type, but it looks like a set. So it gives us a set of the links.

What else could we print out? We've got r.html.html. I think it's all formatted and beautiful for us. So what's the difference between html and raw_html? Oh, okay, that gives us the newlines and all the tabs and spacing and stuff like that. I'm not sure exactly why you would need that, but okay.

Let's check out find. It looks like you can find by ID. Let me see if we have IDs here. I know we have at least the ID yesnojs; I wonder if we can find that one. I'm just going to copy and paste: r.html.find with that ID, and print it out. Ah, darn it, I see. We probably need .text; I'm just guessing.
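The find-then-.text step can be sketched like this. It is a minimal sketch under stated assumptions: text_of is a hypothetical helper, the html argument is assumed to behave like the r.html object from requests-html (whose find accepts a CSS selector and a first flag), and the id yesnojs is the one heard in the video rather than something verified against the page.

```python
def text_of(html, selector, default=None):
    """Return the text of the first element matching a CSS selector.

    `html` is expected to behave like r.html from requests-html:
    find(selector, first=True) returns a single element, or None when
    nothing matches. (text_of is a hypothetical helper name.)
    """
    element = html.find(selector, first=True)
    # Printing the element itself shows its repr; .text is the content.
    return element.text if element is not None else default


# Usage, assuming `r = session.get(...)` from requests-html:
#   text_of(r.html, "#yesnojs")  # "#" selects by id (id name assumed from the video)
#   text_of(r.html, "td")        # a bare tag name selects by tag
```

Returning a default when nothing matches avoids an AttributeError on None, which is an easy mistake when a selector is slightly off.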
Nice. Okay, so that's how we could find it. I don't like what it's saying to me, but that's how we can find it. Which brings me to the next curiosity I have, which is the JavaScript rendering. So what's happening here? First of all, let's look for, I don't know, jstest. Let's copy that, paste it here, pull this up, and save. The whole point of this element was to test whether we could actually load JavaScript when we parse things, since that was part of an older tutorial. But now with requests-html it's apparently super simple, so let's check that out. If we scroll through the docs, it looks like basically all you have to do is use .render(). So let's paste that in right after the request.

With JavaScript, the immediate return will be the base page, including the actual JavaScript code. To render JavaScript you actually need a browser that runs it. That's how JavaScript works: you make a request to the server, the server just responds with text, and then your browser sees that here's some script and runs it. So in order to read elements inside tags that are updated by JavaScript, you need to render the page, and you also have to wait a moment until it's done rendering. That's kind of a tedious thing to build yourself, but apparently it's in here and ready to go. So let's run it and see if it updates.

On the first run it looks like you have to install Chromium. It says it's downloading and that the download may take a few minutes. I hope it doesn't; I don't really want to wait that long. Maybe I'll pause while it goes. Oh, I think it's done. Okay, it's done, and it's just telling you where it extracted to. And we had an error: "Unable to remove temporary user data". Interesting. It did work, though. This is the text right here that we were looking for. If we just kind of
make some space here, since the error's in our way... It's probably a permissions thing, honestly. Anyway, that's what we were looking for, and you can see it matches if we come over here. So it definitely works, but for some reason we're getting an error when we attempt to remove the user data.

Let's open up this file, the launcher. In Sublime it's really nice: you can just double-click on the line in the error and check it out. Okay, so in this internal method we're trying to clean up some data, and they're using shutil.rmtree. It's trying to remove a directory; that's how you remove a directory with contents. We saw where it went: it goes into the Users folder, and then, let me see if I can find it. Yeah, here it is, and if we go in there we can see the contents, but for some reason we're unable to delete them. I'm going to guess it's a permissions error.

One way we can confirm that is to not ignore errors, so we'll set ignore_errors to False. Just in case anybody's wondering why you would ever ignore the errors: I forget exactly, but I think that if you don't ignore errors, one of the errors is that the directory has contents, and that'll stop it from working. If you ignore errors it does work, but we also don't get to see the error. So let's go ahead and see, because things aren't working. Yeah, "Access is denied". Once shutil does work for you, you actually do want to set ignore_errors to True, so let me set this to True again.

So we're seeing "Access is denied"; we just don't have the permissions. I'm kind of curious, so I want to open up an administrator prompt to see if I can get past that. Let me pause for a moment, open it up, head to where we're working, and see if this works from Command Prompt as administrator. Okay, let's give it a shot. We're administrator. Wow, it still can't remove it. We
did get the info here, though. But it's going to break any script that's running, too, which is quite the bummer, at least right now. The show must go on, so I'm going to continue. As a short-term fix... why wouldn't that work? I'm pretty sure that if you were on Linux and you used sudo, this wouldn't cause any problems for you. Maybe it's the location; I'm not really sure. I really can't figure out why that won't delete. If somebody has a better idea, go for it. Otherwise, temporarily, one option we have is to just not raise the IOError. Instead we could just print "unable to remove". We don't actually need to delete that directory; we just would have liked to, to keep things clean. All right, let's try again.

Cool. So now it'll still print out that something went wrong, but we'd be able to continue. Actually, I'm not even sure... were we already moving along? That's kind of weird, because I think the other print actually was working, so we were moving along. Maybe it wasn't; I thought there was space being made. It doesn't make any sense to me why, if it raised that error, it would continue working. Wow, it actually does. Okay, magic. I guess we could leave the IOError there, to be honest, because we were able to continue moving along. If that's bothering you, though, you can get rid of it.

Let's see what else we could do. I'm curious to parse that table. Also, let's try to parse finance.yahoo.com; that sounds like a fun one. I'm pretty sure they block automated requests now. What we can do is go to Amazon. One of the things I used to do is parse from here: if we go to the statistics page, there are all these statistics. For a while they were just... I'm not even sure how they did it, to be honest, but it just wasn't with JavaScript, but
now it is with JavaScript. These values are not static; they're updated via JavaScript. Now I'm kind of curious how they did it before. I guess if you used something like Jinja formatting, you wouldn't necessarily use JavaScript to update those values, so maybe that's it. Anyway, let's head to this page.

What I'm curious about now is whether, if we don't pass first=True, find will give us a list. For example, I'm going to go into the source code and search for "Forward P/E". It looks like we could look for any table data, so what if we wanted to find all table data? Probably we have to use a tag, is my guess. Let's see what the docs have here: text, links, okay, find('a'), so that would be any anchor. So I wonder if we could just do find('td') for table data. I think you can use .find on any element. So let's try this. Let's do r.html.render(), and then print r.find for all the table data. Let me just see what we get, because I think we'd actually need each table cell's text, probably. I'm not really sure. We're still going to see that stupid message.

It's complaining about the HTMLResponse object, so probably it needs to be r.html. That's weird. All right, yeah, so it needs to be r.html.find; I'm pretty sure you need to go through html. Yeah, cool. Wow, that's super neat. So you can get all the table data.

What's super annoying with the table data on Yahoo is that each cell has, I'm pretty sure, a unique class... actually, they're all the same class. I wonder if we can search for the class specifically, too, in r.html.find. That would be interesting to know. I'm sure you can. Let's
see. Well, you have class at least there. Where's that link to the full documentation? Oh, right here. Cool. Let's see if we can find it: first=True, text, attributes... I wonder if we can find specific classes. I'm sure there's a way; I'm just trying to figure out what the parameter would be. Wait, could we say td, dot, and then the class? Let's just try. What about the spaces, though? Hmm, I don't think this is going to work. I don't have high hopes for this one. I think it would find them if the class didn't have spaces, but let's just see what happens. Yeah, no. And now I don't even know what the other error was, because we ran out of space, apparently. Oh, maybe this is it: there just wasn't anything. It looks like the parentheses threw it off. Bummer. I'm going to go ahead and move along from that one, but that's interesting. We're probably going beyond what this was intended to be used for.

Let me see if any of these have a class real quick. Maybe that dot was for the ID; I'm not sure. Let's go back to parsememcparseface. Just curious: I want to find all the div tags with that class. We really don't need to render anymore, so that way we won't have to see that error. Yeah, cool. So if you wanted to find all the elements of a specific class, you would use the dot.

Okay, I think that's enough for now; this is probably getting pretty long. That's just me poking around with the library. If you have any questions, feel free to share, and if you've found some cool things about it, you can share that too.

I'm also pretty curious about the Reddit thing. I'll take its word for it. I'm trying to think of something we could search real quick to try the pagination. I just want to see that work on everything; I just don't understand how it could know what the pagination was. I just can't think of something that
I could search real quick that has pagination, but I wouldn't mind poking around with that. I'm going to guess it works at least on Reddit, but how do they know about the pagination? "Intelligent pagination support (always improving)", so I'm guessing it doesn't always work. Let's try Hacker News; that's got some pagination to it. So what if we go to news.ycombinator.com? Let's see how they did it. They're just using r.html.next, so what if I print r.html.next? It says the HTML object has no attribute next. Let's check Reddit real quick and see if that one works. That's kind of weird. No, that one doesn't work either. Well, then I'm not sure.

Let's see if this one works: for html in r.html. I'm just trying to do the example that they've got posted. Whoops, what did I do wrong? r = session.get, with Reddit, and then these two lines. There it goes. So it gets the initial page, and it definitely is getting the pages. I guess each time, it goes to that page to get the next one, which is interesting. So it definitely knows that pagination.

Let's try again on Hacker News, because the next attribute doesn't appear to be working. Oh, it started to work, then failed. We got to page two and then it went to CNBC, so I think it just picked the wrong link. We started here, it found page two, which was smart, and then somehow it got stuck on CNBC. Not sure why. Okay, the pagination sounded too good to be true, and I think it is, but hopefully over time it will improve, and that'll be pretty cool.

Anyway, I think now is a good time to stop. I mostly ran into the edges of this package, but for simple HTML parsing this clearly is super useful. You won't need to use BeautifulSoup and stuff like that. But at least we found some issues with the JavaScript
thing: it just can't seem to delete that temporary directory, so eventually you'd probably have to go in there and delete it manually. Otherwise, I'm trying to think of what else. Obviously the iteration didn't really work everywhere, but the rendering actually worked, and I'm not sure you would hit that issue on Linux, so just the fact that I'm on Windows could be responsible for it. And the pagination, I mean, come on, that's really hard. I'm surprised that they're even trying to do that, but cool, at least it works on Reddit. Too bad Reddit has an API; I'd probably rather use that than paginate through the pages. Other than that, I think that's it. The search is super useful, super quick, super simple. So, pretty cool package. If you need to parse HTML, I think this would be the way that I would suggest you go. If you've got questions, comments, concerns, whatever, feel free to leave them below. Otherwise, I will see you guys in another video.
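The workaround for the temporary-directory error (print a message instead of raising) can be sketched with the standard library alone. This is a minimal sketch, not the launcher's actual code: remove_tree_reporting is a hypothetical name, and the demo directory stands in for the browser's temporary user data. Note that shutil.rmtree removes a directory together with its contents whether or not errors are ignored; ignore_errors only decides whether a failure such as "Access is denied" is raised or silently dropped.

```python
import shutil
import tempfile
from pathlib import Path


def remove_tree_reporting(path):
    """Try to delete a directory tree; report failure instead of raising.

    Returns True if the tree was removed, False otherwise.
    (Hypothetical helper, illustrating the print-instead-of-raise idea.)
    """
    try:
        shutil.rmtree(path)  # ignore_errors defaults to False, so failures raise OSError
    except OSError as exc:   # covers PermissionError as well as a missing path
        print(f"Unable to remove {path}: {exc}")
        return False
    return True


# Demo on a throwaway directory that has contents.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "profile.txt").write_text("temporary user data")
removed = remove_tree_reporting(demo_dir)
print(removed, demo_dir.exists())
```

Catching the exception and printing keeps the failure visible, which neither ignore_errors=True (silent) nor an uncaught exception (script stops) does.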
Original Description
Checking out a new HTML-parsing library by the author of Requests:
Requests-HTML github: https://github.com/kennethreitz/requests-html
Sample parsing page: https://pythonprogramming.net/parsememcparseface/
Chat with us on Discord: https://goo.gl/Q9euv3
Support the content: https://pythonprogramming.net/support-donate/
Twitter: https://twitter.com/sentdex
Facebook: https://www.facebook.com/pythonprogramming.net/
Twitch: https://www.twitch.tv/sentdex
G+: https://plus.google.com/+sentdex