Python MD5 implementation
Skills:
ML Maths Basics80%
Key Takeaways
The video discusses the MD5 hash algorithm, its implementation in Python, and its vulnerabilities to various attacks, including chosen prefix, second pre-image, and full pre-image attacks, highlighting its brokenness for cryptographic purposes.
Full Transcript
hello and welcome i'm james murphy and today we're talking about the md5 hashing algorithm md5 stands for message digest 5 and what it does is it takes in a message and outputs a summary called a digest of the message change the message by even one character and the digest is supposed to be completely different the message can be any length but the digest will always be exactly 128 bits which we think of as four 32-bit integers md5 is what's called a hash function which really just means it takes messages of any length to a fixed length but md5 wasn't intended to just be a hash function it was intended to be a secure hash function among other things you aren't supposed to be able to find inputs with a given hash or even two inputs with the same hash unfortunately md5 is not a secure hash function it's completely broken for cryptographic use but it's still important from a historical perspective so let's take a look at it see an implementation of it and then discuss some real world attacks against it but before we do do you know who's not going to use a broken hash function like md5 today's sponsor hosting her tri hostinger's premium shared hosting plan which has 99.9 guaranteed uptime one free domain and free ssl certificate and a 30-day money-back guarantee use coupon code m coding in all caps at checkout to get up to 91 off on all yearly plans okay so let's just dive into the code what i have here is an implementation of the md5 algorithm the first thing that we need to do is decide what does md5 take in and what does it spit out so what i decided is that we're taking in bytes and we're spitting out bytes the way that we're going to do this is basically we make a state object which is going to keep track of all the information of the algorithm we feed in whatever bytes that we have and then we're going to have to do some finalization and the reason for this is because the md5 works on blocks that are 512 bits long 64 bytes so if your input string was not the correct length then it has to get padded and that's going to happen in this finalization step so down here i have basically the same thing but instead i'm directly passing in a file handle we see that the md5 state object is keeping track of four things the most important one is this state variable at the end this state is going to be what is ultimately made into the digest the way that md5 works is it operates on chunks of 512 bits at a time so the first thing is that of course we need to take in exactly 64 bytes and then we convert those bytes to integers so we take the whole chunk and we convert 64 bytes into 16 integers then we start with the process of updating the state so we have four variables a b c d that make up our state variable then we're going to go through 64 times and basically mix in the message in a particular way so the general way that this works is for each of the 64 steps there is a mixer for that step that's going to mix up bits and then we're also going to mix up which of the bytes that we use from the message the 64 steps of the algorithm are split into four rounds of 16. in the first round we use the f mixer in the second round we use the g mixer third round h mixer fourth round i mixer so this mixer for step whatever is going to just tell you which of the fghi is going to be used for that step in the algorithm similarly we're not going to read the message in order and so in the first round you read the 16 integers in this order in the second round we read it in a different order and so we take the mixer for a given step and we take the you know index which integer do we want to read from the message at each step and we just put both of those into big lists so that for every step we know which int we're supposed to be reading out of this array and which mixing function we're supposed to use so for all 64 steps we pick the correct mixer we pick the correct index for the message that we're supposed to be using and then we just use this formula we take a and then we throw bc and d in with the mixer and then add that in and then we add the message at the specified index in and then we add in this sign randomness this is just another thing that's just thrown in just because the algorithm says two okay so back to the main loop we mix our bits we rotate our bits then we add a to b and then we rotate we take a b c d and you can see we take the d and we put it in front here so then we have d abc that's all there is to it after we mix all of our bits up in this way 64 times then whatever our state variables are now we add them to what they were at the beginning so how do we ensure that we actually do this for every 512 bits in the message well that's what the process function is for so here's what we do in a loop we try to read however many bytes it takes to fill up our buffer and make a full block size number of bytes if our buffer was empty before and then i read in a full block size then i just go ahead and call compress on the buffer and then increase the length otherwise i already had some bytes in my buffer or i didn't read enough bytes in order to fill up the buffer so what i do in that case is i note that you know i've now filled up more bytes and if i filled up now enough to do a compression if i filled up the whole buffer then i go ahead and compress and increase the length and then set the number of fill bytes back to zero otherwise i still haven't read enough bytes in order to do another compression so i just keep reading bytes until i get to the end of the stream but what happens when i get to the end of the stream and i've read some bytes but not enough to fill up the entire 512 bits to use a compression on it well then we use this finalize method to basically pad the end what we have to do is we append a one bit onto the end of the message and then we pad it with zeros in all cases we're going to pad with zeros and append the length of the message that we just read as a 64-bit integer onto the end of the message but if adding that length onto the end is going to overflow the block size then we actually need to do an extra compression so we might have to add on more zeros compress and then add on the length but in any case we're basically just padding with exactly enough zeros so that once we get to the end of the message we can add on our one bit our zero padding and then our eight bytes representing the length so that it comes out exactly to a multiple of 512 bits and we do one last compression once we finalize the message then we just need to output the digest or hex digest and the way this works is it's literally just these state variables whatever we updated them to that's now what we output as the digest so why do we say that md5 is a completely broken hash function at least for cryptographic purposes well it's susceptible to a number of attacks and for one here is a collision here's two different messages that have the exact same md5 hash i know they look very very similar and they are but they actually differ in two places zero zero zero two and d5 versus five five there are a bunch of different types of attacks against hash algorithms if you say you have a collision attack against a hashing algorithm then you're just saying you're in the situation like i showed before with these two messages where you just have two different messages that have the same hash and that's not necessarily that useful for an exploit what would be more useful to an attacker is something like a chosen prefix attack it's given this name because you can pick any two prefixes you want like this one says you know launch the nukes and this one says don't launch the nukes and then you can add on extra bytes different sets of bytes to those two messages and then whatever suffix you want after that such that those two things have the same hash already in the real world there have been several instances where either researchers or real hackers were able to use something like a chosen prefix attack on md5 one instance they had a microsoft code signature for one of their binaries and they had microsoft sign something that was completely innocuous but that completely innocuous thing had the same md5 hash as a virus so whenever you're allowed to control the prefixes like that that's the case where you can really start doing some damage okay so these last two attacks are just theoretical for md5 there's no publicly known actual implementation that can do either of these things so a second pre-image attack is basically you've got a file or message that already exists and you want to find another message with the same hash you can imagine you have an operating system a linux image that lots of people are downloading and a hacker could switch it out for a different image that has the same hash and a full pre-image attack is basically the same thing but you don't even need the file all you need is the hash of the file and that's all i've got thanks for making it to the end don't forget to like comment subscribe and check out our sponsor in the link below
Original Description
MD5 is important but broken hash algorithm.
Try Hostinger: https://hostinger.com/mcoding
Use coupon code MCODING at checkout for up to 91% off all yearly hosting plans!
It was intended to be used for cryptographic purposes, but a plethora of attacks against it have been discovered over the years that bypass the security properties MD5 was thought to have.
In this video we look at the description, Python implementation, and some types of attacks against the MD5 algorithm.
― mCoding with James Murphy (https://mcoding.io)
Source code: https://github.com/mCodingLLC/VideosSampleCode
MD5 definition (RFC 6151): https://datatracker.ietf.org/doc/html/rfc6151
MD5 wiki: https://en.wikipedia.org/wiki/MD5
Best public cryptanalysis of MD5: https://eprint.iacr.org/2013/170.pdf
SUPPORT ME ⭐
---------------------------------------------------
Patreon: https://patreon.com/mCoding
Paypal: https://www.paypal.com/donate/?hosted_button_id=VJY5SLZ8BJHEE
Other donations: https://mcoding.io/donate
Top patrons and donors:
John M, Laura M, Pieter G, Vahnekie, Sigmanificient
BE ACTIVE IN MY COMMUNITY 😄
---------------------------------------------------
Discord: https://discord.gg/Ye9yJtZQuN
Github: https://github.com/mCodingLLC/
Reddit: https://www.reddit.com/r/mCoding/
Facebook: https://www.facebook.com/james.mcoding
CHAPTERS
---------------------------------------------------
0:00 Intro
1:34 Implementation
7:55 Collision and attacks
Watch on YouTube ↗
(saves to browser)
Sign in to unlock AI tutor explanation · ⚡30
Playlist
Uploads from mCoding · mCoding · 38 of 60
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
▶
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
Goodbye, List! Type hinting standard collections - New in Python 3.9
mCoding
Python's comma equals ,= operator?
mCoding
Finding Primes in Python with the Sieve of Eratosthenes
mCoding
Find the First Missing Positive Int | Hard Interview Question on LeetCode
mCoding
JSON Tutorial Python | Basic Python Recipes
mCoding
Simulating Brownian Motion in Python
mCoding
The Single Most Useful Decorator in Python
mCoding
The Fastest Way to Loop in Python - An Unfortunate Truth
mCoding
Numpy Array Broadcasting In Python Explained
mCoding
Brownian Motion Single Path Zoom
mCoding
Brownian Motion Fractal Zoom
mCoding
Magic Methods - Making Python builtins work with your classes
mCoding
50 Million Primes In 5 Seconds - Segmented Sieve of Eratosthenes
mCoding
The Hottest New Feature Coming In Python 3.10 - Structural Pattern Matching / Match Statement
mCoding
How Fast is Python's Sort? Performance Testing
mCoding
C++ First Missing Int, faster than 100%!
mCoding
[April Fools 2021] Python 4.0! New old print, mandatory static typing, StackOverflow integration
mCoding
Python dataclasses will save you HOURS, also featuring attrs
mCoding
C++ Sudoku Solver in 7 minutes using Recursive Backtracking
mCoding
Every PROOF you've seen that .999... = 1 is WRONG
mCoding
Python's sharpest corner is ... plus equals? (+=)
mCoding
Binary Search - A Different Perspective | Python Algorithms
mCoding
The Best Way to Check for Optional Arguments in Python
mCoding
Local and Global Variable Lookup Weirdness in Python
mCoding
Efficient Exponentiation
mCoding
How To Install Python for Data Science
mCoding
0.1 + 0.2 is NOT 0.3 in Most Programming Languages
mCoding
Python 3.10's new type hinting features
mCoding
Python 3.10's Quality of Life improvements
mCoding
Introducing mZips! Python Zip and Zip Longest
mCoding
Match statement tips
mCoding
Using except: is a HUGE mistake
mCoding
Python + YouTube API | Automating descriptions
mCoding
Anaphones, phonetic anagrams
mCoding
Cracking passwords using ONLY response times | Secure Python
mCoding
Python f-strings can do more than you thought. f'{val=}', f'{val!r}', f'{dt:%Y-%m-%d}'
mCoding
Diagnose slow Python code. (Feat. async/await)
mCoding
Python MD5 implementation
mCoding
Salting, peppering, and hashing passwords
mCoding
x to bool conversion in Python, C++, and C
mCoding
You should put this in all your Python scripts | if __name__ == '__main__': ...
mCoding
Find the Skyline Problem with C++ Solution Explained
mCoding
The ONLY C keyword with no C++ equivalent
mCoding
Should you use "not not x" instead of "bool(x)" in Python? (NO!)
mCoding
Multiple Assignments in Python
mCoding
Why I don't like Python's chained comparisons
mCoding
Automated Testing in Python with pytest, tox, and GitHub Actions
mCoding
You can pip install directly from GitHub
mCoding
__new__ vs __init__ in Python
mCoding
Metaclasses in Python
mCoding
The easy way to keep your repos tidy.
mCoding
Which Python @dataclass is best? Feat. Pydantic, NamedTuple, attrs...
mCoding
Python __slots__ and object layout explained
mCoding
C++ cache locality and branch predictability
mCoding
Avoiding import loops in Python
mCoding
25 nooby Python habits you need to ditch
mCoding
Python staticmethod and classmethod
mCoding
Building a Python app with Anvil to email me if my website goes down (includes paid features)
mCoding
31 nooby C++ habits you need to ditch
mCoding
Interviewing the creator of C++, Bjarne Stroustrup
mCoding
More on: ML Maths Basics
View skill →Related AI Lessons
⚡
⚡
⚡
⚡
Bloom Filters, Explained Properly
Dev.to · Daksh Gargas
Prefix Sums: The Preprocessing Trick That Makes Range Queries Instant
Medium · Programming
I Thought I Was Ready for the Interview — Then One Simple Math Question Destroyed Me
Medium · Programming
Week 2(Day 10): LeetCode Two Pointers(slow & fast): Remove Duplicates from Sorted Array (Brute…
Medium · Python
Chapters (3)
Intro
1:34
Implementation
7:55
Collision and attacks
🎓
Tutor Explanation
DeepCamp AI