Python MD5 implementation

mCoding · Beginner ·⚡ Algorithms & Data Structures ·4y ago

Key Takeaways

The video discusses the MD5 hash algorithm, its implementation in Python, and its vulnerabilities to various attacks, including chosen prefix, second pre-image, and full pre-image attacks, highlighting its brokenness for cryptographic purposes.

Full Transcript

hello and welcome i'm james murphy and today we're talking about the md5 hashing algorithm md5 stands for message digest 5 and what it does is it takes in a message and outputs a summary called a digest of the message change the message by even one character and the digest is supposed to be completely different the message can be any length but the digest will always be exactly 128 bits which we think of as four 32-bit integers md5 is what's called a hash function which really just means it takes messages of any length to a fixed length but md5 wasn't intended to just be a hash function it was intended to be a secure hash function among other things you aren't supposed to be able to find inputs with a given hash or even two inputs with the same hash unfortunately md5 is not a secure hash function it's completely broken for cryptographic use but it's still important from a historical perspective so let's take a look at it see an implementation of it and then discuss some real world attacks against it but before we do do you know who's not going to use a broken hash function like md5 today's sponsor hosting her tri hostinger's premium shared hosting plan which has 99.9 guaranteed uptime one free domain and free ssl certificate and a 30-day money-back guarantee use coupon code m coding in all caps at checkout to get up to 91 off on all yearly plans okay so let's just dive into the code what i have here is an implementation of the md5 algorithm the first thing that we need to do is decide what does md5 take in and what does it spit out so what i decided is that we're taking in bytes and we're spitting out bytes the way that we're going to do this is basically we make a state object which is going to keep track of all the information of the algorithm we feed in whatever bytes that we have and then we're going to have to do some finalization and the reason for this is because the md5 works on blocks that are 512 bits long 64 bytes so if your input string was not the correct length then it has to get padded and that's going to happen in this finalization step so down here i have basically the same thing but instead i'm directly passing in a file handle we see that the md5 state object is keeping track of four things the most important one is this state variable at the end this state is going to be what is ultimately made into the digest the way that md5 works is it operates on chunks of 512 bits at a time so the first thing is that of course we need to take in exactly 64 bytes and then we convert those bytes to integers so we take the whole chunk and we convert 64 bytes into 16 integers then we start with the process of updating the state so we have four variables a b c d that make up our state variable then we're going to go through 64 times and basically mix in the message in a particular way so the general way that this works is for each of the 64 steps there is a mixer for that step that's going to mix up bits and then we're also going to mix up which of the bytes that we use from the message the 64 steps of the algorithm are split into four rounds of 16. in the first round we use the f mixer in the second round we use the g mixer third round h mixer fourth round i mixer so this mixer for step whatever is going to just tell you which of the fghi is going to be used for that step in the algorithm similarly we're not going to read the message in order and so in the first round you read the 16 integers in this order in the second round we read it in a different order and so we take the mixer for a given step and we take the you know index which integer do we want to read from the message at each step and we just put both of those into big lists so that for every step we know which int we're supposed to be reading out of this array and which mixing function we're supposed to use so for all 64 steps we pick the correct mixer we pick the correct index for the message that we're supposed to be using and then we just use this formula we take a and then we throw bc and d in with the mixer and then add that in and then we add the message at the specified index in and then we add in this sign randomness this is just another thing that's just thrown in just because the algorithm says two okay so back to the main loop we mix our bits we rotate our bits then we add a to b and then we rotate we take a b c d and you can see we take the d and we put it in front here so then we have d abc that's all there is to it after we mix all of our bits up in this way 64 times then whatever our state variables are now we add them to what they were at the beginning so how do we ensure that we actually do this for every 512 bits in the message well that's what the process function is for so here's what we do in a loop we try to read however many bytes it takes to fill up our buffer and make a full block size number of bytes if our buffer was empty before and then i read in a full block size then i just go ahead and call compress on the buffer and then increase the length otherwise i already had some bytes in my buffer or i didn't read enough bytes in order to fill up the buffer so what i do in that case is i note that you know i've now filled up more bytes and if i filled up now enough to do a compression if i filled up the whole buffer then i go ahead and compress and increase the length and then set the number of fill bytes back to zero otherwise i still haven't read enough bytes in order to do another compression so i just keep reading bytes until i get to the end of the stream but what happens when i get to the end of the stream and i've read some bytes but not enough to fill up the entire 512 bits to use a compression on it well then we use this finalize method to basically pad the end what we have to do is we append a one bit onto the end of the message and then we pad it with zeros in all cases we're going to pad with zeros and append the length of the message that we just read as a 64-bit integer onto the end of the message but if adding that length onto the end is going to overflow the block size then we actually need to do an extra compression so we might have to add on more zeros compress and then add on the length but in any case we're basically just padding with exactly enough zeros so that once we get to the end of the message we can add on our one bit our zero padding and then our eight bytes representing the length so that it comes out exactly to a multiple of 512 bits and we do one last compression once we finalize the message then we just need to output the digest or hex digest and the way this works is it's literally just these state variables whatever we updated them to that's now what we output as the digest so why do we say that md5 is a completely broken hash function at least for cryptographic purposes well it's susceptible to a number of attacks and for one here is a collision here's two different messages that have the exact same md5 hash i know they look very very similar and they are but they actually differ in two places zero zero zero two and d5 versus five five there are a bunch of different types of attacks against hash algorithms if you say you have a collision attack against a hashing algorithm then you're just saying you're in the situation like i showed before with these two messages where you just have two different messages that have the same hash and that's not necessarily that useful for an exploit what would be more useful to an attacker is something like a chosen prefix attack it's given this name because you can pick any two prefixes you want like this one says you know launch the nukes and this one says don't launch the nukes and then you can add on extra bytes different sets of bytes to those two messages and then whatever suffix you want after that such that those two things have the same hash already in the real world there have been several instances where either researchers or real hackers were able to use something like a chosen prefix attack on md5 one instance they had a microsoft code signature for one of their binaries and they had microsoft sign something that was completely innocuous but that completely innocuous thing had the same md5 hash as a virus so whenever you're allowed to control the prefixes like that that's the case where you can really start doing some damage okay so these last two attacks are just theoretical for md5 there's no publicly known actual implementation that can do either of these things so a second pre-image attack is basically you've got a file or message that already exists and you want to find another message with the same hash you can imagine you have an operating system a linux image that lots of people are downloading and a hacker could switch it out for a different image that has the same hash and a full pre-image attack is basically the same thing but you don't even need the file all you need is the hash of the file and that's all i've got thanks for making it to the end don't forget to like comment subscribe and check out our sponsor in the link below

Original Description

MD5 is important but broken hash algorithm. Try Hostinger: https://hostinger.com/mcoding Use coupon code MCODING at checkout for up to 91% off all yearly hosting plans! It was intended to be used for cryptographic purposes, but a plethora of attacks against it have been discovered over the years that bypass the security properties MD5 was thought to have. In this video we look at the description, Python implementation, and some types of attacks against the MD5 algorithm. ― mCoding with James Murphy (https://mcoding.io) Source code: https://github.com/mCodingLLC/VideosSampleCode MD5 definition (RFC 6151): https://datatracker.ietf.org/doc/html/rfc6151 MD5 wiki: https://en.wikipedia.org/wiki/MD5 Best public cryptanalysis of MD5: https://eprint.iacr.org/2013/170.pdf SUPPORT ME ⭐ --------------------------------------------------- Patreon: https://patreon.com/mCoding Paypal: https://www.paypal.com/donate/?hosted_button_id=VJY5SLZ8BJHEE Other donations: https://mcoding.io/donate Top patrons and donors: John M, Laura M, Pieter G, Vahnekie, Sigmanificient BE ACTIVE IN MY COMMUNITY 😄 --------------------------------------------------- Discord: https://discord.gg/Ye9yJtZQuN Github: https://github.com/mCodingLLC/ Reddit: https://www.reddit.com/r/mCoding/ Facebook: https://www.facebook.com/james.mcoding CHAPTERS --------------------------------------------------- 0:00 Intro 1:34 Implementation 7:55 Collision and attacks
Watch on YouTube ↗ (saves to browser)
Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from mCoding · mCoding · 38 of 60

1 Goodbye, List! Type hinting standard collections - New in Python 3.9
Goodbye, List! Type hinting standard collections - New in Python 3.9
mCoding
2 Python's comma equals ,= operator?
Python's comma equals ,= operator?
mCoding
3 Finding Primes in Python with the Sieve of Eratosthenes
Finding Primes in Python with the Sieve of Eratosthenes
mCoding
4 Find the First Missing Positive Int | Hard Interview Question on LeetCode
Find the First Missing Positive Int | Hard Interview Question on LeetCode
mCoding
5 JSON Tutorial Python | Basic Python Recipes
JSON Tutorial Python | Basic Python Recipes
mCoding
6 Simulating Brownian Motion in Python
Simulating Brownian Motion in Python
mCoding
7 The Single Most Useful Decorator in Python
The Single Most Useful Decorator in Python
mCoding
8 The Fastest Way to Loop in Python - An Unfortunate Truth
The Fastest Way to Loop in Python - An Unfortunate Truth
mCoding
9 Numpy Array Broadcasting In Python Explained
Numpy Array Broadcasting In Python Explained
mCoding
10 Brownian Motion Single Path Zoom
Brownian Motion Single Path Zoom
mCoding
11 Brownian Motion Fractal Zoom
Brownian Motion Fractal Zoom
mCoding
12 Magic Methods - Making Python builtins work with your classes
Magic Methods - Making Python builtins work with your classes
mCoding
13 50 Million Primes In 5 Seconds - Segmented Sieve of Eratosthenes
50 Million Primes In 5 Seconds - Segmented Sieve of Eratosthenes
mCoding
14 The Hottest New Feature Coming In Python 3.10 - Structural Pattern Matching / Match Statement
The Hottest New Feature Coming In Python 3.10 - Structural Pattern Matching / Match Statement
mCoding
15 How Fast is Python's Sort? Performance Testing
How Fast is Python's Sort? Performance Testing
mCoding
16 C++ First Missing Int, faster than 100%!
C++ First Missing Int, faster than 100%!
mCoding
17 [April Fools 2021] Python 4.0! New old print, mandatory static typing, StackOverflow integration
[April Fools 2021] Python 4.0! New old print, mandatory static typing, StackOverflow integration
mCoding
18 Python dataclasses will save you HOURS, also featuring attrs
Python dataclasses will save you HOURS, also featuring attrs
mCoding
19 C++ Sudoku Solver in 7 minutes using Recursive Backtracking
C++ Sudoku Solver in 7 minutes using Recursive Backtracking
mCoding
20 Every PROOF you've seen that .999... = 1 is WRONG
Every PROOF you've seen that .999... = 1 is WRONG
mCoding
21 Python's sharpest corner is ... plus equals? (+=)
Python's sharpest corner is ... plus equals? (+=)
mCoding
22 Binary Search - A Different Perspective | Python Algorithms
Binary Search - A Different Perspective | Python Algorithms
mCoding
23 The Best Way to Check for Optional Arguments in Python
The Best Way to Check for Optional Arguments in Python
mCoding
24 Local and Global Variable Lookup Weirdness in Python
Local and Global Variable Lookup Weirdness in Python
mCoding
25 Efficient Exponentiation
Efficient Exponentiation
mCoding
26 How To Install Python for Data Science
How To Install Python for Data Science
mCoding
27 0.1 + 0.2 is NOT 0.3 in Most Programming Languages
0.1 + 0.2 is NOT 0.3 in Most Programming Languages
mCoding
28 Python 3.10's new type hinting features
Python 3.10's new type hinting features
mCoding
29 Python 3.10's Quality of Life improvements
Python 3.10's Quality of Life improvements
mCoding
30 Introducing mZips! Python Zip and Zip Longest
Introducing mZips! Python Zip and Zip Longest
mCoding
31 Match statement tips
Match statement tips
mCoding
32 Using except: is a HUGE mistake
Using except: is a HUGE mistake
mCoding
33 Python + YouTube API | Automating descriptions
Python + YouTube API | Automating descriptions
mCoding
34 Anaphones, phonetic anagrams
Anaphones, phonetic anagrams
mCoding
35 Cracking passwords using ONLY response times | Secure Python
Cracking passwords using ONLY response times | Secure Python
mCoding
36 Python f-strings can do more than you thought. f'{val=}', f'{val!r}', f'{dt:%Y-%m-%d}'
Python f-strings can do more than you thought. f'{val=}', f'{val!r}', f'{dt:%Y-%m-%d}'
mCoding
37 Diagnose slow Python code. (Feat. async/await)
Diagnose slow Python code. (Feat. async/await)
mCoding
Python MD5 implementation
Python MD5 implementation
mCoding
39 Salting, peppering, and hashing passwords
Salting, peppering, and hashing passwords
mCoding
40 x to bool conversion in Python, C++, and C
x to bool conversion in Python, C++, and C
mCoding
41 You should put this in all your Python scripts | if __name__ == '__main__': ...
You should put this in all your Python scripts | if __name__ == '__main__': ...
mCoding
42 Find the Skyline Problem with C++ Solution Explained
Find the Skyline Problem with C++ Solution Explained
mCoding
43 The ONLY C keyword with no C++ equivalent
The ONLY C keyword with no C++ equivalent
mCoding
44 Should you use "not not x" instead of "bool(x)" in Python? (NO!)
Should you use "not not x" instead of "bool(x)" in Python? (NO!)
mCoding
45 Multiple Assignments in Python
Multiple Assignments in Python
mCoding
46 Why I don't like Python's chained comparisons
Why I don't like Python's chained comparisons
mCoding
47 Automated Testing in Python with pytest, tox, and GitHub Actions
Automated Testing in Python with pytest, tox, and GitHub Actions
mCoding
48 You can pip install directly from GitHub
You can pip install directly from GitHub
mCoding
49 __new__ vs __init__ in Python
__new__ vs __init__ in Python
mCoding
50 Metaclasses in Python
Metaclasses in Python
mCoding
51 The easy way to keep your repos tidy.
The easy way to keep your repos tidy.
mCoding
52 Which Python @dataclass is best? Feat. Pydantic, NamedTuple, attrs...
Which Python @dataclass is best? Feat. Pydantic, NamedTuple, attrs...
mCoding
53 Python __slots__ and object layout explained
Python __slots__ and object layout explained
mCoding
54 C++ cache locality and branch predictability
C++ cache locality and branch predictability
mCoding
55 Avoiding import loops in Python
Avoiding import loops in Python
mCoding
56 25 nooby Python habits you need to ditch
25 nooby Python habits you need to ditch
mCoding
57 Python staticmethod and classmethod
Python staticmethod and classmethod
mCoding
58 Building a Python app with Anvil to email me if my website goes down (includes paid features)
Building a Python app with Anvil to email me if my website goes down (includes paid features)
mCoding
59 31 nooby C++ habits you need to ditch
31 nooby C++ habits you need to ditch
mCoding
60 Interviewing the creator of C++, Bjarne Stroustrup
Interviewing the creator of C++, Bjarne Stroustrup
mCoding

This video teaches the implementation of the MD5 hash algorithm in Python and discusses its vulnerabilities to various attacks, making it unsuitable for cryptographic purposes. Viewers will learn about the mathematical concepts behind hashing algorithms and how to apply them in practice.

Key Takeaways
  1. Take in exactly 64 bytes and convert them to integers
  2. Update the state by mixing in the message in a particular way using a mixer for each step
  3. Use the f mixer in the first round, g mixer in the second round, h mixer in the third round, and i mixer in the fourth round
  4. Read bytes from the message and put them into big lists
  5. Pick the correct mixer and index for the message
  6. Use the formula to mix and rotate bits
  7. Add the message at the specified index
  8. Add sign randomness
💡 The MD5 hash algorithm is broken for cryptographic purposes due to its susceptibility to various attacks, including chosen prefix, second pre-image, and full pre-image attacks.

Related AI Lessons

Bloom Filters, Explained Properly
Learn how Bloom filters work and their benefits, including tiny memory and blazing speed, in exchange for potential false positives.
Dev.to · Daksh Gargas
Prefix Sums: The Preprocessing Trick That Makes Range Queries Instant
Learn how prefix sums enable instant range queries in arrays, boosting performance in various applications
Medium · Programming
I Thought I Was Ready for the Interview — Then One Simple Math Question Destroyed Me
A simple math question can destroy a developer's interview, highlighting the importance of being prepared for unexpected questions
Medium · Programming
Week 2(Day 10): LeetCode Two Pointers(slow & fast): Remove Duplicates from Sorted Array (Brute…
Learn to remove duplicates from a sorted array using the two pointers technique, improving from brute force to optimized solutions
Medium · Python

Chapters (3)

Intro
1:34 Implementation
7:55 Collision and attacks
Up next
Stump Grinder Carbide Wheel Grinds Hardwood To Chips
Innoforge Studio
Watch →