File I/O With Memory Mapping Using Python mmap

Real Python · Intermediate ·💻 AI-Assisted Coding ·4y ago

Key Takeaways

The video demonstrates the use of Python's mmap library for memory-mapped file input and output, providing significant performance improvements in code that requires a lot of file I/O, with examples and comparisons to vanilla file operations.

Full Transcript

Welcome to Python mmap: Doing File I/O With Memory Mapping. My name is Christopher, and I will be your guide. This course is about the mmap library, a wrapper to a fairly low-level operating system call that maps the contents of a file into memory. In this course, you will learn about how to map a file on disk into a memory block and why you might want to do that, reading and writing to and from said memory block, and how to use this same library for sharing memory between processes.

A quick note on versions: all code demonstrated here was tested with Python 3.10 on macOS. If you've taken one of my courses before, you'll know that I don't usually bother telling you about the operating system. In this case, the mmap library has some operating-system-specific variations, so my OS is a bit more important. Don't worry, you'll be able to follow along either way. I mostly stick to the stuff that is common to all operating systems and will point out the differences between Unix-like worlds and Windows worlds when they're important. mmap has been around for a long time and pretty much mimics the underlying call in C that it is based upon, so that whole Python 3.10 thing isn't too important. I do use f-strings in some demo code, but otherwise this could go back to the dawn of Python time.

mmap is based on a very low-level call to your operating system, which maps directly to a call in a C library for system memory management. Its purpose is to map the contents of a file on disk into a block of memory so that anything you do to that block of memory is reflected in the file. That's an oversimplification. There are multiple modes of doing work, but the primary purpose typically is: write to the block of memory, have it reflected in the file.

Why would you do this? In the Python world, usually when you're mucking about with a file, you're loading it into some sort of Python object representation. This is typically either a string, a byte buffer, or something similar. Direct memory mapping is a lot closer to the
underlying file. There's no intermediate representation. This often means you can get a performance boost. It usually means less memory, because the Python object requires multiple copies of things going into memory. For example, the file might get buffered when it is read before being put into the Python object. And in most cases you may also get a speed improvement, since this is tied so close to an OS call. The performance boost in memory usage and speed is tied directly to the OS. That means the boost you get may be different than the boost I get, or worse, might be different between subsequent calls due to things like caching. In addition to all of this, the mmap library can also be used to share memory between processes.

Next up, I'll give a bit of background into the inner workings of your computer and how that affects memory and file I/O. If CPUs, memory, and files are your bread and butter, feel free to skip this one.

In the previous lesson, I gave an overview of the course. In this lesson, I'll cover background on memory and file I/O in your computer and all the stuff you'll be glad your operating system abstracts away for you.

First off, let's talk about performance. Your computer can be broken down into four basic concepts: the CPU, where the computation takes place; the memory, which acts like a kind of chalkboard that temporarily holds things that are too big to fit in the CPU at a time; a storage unit, like a disk drive; and, not pictured here, an interface to peripherals like your video card. There are huge performance differences between each of these areas. I know I exaggerated the "huge" in that last sentence, but it truly is kind of mind-blowing.

See that dot? That dot represents a nanosecond. That's one billionth of a second. To try and imagine that: light travels a whole 30 centimeters, or just about a foot, in that amount of time. Your CPU is so fast that a middling Intel i7 from a few years ago could comfortably do a hundred instructions in that time. Now, instructions in a CPU are basic
building blocks, so that's not 100 lines of code, more like 100 multiplications, but that's still a lot. Your CPU has little tiny memory things inside of it called registers. This is where it stores stuff it is operating on. But there aren't a lot of registers in most CPUs, and so there is a constant shuffle to memory to get stuff to fill the registers. Just accessing a spot in memory takes about 100 nanoseconds. The dot above is a single pixel. This line is 100 pixels long. You see what I'm doing? Instead of just accessing memory, let's read a bunch of it. Grabbing a megabyte of data takes about 3,000 nanoseconds. That's 3,000 pixels of lines there. This is something every programmer should have a basic understanding about: going out to memory is significantly more expensive than doing something directly on the processor.

And you ain't seen nothing yet. Let's take a thousand nanoseconds, that's a bunch of those red lines, and compress them down into that tiny yellow dot. I'm back to a single pixel. You thought going out to memory was slow? Well, going to disk is much worse. Yep, you read that right: 825,000 nanoseconds to read that same megabyte block from disk. That's a three-order-of-magnitude difference. It isn't relevant to this course, but there is worse: the network is even worse. Locality in computing is tied to huge performance gains. If you can keep everything in memory, it can make a big difference. If you can keep everything on the CPU, even better.

All these numbers are just rough. Because of these differences, your hardware will have caches between these boundaries to help improve performance. That makes measuring things a bit weird. You'll get different results between the first and second time of doing anything. That's those caches. The difference is so stark that most CPUs have multiple levels of caches on them to help avoid going out to memory too frequently.

Consider the relatively simple case of adding variables together. Variables in your program are stored in memory. The act of addition is
done on the CPU. To do the addition, the variables have to be read from memory and put into registers on the CPU. Then the CPU does the addition, typically putting the result in a third register, although some hardware uses two registers and overwrites one of them. Then, to put the result in a third variable, it has to be written back out to memory. Think back to the latency values from before. Let's simplify this and say that the two variables can be read in a single access. An access costs 100 nanoseconds. In one nanosecond, the CPU can do a hundred instructions. That means the CPU could do over ten thousand additions in the time it takes to just fetch the variables from memory into the registers. I'm sure you can guess where this is going: a molasses-like pace of disk reading is in your future.

All right, let's change it up a bit and talk about different kinds of memory. Up until now, I've been talking about the physical memory in your hardware, most likely RAM. There's only so much RAM on your machine, and every process running wants some of it, so your operating system abstracts this away as virtual memory. When a program wants memory, it is given virtual memory, which could currently be in RAM or on the disk. The OS swaps the contents in and out of RAM from a swap file. This enables all the programs on your machine to use more memory than is physically available. The OS simply puts some of it down to the disk when RAM is tight. If your OS is smart, it typically makes this decision based on something not being used right now. This is why computers seem to go from zipping along quickly to chugging at a horrible pace when two or more processes are fighting for most of the RAM. The OS has to frequently swap the memory in and out of the disk. As disks are three orders of magnitude slower than memory, it shouldn't be a wonder that you notice some performance difference. Thankfully, all of this is done by the OS, and you don't have to manage it yourself. Being aware of the consequences of asking for a lot of
memory can make you a better programmer, or at least a programmer whose users are less cranky about the sluggishness of their machine.

Another memory concept is shared memory. The simplest model is for each program that you run to be contained inside of a process. Each process is managed by the OS and, among other things, is allocated some memory. For safety reasons, this is self-contained. You wouldn't want my process writing all over your process's memory. That'd be bad. Your program can actually have multiple processes. There are a variety of reasons for doing this, but most of them have to do with trying to do more than one thing at a time. Since memory is allocated to a process, you need a special situation to share memory between two processes. This is another feature offered by your operating system, and it is called, logically enough, a shared memory block.

Okay, you're an expert on memory now. How about that storage stuff? Consider this bit of code, which reads all the contents of a file and puts it in a variable named text, which of course lives in memory. To read the file, you have to give up control of your program to the OS by making a system call. Then the OS interacts with the disk, and it buffers the data from the disk before putting it into memory. This is a vast oversimplification. In fact, that code there is going to break down into at least two system calls: one for opening the file, the other for reading. But even that reading is more complicated. There are file pointers that need to move. There are buffers that need filling. In fact, how many system calls there will be is partially dependent on the size of the file being read. And I know this seems like it might be an obvious observation, but changing the variable named text will do nothing to the file. If you want to change the file, you have to do that whole process again, but writing things down to disk instead of reading. Why am I making obvious statements? Well, mmap does things differently, and you'll learn all about that in
the next lesson.

In the previous lesson, I explained the vast difference in speed between the various parts of your computer, gave an overview on different kinds of memory, and briefly touched on file I/O. In this lesson, I'll show you how all those things come together when you use the mmap module to map file contents into memory blocks.

If you were to write a function that did edits to a file, that function would likely read the file into a Python object (a string, a byte array, or something similar), make changes to the object, then serialize that object back onto the disk. It might be into the same file, overwriting its contents. It might be to a new file, or to a temp file that then gets renamed. But the results would be similar: new bytes on disk. Instead, the mmap module provides an alternative way. It reads the file into a block of memory, which is abstracted by an mmap object, then operates directly on that object, meaning both the memory representation and the disk representation change. This is both kind of simpler and kind of more complicated. You've got fewer steps happening, so you might get a performance gain, but you're a little more restricted on what kinds of things you can do.

Let's go play with mmap in the REPL, and I'll show you what I mean. In the top window here, I have a simple function that reads a file and reports how many characters were read. All of this is being done inside of a context manager, that's the with statement, so that the file automatically will be closed upon exiting the context block. Now into the REPL. In the bottom window, I've put a file name into a variable. This Quixote file is a text file containing 10 copies of The History of Don Quixote, pulled off of Project Gutenberg. All told, it's 24 megabytes of data uncompressed. It's 10 copies because I wanted something a bit larger than a single copy. Now I'll import the function in the top window and call it. So far, so good: 23 million characters.

Now let's look at the mmap equivalent. New function up top. The first thing you'll
probably notice is there are two context managers here. Like before, the file to be read is opened. That's line 5. The new bit is where the file handle from the open file is used in a call to create an mmap object from the mmap module. Like files, this object has to be closed, so like files, it gets put in a context manager to make sure everything is cleaned up automatically. mmap doesn't use a file handle. It uses a file number, which you can get from the file handle itself. In addition to the file number, it also takes a size and an access flag. Giving a length of zero, like I did here, you will get back a block of memory the same size as the file being mapped. The access flag is similar to the mode indicator in opening a file. I'll go into much more detail about this flag later. Inside of the mmap context block, I'm doing pretty much the same thing as I did in the file read function: I'm reading the whole thing into a variable, then figuring out how long it is.

All right, let's do this. Imported it in and called it. Pretty similar. You'll notice the amount of data looks different. mmap objects represent bytes, not strings. Python strings are in Unicode, and that means they may take up more than a single byte for a character. That's why the size is different.

One of the key reasons for using mmap over vanilla file operations is performance, but there's a big asterisk beside that special offer. Let's time the two functions and see the difference. I'm going to import the timeit library to do the timings, and now I'll time the file read function using timeit. The function got run three times, returning the results in the list printed at the bottom here. Pretty consistent: 0.067 seconds twice, and slightly faster the third time. Let's do it again with mmap. That's all over the shop, isn't it? The first time is far worse than the vanilla code, but the second and third are significant improvements.

This is where it gets messy. There are a bunch of variables impacting the outcome. First, you'll get different performance
based on file size. Second, you'll get different performance on different hardware due to what kinds of caches you have. Third is how your OS has implemented the mmap call. Depressingly for me, there is a known issue in the macOS mmap call that makes it significantly slower than running Linux on the same hardware. A colleague of mine running the same code on Windows was consistently getting a 10 times improvement. Do note that what I'm doing here is just using mmap to read some data and stuff it into a Python object. Although this might get you a performance boost, it is still stuffing things into a Python object. Depending on what you're doing, you may be able to stay inside of the mapped block, and that is where you'll see better gains. More on that later as well.

In that little demo, I yada-yada'd the whole characters and bytes thing. Let's dig into it a bit more. The mmap call uses a byte array representation. That means it sees everything as the bytes that make up the block, regardless of what the data represents. In the case of a Unicode string, a single character may be more than a single byte. That means you have to be careful how you read or write your data. The boundaries between characters might not be what you expect. If you're dealing with text data that is pure ASCII, you can get away with a one character, one byte assumption, but otherwise you need to be careful. If you'd asked me before running the previous code, I would have sworn the Don Quixote file was pure ASCII, but the character count didn't match the byte count, so there's something in there outside of the ASCII range. Over 70 kilobytes of something, in this case.

Let's go back into the REPL and see how this can mess you up. I have three functions for you to compare. The first one reads the file as text and prints out some data. Let me just import it and run it. Okay, I ran it on monty.txt, which has 39 characters of content. The first character is N, the sixth character is y, and the whole string is "Nobody expects the Spanish Inquisition".
Watch out, I hear they tickle.

Now for function number two. This one's similar to the text case, but this time I'm reading the file as binary. Importing it, running the new function on the same monty.txt file. Okay, 39 bytes, just like the 39 characters. That'd be that ASCII thing. The first byte is 4E, which is the hex code for capital N, and the sixth byte is hex 79, which I then conveniently show as a character, and which is still the letter y. And finally, printing it out, you get a string representation of the bytes. Because this is a chunk of binary rather than a string, Python prints it using the byte notation: a quoted value with a b prefix. And you can see the newline at the end of it. That pesky newline. Yeah, it was there before, but in the string version it caused the gap between the output and the next REPL prompt. Subtle. You could easily miss that.

Let's try these two functions with some different data. I have another file that I'm going to load. Looking at snake.txt as a string this time, the length is 26. Remember, that's in characters. The first character is a cute little snake, the sixth is an e, and the whole thing is filled with emoji goodness. Before showing you the binary, I want to show you some info about the file. This rather lengthy bit of code creates a Path object, calls the stat method on that Path object, and then gets the size value out of the resulting object. The 39 here is how many bytes the file is. monty.txt had 39 characters and was 39 bytes. Now let's try the snake file. Hmm, it says 35 bytes, but the text function said it was 26 characters. That's important. The emoji in the file take up more than one byte each, making the total different.

Let's try using the byte-reading function on snake.txt. There are 35 bytes, which matches what the Path object said. The first byte is F0. What's an F0? Well, the first character is the Python emoji, which takes up four bytes. The F0 is just the first of those four. The sixth byte is an ASCII character, though, so you can see the m. The string representation
of this gets quite messy because the bytes are printed as, well, bytes instead of the ASCII equivalents. The hex F0, hex 9F, hex 90, and hex 8D all get combined to make the snake character.

All right, onto our third function. This is a course on mmap, after all. The new function up top here accomplishes the same thing as the binary reader, but it uses mmap instead. I'll import it and run it, and you see the same kind of result as the binary reader.

To recap: when you're using mmap, you're in byte land. Everything you do is operating on a giant byte array. If the data you're playing with is a Unicode string, you need to be careful. There are even more Monty Python quotes coming your way. Next up, I'll dive deeper into the mmap call and show you many of the operations you can do on your mapped block.
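
The on-screen code is not part of the transcript, so what follows is only a rough sketch of the two reading approaches described: a plain text read and an mmap read with a length of zero and a read-only access flag. The function names `regular_read` and `mmap_read`, the temporary sample file, and its contents are assumptions made for illustration, not the course's actual code.

```python
import mmap
import os
import tempfile


def regular_read(path):
    """Read the whole file as text and return the decoded string."""
    with open(path, mode="r", encoding="utf-8") as file_obj:
        return file_obj.read()


def mmap_read(path):
    """Map the whole file into memory and return its contents as bytes."""
    with open(path, mode="rb") as file_obj:
        # mmap takes a file number, not the file object itself.
        # length=0 maps the entire file; ACCESS_READ makes the map read-only.
        with mmap.mmap(file_obj.fileno(), length=0, access=mmap.ACCESS_READ) as mapped:
            return mapped.read()


# Hypothetical sample file: a four-byte snake emoji followed by ASCII text.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", newline="", suffix=".txt", delete=False
) as tmp:
    tmp.write("\N{SNAKE} mmap\n")
    sample_path = tmp.name

text = regular_read(sample_path)
raw = mmap_read(sample_path)
os.remove(sample_path)

print(len(text))    # length in characters
print(len(raw))     # length in bytes, larger because the emoji is multi-byte
print(hex(raw[0]))  # first byte of the emoji's UTF-8 encoding
```

The character count and the byte count differ here for the same reason they differ for snake.txt in the demo: the text read decodes UTF-8 into characters, while the mapped block only ever hands back raw bytes.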

Original Description

Python’s mmap provides memory-mapped file input and output (I/O). It allows you to take advantage of lower-level operating system functionality to read files as if they were one large string or array. This can provide significant performance improvements in code that requires a lot of file I/O. This is a portion of the complete course, which you can find here: https://realpython.com/courses/python-mmap-io/

The rest of the course covers:
- mmap Operations
- Accessing shared memory
- How to change a portion of a file without rewriting the entire file
- How to use mmap to share information between multiple processes
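
As a small illustration of reading a file "as if it were one large string or array" and of changing a portion of a file without rewriting all of it, here is a hedged sketch using slicing, `find`, and slice assignment on a writable map. The sample file and its contents are assumptions for the example; the full course may approach this differently.

```python
import mmap
import os
import tempfile

# Hypothetical sample file, written as plain ASCII bytes.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"Nobody expects the Spanish Inquisition\n")
    demo_path = tmp.name

# A writable map needs the file opened for both reading and writing.
with open(demo_path, mode="r+b") as file_obj:
    with mmap.mmap(file_obj.fileno(), length=0, access=mmap.ACCESS_WRITE) as mapped:
        first_six = mapped[:6]              # slicing returns a bytes object
        position = mapped.find(b"Spanish")  # search works much like bytes.find
        # Slice assignment edits the file in place; the replacement must be
        # exactly as long as the slice it overwrites.
        mapped[position:position + 7] = b"SPANISH"
        mapped.flush()                      # push the change out to the file

with open(demo_path, mode="rb") as file_obj:
    on_disk = file_obj.read()
os.remove(demo_path)

print(first_six)
print(position)
print(on_disk)
```

Only the seven mapped bytes are touched; nothing reads the file into a Python string and writes it back out.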

This video teaches how to use Python's mmap library to improve file I/O performance by mapping files into memory, allowing for direct operations on the memory representation and the disk representation, with examples and comparisons to vanilla file operations.

Key Takeaways
  1. Read a file using memory mapping with mmap
  2. Compare performance of file I/O with memory mapping to vanilla file operations
  3. Time the two functions using timeit
  4. Use mmap to read data from a file and store it in a Python object
  5. Handle character encoding and byte boundaries when working with text data
💡 Memory mapping with mmap can provide significant performance improvements in code that requires a lot of file I/O, but requires careful handling of character encoding and byte boundaries when working with text data.
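
A minimal sketch of the timing comparison from takeaways 2 and 3, assuming a generated temporary file rather than the Don Quixote text used in the video. The names `regular_io` and `mmap_io` are hypothetical, and both functions read in binary mode here so that the two measurements cover the same work. The numbers will vary with file size, hardware caches, and operating system, so no particular result should be expected.

```python
import mmap
import os
import tempfile
import timeit


def regular_io(path):
    """Read the whole file with ordinary file I/O and return its byte count."""
    with open(path, mode="rb") as file_obj:
        return len(file_obj.read())


def mmap_io(path):
    """Read the whole file through a read-only memory map and return its byte count."""
    with open(path, mode="rb") as file_obj:
        with mmap.mmap(file_obj.fileno(), length=0, access=mmap.ACCESS_READ) as mapped:
            return len(mapped.read())


# Hypothetical test data: roughly 5 MB of repeated ASCII text.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"In a village of La Mancha\n" * 200_000)
    big_path = tmp.name

# Three single-run timings each, mirroring the three results shown in the demo.
regular_times = timeit.repeat(lambda: regular_io(big_path), repeat=3, number=1)
mmap_times = timeit.repeat(lambda: mmap_io(big_path), repeat=3, number=1)
sizes_match = regular_io(big_path) == mmap_io(big_path)
os.remove(big_path)

print("regular:", regular_times)
print("mmap:   ", mmap_times)
```

Running each function more than once matters: the first call often pays for cold caches, which is one reason the first mmap timing in the video looked so different from the second and third.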
