Loop Tiling - HTTP 203
Skills:
Systems Design Basics70%
Key Takeaways
Optimizes image rotation code using Loop Tiling and WebAssembly
Full Transcript
but stick three I don't know if you've noticed let me build a thing are we gonna talk about Scrooge again a little bit okay another aspect of Skoosh all right so that's kind of interesting okay so this might be a long one so bear with me I'm gonna start at where we started and then we kind of fell down into this rabbit hole and I want the audience to fall into the rabbit hole with us with it yes and I'm really looking forward to this one because sometimes when we do these yeah one of us is maybe slightly pretending to know less about the subject than we do whereas in this one there's a lot that I really don't understand and I'm really worried and might not actually be able to explain everything as much as you would like me to okay well let's see where we end up yes I'll let you know honestly let's start with what our images on the web if we manipulate them with JavaScript so let's talk about image data which is a data structure that we use in squash so once you get an image in we decode it and we turn it into an image data object which is a data structure that exists on a platform basically has three properties with height and data yep data there is a un8 clamp array and in there you have just four bytes for each pixel yes and they it's the first row and the second row and so on so each one is like red green blue alpha exactly so you see here's like it's a red pixel than a green pixel and a blue pixel and white pixel and because the image is two by two that is what the image would look like right nice so it's only just a series of numbers with no concept of rows or columns but because of that information we can rearrange mean to interpret them as a proper two dimensional image ring that's kind of how it works all right so now in Scrooged we had the goal to rotate an image by 90 degrees sounds like a simple thing probably only take 10 minutes I mean you wrote it and the first version right and so let's talk about how you wrote it you rotate an image by 90 degrees gets an input image which is this image data object yes it is and and what we do figure out by ng growth what is the new width in a new height which is pretty much just you know height and width swapped your doing fancy Surma code already a little bit because otherwise it wouldn't fit so I'm compressing things down right okay so here you're essentially assigning the height to the width and the whip to the high because it's 90 degree right okay I'm favoring and I'm training a new output image width has this new width and the new height yes so now the goal is to go through pixels and put them in the right spot in the Alpha so you get relief for loop over all the pixels in the input image and we figure out where they would have to land in the output image so basically the new x-coordinate is that kind of form within the y-coordinates that and then we figure out which input pixel it is which out to people and just copy it over more fancy so my code here wouldn't get through review I know okay and then because we have four bytes per pixel we just loop over four times and just do everything right like we copy the R value the G value the B value and the a value yeah enough enough we go and this works and this was actually decently fast we shipped it this way we should say the reason we did this rather than canvas is because we wanted to run it in a worker that's an entire difference oh but yes we did a lot of tests with what seemed more fancy fashion technology didn't seem to work so we ended up writing our own piece of JavaScript just for this problem yes because off-screen canvas only in a couple of browsers whereas this is just basic JavaScript JavaScript so that works everywhere anything on a worker because it's an image of it so this we should this this worked yeah yes and then I looked at at some point and was like there's actually um kind of not too obvious optimization that you missed and so I basically I bet you added a little patch and this all states the same same as before but now I'm creating a you 32 array yes but this is so basically we have the same underlying chunk of memory but instead of seeing it as a series of bytes we see it as a series of 32 bits numbers right pixel consists of a 32-bit number right for RGB and a and so this way we can simplify or actually remove the inner loop so yeah it's just bit that was here that that you know doing something four times every time we're now just doing it one copy operation which actually maps to a machine instruction or sometimes a v8 will be like super smart and go like well fast so they sort actually know quite a bit faster yes so cool and then we ship this still fine and then it turns out that for some reason in one browser this was super slow right and we've been advised by legal by our legal department to not name the browser apparently it's a chrome policy not it I've never heard that before but no we're not allowed to talk about other browsers so we can't mention which browser it is one that didn't run no machines it didn't run on your machine did it you know you know that's here on this different browser okay either way like most brothers were fine good enough at least and then for some reason this one brother just ended up being extremely so like unreasonably slow even so we must have hit some weird corner case yes yes brother isn't slow usually it's a very good browser yes and and different JavaScript engines optimized with different things so the fact that one browser was slower here isn't saying that that browser is terrible is just saying like v8 is very good with this kind of tight loop code yeah other engines have optimized for like more dom bindings exact so it wasn't that surprising that one browser was completely different in terms of plans with this piece of code so we thought no what do we do maybe we fry more web assembly at the problem right shall we looked into that and the first problem we had that when you're bright represent Lee and you loaded it turns into a module that has functions the functions that you wrote in whatever language you were using yes right it's this is different to an echo script module it's it wasn't modules a different thing in a different thing and these functions can only take in and return numbers so it's no easy way straight up to like pass in an image so what do you do right so what what we ended up doing I'm going to reuse the video I made for my article brilliant basically the Java so we're going to load the image put it into the web assembly memory and then we're going to use web assembly to just do the reordering within that web assembling memory buffer and use JavaScript to read it back afterwards right so that means the represent me really is completely isolated from all of the outer world really so this it just has its chunk of memory to work on we'll read in the image do the reordering that we've shown before and then Java swoop comes back takes over and reads back the resulting image so these the javascript and web assembly the thing they share is memory like they carry much but is it twelve assembly it's its assembly dog memory is webassembly specific memory but it is also exposed as an array buffer that we can use either in Java in 32 array or whatever we need in that very instance right so the memory we need for web assembly is essentially double the size of the image because it's going to yeah you can start with the main image in memory and then the next bit and then okay yeah so how do we create a zone we've done it before with M script in C but yes also rust but we actually found a very interesting project we stumbled over called assembly script yes which is a they call themselves a type script to webassembly compiler mm-hmm which is true but might be a little bit misleading because it is not you can't just take any type script and compile it to Epis Emily it is using the typescript syntax and the typescript standard library things but with their own version of certain library that is specifically tailored to web assembly so what you can see here is the signature way now we have types as you know form web form typescript but there's the I 32 type which the type web assembly has but JavaScript doesn't and that's 32-bit integer right the yes the signs 32-bit interests signed for it's also the okay you 32 which is the unsigned while we're using signed for reasons for reasons okay let's gloss over it yeah this is good cuz I I can recognize this it looks a lot like JavaScript and so a lot like well the rest except for two lines so this looks the same you know which Hyden with yep now this is a bit interesting because we have this chunk of memory mm-hm we can't even know where our input image starts and where our output image starts right that's what these two variables are so our input image starts at zero at address 0 in this memories are always those index 0 you can say and the output image is right after the input image ends and the input image consists of width times height times 4 5 4 bits per pixel by its lights perfect so thank you see the thing about this and I'm sorry to interrupt the flow say I should say that I came to the web as a CSS person CSS front-end and I learned JavaScript whereas you came to the web from being a programmer well and then you went to the desk I was I I did embedded systems like I was literally writing kernel code and like low-level memory management and I had no idea about CSS and how to do UI in anything like that right it's just two completely different angles by would slightly if anyone is watching this thinking what is going on yeah this is I am feeling exactly the same system don't worry too much about it right come on but for now these are basically indices in the array where does the image start where at the output image start and then this looks familiar looping over all the pixels yeah I'm freaking out where the new coordinates are we did all this before yeah and now there's these to assembly script specific functions the first one is load which allows me to load a u-32 from the memory at a given address right and so in this case what I'm doing is I'm using the input image space that's where the image starts plus the pixel I want to read so this is very similar to what we were doing before with the U in 32 yeah ray but it's word but with this I seem to get it straight from memory Ravi yeah because it's a webassembly memory and that's like right listen it's something that you get handed FS as a reference it's just there as a global almost like but it's the same thing we're passing the same industry in X to it exactly okay so we're loading our pixel and then all we have to do is like write it back to the output image and it's the same thing we're storing the value V which we just read mm-hm back as I usually to into the output image base okay and now we have written assembly script and and this converts to web assembly and what I what really struck me with this is that if I wanted to write web assembly this is the tool I would use yeah because this looks very early it's very much right because yeah you I think you've learned a bit of C because of Skoosh yes but that's pretty much it as far as I know you have not written rust I think you can write it I know PHP PHP to webassembly compiler but I would love it as the first first language I learned so we have this function now and now we want to compile to web assembly and luckily assembly makes it very easy so we just install the assembly script package and we have an ASC command which we give our typescript filter and give us back webassembly file with no additional glue or JavaScript which I think it's quite interesting because most other implementations for web assembly give you blue code which is huge javascript file device really yeah difficult to deal with and work with but this is just yeah just a wasn't right so we did this we got a road it wasn't foul and now the interesting bit might be how to load it because usually blue code loads it for you but now you don't have glue code how does this work it's actually not that difficult what you do is you take the instantiate streaming function from the web assembly object right and put your action there because the web assembly compiler at least the non optimizing one can compile while the weather is still downloading so this essentially screaming takes a promise a promise or response or an array buffer or that's a weird API and it's like why would why does it take a promise it's us they want to make this simple if you don't have to await the fetch right okay don't I don't agree with it but that's fine you should I put it away in there either way and it starts compiling wallet so darlings so it's not like download then compile it's actually almost in parallel which you know for webassembly most we can be quite big you know like I think the unit the Unreal Engine one is like 40 megabytes that will make quite a difference yes absolutely not so much here so no absolutely not so yeah there was models by the way it's like 500 bytes or something so it's really small it's all it's smaller than the compressed and gzip JavaScript code that we had nice actually cool and now we get an instance back from this one and on that instance we can have exports and then exports is all the functions but also the memory that we are going to work on right so we can grow our memory because we don't know what size it has but we have to go to the size that it fits our images two times right which would have to do calculus kip this year but so that would just be that this the size of the invalid array they are times to yeah okay okay and then I will somehow load this image into the buffer which is really just memory has about buffer property which is a normal array buffer so we can use all the methods you know to put data in there right there's to this in yeah and then you're a Korell hit 90 and with the image back and you're done also exports has a goal of the method so this is the method this is the magic where you call into webassembly and you can all see it's synchronous so webassembly is something that will actually take the control away from javascript and do its thing and then return the control back to javascript it's just like an actual function okay okay I think it's super nice and so this was fast we were super happy about this yes this this was much faster than well it wasn't faster in Chrome in the sense that like it didn't outperform jasmine was asked fast or almost asked fast but it was consistently fast across all browsers yes it was it taken the browser it doesn't run on Mac from seven seconds down to Lake 500 or something 500 milliseconds something was very very accessible yeah it was really nice to see that similar value across all browsers so we were super happy about so we you know we open the PRN or Scrooged and you reviewed it and we wrote an article and then uhm hacker news happens how can use happened and that's something I would never say because usually the comments on our articles are quite annoying I can use has a can sometimes be quite pedantic I forget but in this instance there was some pedantry but the pedantry was really interesting it was really hard thing some fascinating results and I was both just it's a lot of it I didn't understand yeah and I hope you're gonna explain it to me yet so someone said why aren't they using tiling tiling would make this so much faster but they quickly tried yeah I totally did this in like 20 milliseconds I was like what yeah so they were taking it from what what was it sort of friendly 400 millisecond dip down to what was it I think forty forty which is it's approve and and that's even faster than we were seeing from a canvas the element like it's yeah so and I had like obviously sit down and actually understand what's happening so let's talk about what tiling actually is yes please do because I have no idea so I'm gonna explain tiling but it was also another suggestion for performance of to manage I'm gonna talk about both of these other one first because to get it out the way basically some people were saying oh if you look at this Y times width it's completely independent of the inner loop so if I move it out between the outer and the inner loop I would make it faster because I calculation can happen only once / outer loop it doesn't need to have them every time in the inner loop yes and I thought this was going to be the kind of thing that the optimizer thingy did I would take care of for me end of this so this is exactly kind of advice where you don't have to worry about these kind of things or like moving constants out of a loop is something that not only most compilers can do it's like the assignment your compiler could do this or the RUS compiler but even the v8 compiler from that go from the JavaScript to machine code or from web assembly byte code to machine code will do this mm-hmm so this is an operation we don't have to do and where we can say you know let's keep it readable and obvious and don't introduce another variable where people that read the code would have to have even more state in their head to understand what's going on yes okay but the other thing is tiling and tiling is something that I hadn't heard of I actually had heard of it but I also wasn't the impression that the compilers will do it for us and in this case it is not what is it sort of tiling so this is an image or actually the album cover for podcast I don't know did you know that we do a podcast a podcast as well we should link to it in the description Jake yes we should so we have been reading this image so far like this we should going row by row yeah and just you know what is this pixel pairs abalone okay copy and look at the next pixel in the same room that's kind of what we did and we thought fine tiling is a different approach where you tile the image into tiles that's good yet those are tiles and then do whatever you're trying to do within a tile first so instead of going row by row your school tile by tile and within that tile you go row by row this this is legitimately it's the same thing this is legitimately a different way of doing the same thing I know now the interesting that this turned out to be so much faster yeah an extent at the time I still don't understand yet so let's implement this real quick which because it's not actually can I just say one of our previous recent episodes we talked about the dangers of over optimization yeah and when what why are we doing this because it ends up being so much faster okay okay we actually met with this optimization we end up going well below 100 milliseconds which within the rail guidelines makes it feel like an instantaneous response to the button it is an optimization before that we were at like 300 to 500 sign but you know if we can go under 100 we should go under 100 especially for bigger images okay so basically I just do an additional two outer loops but usually sounds wrong but in this case is very very right where we iterate over all the tiles that we have and then in there we basically have the same old loop where we loop over each individual tile I'm starting to hyperventilate wait why okay I so this is tiling implement it so I get it and let me talk about why this might thinks make things faster okay that is the bit I don't understand so originally tiling whenever you when I google tiling and research it it was mostly the use case for matrix multiplication which is a different use case because input values I used multiple times so right if you if you multiply two matrices you have to read the cell at one-one multiple times for I think for each column that you're cutting the output matrix okay so it makes sense that if you do tiling you have a better chance of having that value still in the cache we're talking now processor level 1 cache by the way so it's a hang on like okay we will need to explain what that is at some point as well but late my feeling is by reading memory sequentially you're more likely to hit caches because you're talking you're dealing with a little bit of memory that was very close to the last yeah so if I have these two really big matrices and I go through the first row of the input matrix by the time I come I end up at the ends the values from the start might have been kicked out the cash because level one cache and the processor is really small we're talking like 200 kilobytes of cache maybe all right so the the processor has like an l1 cache you know it's like super fast so that there's these set of caches that gets bigger and slower yeah until you get to memorize your memory is actually really slow memory is really just in relative terms yeah okay and so what the tiling does is by shore meaning the amount of time you spend going away from your initial value you have better chance of having the initial value still well for matrix multiplication so with this one yeah this is a hyung that sense why this would make it because ER because the the second row is a massive jump from the first the rotation we read every value once and we write it once so why would caching make things better that's that is roughly the question I have in my head so there's two theories and I don't know one of which one of them is actually true are you telling me you don't even know well I even talked to Benedict our vADM engineer and I have two theories but it's really hard to test ok so ok one version is that lots of processing artists are really smart at predicting what memory you are going to grab next so by basically seeing the child it can make better predictions what Oh sells to grab already put into the cache for you even though you have an exit that code yet and the other thing is that because the cache is so small that there's a certain pattern which cell can be cached in which cache cell so this gets a little bit confusing but basically if you think about it you have if you have like three cache cells just three individual cells what is what what can go in a cell like one value one value okay okay um you say okay so memory address 0 can only go in cache cell zero and iris 1 can go and catch someone cash two can go in cache cell two memory address three can only go in ziering and you wrap around right so you assign those yep and then again by keeping it smaller you have a better chance of not overwriting the old veldt if you have put in into your level 1 cache so basically all this is about is by making things making your access memory smaller so that you don't evict the cash from the things that you what so this is basically it's working because our inner loops are smaller yeah it makes better it so it makes the processor make better predictions and also make those not evict the cache because the area we work on is smaller so then just the tile size yeah what's the top slice so that's what I thought right okay so I did some benchmarks on a MacBook on an iMac and a pixel three because bigger the machine or the bigger the processor the bigger the level come one cache usually is right so the iMac that I have is like an 18 core massive structure everything it has massive l1 while the pixel 3 obviously has a very very tiny level 1 cache all this code is single call anyway right yeah so yeah so basically at 0 is the relative time it took for no tiling so that's that's the original that's the baseline time wasn't yes so what you can see here is how the time shift we relatively tow that base time depending on what the tile size is dangerous so if I have a tile size of 2 a 2 by 2 pixel grids it makes the code slower which is not very surprising because you have so much more looting going on and more jumps okay it gets faster really really quickly at some points over here you kinda hit level 1 cache boundaries for it then gets slow again right I see ok to be honest there's one weird thing where the pixel 3 is slow even with a massive grid which I'm not quite sure why that is I think actually fast even yeah it's just I I expect the pieces like go up somewhere around here you'd assume the level 1 cache is less than Mac probably is and that's probably other affected I don't quite understand ok but what I thought architecture is well in that processor but it seems to be a sweet spot between like 16 and I don't know 64 depending on what you want I'm actually I think 16 looks really promising in this graph which means you have like it 256 pixel grit that you work with I thought I was gonna come into this episode and I was going to go away understanding why the tiling works no it just does it's it's it's as much I spent the last week on this right right you've been kind of sitting across the moon hearing me talking to people and trying to figure this out this is as close I've gotten to understand it and that there is this interaction between the process are predicting what values were in the cache and then not forcing the positive evict that cache because you read too far ahead but this is a massive case for tools not rules right like don't don't go away and rewrite all your codes with tiling your title no you guys is something you would have to very carefully profile on this wide range of machines yeah from processor architectures to see is this actually when we cross I find anything because we started at let's rotate an image a very high level use case and we fell down and ended up with like let's talk about processor architecture and level 1 caches yes so thanks to hacker news I guess for ruining my week but it's been actually been very educational even though I still don't fully understand it but I feel like hey man that yeah I feel like my understanding of lower level stuff is like I said there's that confusion element but I feel like I've got an appreciation for like the smarts right that yeah I go into that this it's it's incredible so let's let's take a breather I knew you'll see our Rikuo audience next time but is it this is going in description yeah it's coming right yes oh that's what one thing out there let's write this down um how do I fix this oh the Edit something oh okay oh yeah
Original Description
Jake and Surma talk about how they optimized the image rotation code in their app Squoosh, how Hacker News taught them about Loop Tiling.
HTTP 203 Podcast: https://developers.google.com/web/shows/http203/podcast/
Surma’s article: https://developers.google.com/web/updates/2019/02/hotpath-with-wasm
Watch on YouTube ↗
(saves to browser)
Sign in to unlock AI tutor explanation · ⚡30
Playlist
Uploads from Chrome for Developers · Chrome for Developers · 0 of 60
← Previous
Next →
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
Polymer Performance Patterns (The Polymer Summit 2015)
Chrome for Developers
Polymer Power Tools (The Polymer Summit 2015)
Chrome for Developers
Chrome Dev Summit 2014 – Chrome Case Studies
Chrome for Developers
Web Directions Code 2015 round up
Chrome for Developers
Maintainable Code - HTTP203
Chrome for Developers
iron-ajax… wat?! -- Polycasts #26
Chrome for Developers
The Guardian - Supercharged
Chrome for Developers
ES2015 (next version of JavaScript), Totally Tooling Tips (S2 Ep1)
Chrome for Developers
#AskPolymer: Rob answers all the questions ever -- Polycasts #27
Chrome for Developers
The Future of JavaScript - HTTP203
Chrome for Developers
Data Binding 101 -- Polycasts #28
Chrome for Developers
The Guardian part 2 - Supercharged
Chrome for Developers
The Future of Web Audio: with Chris Wilson and Chris Lowis
Chrome for Developers
Chrome 46: New motion-path animations, client hints and service worker improvements
Chrome for Developers
Sublime Snippets, Totally Tooling Tips (S2 Ep2)
Chrome for Developers
#AskPolymer: How do you make the show? -- Polycasts #29
Chrome for Developers
Critical Path CSS, Totally Tooling Tips (S2 Mini Tip #1)
Chrome for Developers
Binding to Objects -- Polycasts #30
Chrome for Developers
Player FM - Supercharged
Chrome for Developers
Where’s the Designer? #AskPolymer -- Polycasts #31
Chrome for Developers
Jake Beats Wikipedia - HTTP203
Chrome for Developers
Supercharged Observers! -- Polycasts #32
Chrome for Developers
Jai's Web blog - Supercharged
Chrome for Developers
Windows Command-line Tooling, Totally Tooling Tips (S2, Ep4)
Chrome for Developers
What about internationalization? #AskPolymer -- Polycasts #33
Chrome for Developers
Developing for Billions (Chrome Dev Summit 2015)
Chrome for Developers
Google+ Performance Improvement Comparison
Chrome for Developers
Deploying HTTPS: The Green Lock and Beyond (Chrome Dev Summit 2015)
Chrome for Developers
Progressive Web Apps (Chrome Dev Summit 2015)
Chrome for Developers
Instant Loading with Service Workers (Chrome Dev Summit 2015)
Chrome for Developers
Increase Engagement with Web Push Notifications (Chrome Dev Summit 2015)
Chrome for Developers
Engaging with the Real World: Web Bluetooth and Physical Web (Chrome Dev Summit 2015)
Chrome for Developers
Asking for Permission: respectful, opinionated UI (Chrome Dev Summit 2015)
Chrome for Developers
Polymer - State of the Union (Chrome Dev Summit 2015)
Chrome for Developers
Building Progressive Web Apps with Polymer (Chrome Dev Summit 2015)
Chrome for Developers
Introduction to RAIL (Chrome Dev Summit 2015)
Chrome for Developers
DevTools in 2015: Authoring to the max (Chrome Dev Summit 2015)
Chrome for Developers
RAIL in the real world (Chrome Dev Summit 2015)
Chrome for Developers
#ChromeDevSummit talks are up - W00T! -- Polycast #34
Chrome for Developers
V8 Performance from the Driver's Seat (Chrome Dev Summit 2015)
Chrome for Developers
Quantify and improve real-world RAIL (Chrome Dev Summit 2015)
Chrome for Developers
Owning your performance: RAIL (Chrome Dev Summit 2015)
Chrome for Developers
HTTP/2 101 (Chrome Dev Summit 2015)
Chrome for Developers
Leadership Panel (Chrome Dev Summit 2015)
Chrome for Developers
Build Processes, Totally Tooling Tips (S2, Ep 5)
Chrome for Developers
Accessibility (Chrome Dev Summit 2015)
Chrome for Developers
Binding to Arrays -- Polycasts #35
Chrome for Developers
HTTP2 - HTTP203
Chrome for Developers
Chrome 47: Splash Screens, requestIdleCallback and better desktop notifications (New in Chrome)
Chrome for Developers
Call For Submissions - Supercharged
Chrome for Developers
Cross Device Testing, Totally Tooling Tips (S2 Ep6)
Chrome for Developers
Testing AJAX with Web Component Tester -- Polycasts #37
Chrome for Developers
Slack: Extended Xmas Special - Supercharged
Chrome for Developers
Browser testing with Travis & Sauce Labs -- Polycasts #38
Chrome for Developers
Optimize for production with Vulcanize -- Polycasts #39
Chrome for Developers
Highlights from Chrome Dev Summit 2015
Chrome for Developers
Chrome 48: Custom buttons in notifications, DevTools Security panel, and Presentation mode
Chrome for Developers
Crisper: Protecting your Polymer app with CSP -- Polycasts #40
Chrome for Developers
How do I use Sass with Polymer? #AskPolymer -- Polycasts #41
Chrome for Developers
Colors – DevTools Tonight #0 (Pilot)
Chrome for Developers
More on: Systems Design Basics
View skill →Related Reads
📰
📰
📰
📰
The Digital Skills That Will Actually Pay in 2026 (And the Ones Quietly Dying)
Medium · AI
AI And The Rise Of The Bit Economy: A Structural Shift
Forbes Innovation
2026 Is the Year Everyone Is Redesigning Themselves. Are You?
Medium · AI
EU tech chief and Tim Cook hold ‘constructive’ talks as Siri AI stays blocked in Europe
The Next Web AI
🎓
Tutor Explanation
DeepCamp AI