How Google Photos scales to store 4 trillion photos and videos

Google Cloud Tech · Beginner ·📐 ML Fundamentals ·2y ago

Key Takeaways

Google Photos leverages Spanner for scalable data storage and machine learning-powered features, ensuring low latency and high reliability for billions of users globally. The service utilizes asynchronous processing, automatic data replication, and sharding to support real-time sharing and ML-based features.

Full Transcript

I love taking photos. I have so many memories stored on my phone. Look at this photo I took last week with my dog. It's my favorite photo I've ever taken. And these, from our walks. I am so glad I have Google Photos. It makes it so easy to store, organize, and share my pictures. I can access my photos from anywhere and then share them with anybody I choose across the world. And it doesn't just do it for you and me: it stores photos and memories for over 1 billion users. At that scale, it must be an engineering challenge and a marvel to uncover and learn about. Are you excited to take a dive with me into how Google Photos works? Stick around, because this is going to be fun.

[Music]

Hi everyone, I'm Priyanka Vergadia, and today I'm here with Tracy and Dave from our Google Photos team, who will help us take a closer look at how Google Photos works. Hi Tracy and Dave.

Hey Priyanka. Hey Priyanka, great to be here today. So I'm Dave. I'm a senior software engineering manager at Google, and I lead the storage, serving, and ML infrastructure teams for Google Photos.

And I'm Tracy, a senior site reliability manager with Google Photos, and I've led the Photos SRE team for the past three years. My team and Dave's team work very closely to ensure that Photos is reliable, available, and efficient. And that's no small task, given that Photos serves over a billion users, as you said, Priyanka, and we continue to grow.

That's right. Google Photos is such a huge application, and I really just want to dive deeper into how it works. I'm so curious. So if we track, let's say, the life of a photo: I took a picture on my phone. What is actually happening behind the scenes? Can you walk us through that, Dave?

We have a lot of cool stuff going on in our system. It all starts when a user captures a photo or video on their phone. Google Photos will begin to upload the photos and videos to our product if that user has the backup feature enabled. Backup is a really important feature. It's very popular, and it helps users
automatically preserve their memories. Backup actually happens in two phases. First, we upload all of the captured media bytes into our encrypted blob store, and then we upload, encrypt, and store all of the associated metadata in Spanner: things like EXIF data, file name, and other product metadata. At that point, the user's photos and videos are safely stored in Google Photos. We know that users trust us with their data, so we take a lot of care, a lot of time, a lot of energy to make sure that the core values of privacy and security are kept in mind. But honestly, uploading the media is just step one. From there, a whole bunch of asynchronous processing kicks off, much of which is tailored to a user's individual settings. That processing drives many of Google Photos' most magical machine learning-powered features, like Memories, search, and a bunch of organizational features, and users get to experience these features very shortly after upload.

You just described it so well there, because you upload the pictures, and I'm not even a part of that upload. I just took a picture and it automatically backs up. And then, even if I'm not doing anything, I just go back after a few days and I see some memories. It's that seamless processing of things happening behind the scenes. As a user, it is magical as an experience. So thanks for walking us through that. So Tracy, how are we as users able to access these photos all the time, really reliably, across the globe? What happens there?

Hey Priyanka, great question. As you said, that's the magical part of Photos. Photos actually relies on Spanner to automatically replicate our data and ensure that the data is co-located with our global user footprint. The sharding by Spanner also gives us low latency worldwide and makes it easy for us to support the ever-increasing set of regulatory requirements concerning data residency. However, as the SRE in the room, what's really
interesting to us is that the system simultaneously has to be reliable and available for our user uploads, which is what you experience, but we also need to ensure that the ML-based features that Dave talked about perform well, and that ML and batch computing can't impact our interactive users. Spanner's sharding flexibility allows both use cases to be satisfied in the same database. We have read-only and read-write shards to separate these use cases, because we need to serve our active online users quickly. We know they expect their photos to be instantaneously displayed and shareable, and we don't want those ML and batch features, which bring a lot of the magic to Photos, to interrupt them. We don't want to slow our users down. An additional benefit of these Spanner shards I've talked about is that we can perform slow rollouts, which allows us to observe how changes perform incrementally, and that's a huge win for the reliability of the Google Photos product.

That's great, because with Spanner you're literally using just one database, and that's driving all of these experiences, whether it's batch upload or these machine learning and really rich features that we enjoy and experience. Now, as we talk about machine learning, I have to ask: can we dive a little bit into how some of those ML features work, especially at that scale of photos? I'll take an example here. Let's say I love my dog, I'm searching my library for all the pictures of my dog, and I want to create a custom album. How does that work?

Scale has become a bigger and bigger challenge as we've grown as a product. We rely really heavily on asynchronous background processing to handle that scale, and it actually makes up the majority of our workloads these days. After a photo gets uploaded to the system, we queue that photo up for various forms of offline image processing and ML inference that help us label, understand,
and organize that media within a user's library. The system doesn't just try to infer the semantic contents of a single photo; it actually tries to understand how that content fits into the broader context of a user's library. Doing so involves many large and complicated queries within a user's library, all of which happen concurrently with many other reads and writes running at the same time. Once the system has inferred labels from that photo, the labels get stored in Spanner and indexed by our downstream indexing system. They get combined with outputs from our clustering algorithm and form a basis on which users can search their library and experience it in different ways.

This reminds me of an example. At my wedding I had some extensions on, so I had long hair, which I never had before. When those pictures got uploaded, there was this confirmation: "Is this still you?" It was taking four or five of these examples and trying to evaluate if it's still me, which was amazing. I love those features where I don't have to tell you that it's me; you're already identifying that it's me. It's all that machine learning goodness that's happening. So Photos search really is an amazing feature, but it also sounds like a lot went into building it, right?

Yeah, to your point, the system is constantly learning and improving, and we really want to deliver the best possible user experience that we can. We push ourselves constantly on it. But to tell you the truth, we're so lucky to have a large and really talented team working on Google Photos, and we're also supported by a great research team and a whole bunch of great infrastructure teams. It really is one big team effort to make all of this happen.

And now I keep going back to that life of a photo, because I really grasp concepts with examples. So here's another
scenario. I love the pictures from my recent trip to Costa Rica, and I really want to share them, let's say, with my mom, who is across the world in India. How does sharing like that happen in the real world? How does that work?

Yeah, so we're a close-ties sharing product, and we really want to make sure that no matter how far away your friends and family are, you're able to share your memories with them. And even with a global user base, users still have high standards: they want seamless, real-time sharing of their content. Fortunately, Spanner has global indices and other functionality which have helped us deliver on that user experience.

And if you think about it, this really is conceptually very simple. You have users who are uploading, their photo bytes are going into blob storage, the metadata is going into Spanner, and there are lots of microservices that are all receiving and adding information to this central, huge-scale database, Spanner. But the big deal here is the choice of tools, right? In this case, Spanner. So can we go into how Spanner supports Photos at such an incredible scale?

Simple is definitely good, but the truth is that we actually run hundreds of binaries and microservices under the hood, and all of these services have different access patterns, ranging from small lookups to large data scans, as well as a variety of latency and reliability requirements. But despite all that, Spanner has been able to meet every requirement we've had, completely off the shelf. It's been super flexible and really easy to use, and I think it's really impressive given our massive scale.

So one of the key things that I care about as a user of Google Photos is the security of my photos and videos: who can access them, and when and how. So how do you handle the security aspect in Google Photos?

User trust is a key part of our story. Spanner is super reliable, so we're confident that user data will always be available when a user wants it or needs it. And it's also super secure: data is encrypted at
rest, and it's tightly access-controlled to protect against bad actors or other unauthorized access. These are very fundamental security measures that we have in place for every feature, and as we build new features, security and privacy are kept in mind from the very start, which has helped us iterate quickly without compromising the bar.

That is great to hear. As a user, when I have these precious memories in my photos, I want to be able to trust that I'll be able to access them reliably wherever and whenever I want. So can you tell us a little bit about how Spanner assists with not just security but also that trust?

I think there are two key things. I think we have a simpler, more maintainable architecture, and I think it's helped us move more quickly. A lot of that is because Spanner is so scalable and easy to use. We have been able to sustain that single database, and it's yielded a simpler development experience and allowed us to lean into our microservice architecture. All of this in turn has helped us manage our technical debt and support a fast, tight release cycle. Not to mention, Spanner's SQL features have helped our developers write highly optimized queries and more easily debug the features and services that they're building.

Yeah, Dave, I was just remembering how there was a lot of doubt in a lot of people's minds that we could actually run a single database at this scale, and we've actually been very successful at it. But I would say, equally important for velocity on the SRE side, the fact that Spanner is reliable allows us to run a single database, and with the features within Spanner we've significantly reduced our toil as an SRE team. We save a lot of time and energy on tactical placement, location distribution, sharding, scaling, redundancy, and backup management for replicas. For example, all we need to do is input the specs to Spanner, and Spanner handles the change in replication that we desire. In addition, the
self-healing nature of Spanner, such as automated index verification, automatic shard draining, and guaranteed data consistency, saves us a lot of manual work.

Anything else that you would like to highlight that we haven't covered yet?

Look, the truth is that Photos has seen amazing success running on Spanner, and I think the numbers speak for themselves. Our product is massive: over a billion users and over 4 trillion images, all stored securely and privately, and Spanner has played a huge role in all of this. It serves millions of queries across dozens of geographical zones in order to support us. And believe it or not, we've actually experienced 10x growth since starting out, and we're confident that Spanner is going to be able to support another tenfold increase and continue to help us deliver amazing, incredible user experiences. It's really amazing what it's helped us to do, and I can't wait to see what it helps us do in the future. Quite frankly, I think it's going to really help us bring Google Photos to the next billion users.

That is amazing. Thank you so much for sharing your insights with me and walking me through how the whole thing works: what happens with the photo and what is happening behind the scenes, powering that amazing experience that billions of us are using today. Thank you, Dave and Tracy.

Hey Priyanka, thanks for having us. We love to talk about Photos all day long, as you can tell. Yeah, this was a ton of fun. Thanks so much, Priyanka.

Well, I really learned a lot about the life of a photo and how it goes through the Google Photos infrastructure. The big lesson as a cloud architect here is that the architecture is usually pretty simple; it's the choice of the right tools that leads to success in meeting those goals. In this case, Spanner is the key ingredient in scaling the Photos platform. Thank you.
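
The upload flow Dave describes (media bytes first, then metadata, then asynchronous processing) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `blob_store`, `metadata_store`, `processing_queue`, `backup`, `fake_label`, and `drain_queue` are hypothetical in-memory stand-ins, not the Photos team's code and not a Spanner API.

```python
import queue
import uuid

# Hypothetical in-memory stand-ins for the encrypted blob store, the
# metadata database, and the background work queue (all assumptions).
blob_store = {}
metadata_store = {}
processing_queue = queue.Queue()
labels = {}


def backup(user_id, media_bytes, file_name):
    """Two-phase backup: store the bytes, then the metadata, then enqueue work."""
    photo_id = str(uuid.uuid4())
    # Phase 1: the captured media bytes go to the blob store.
    blob_store[photo_id] = media_bytes
    # Phase 2: the associated metadata goes to the metadata store.
    metadata_store[photo_id] = {
        "user_id": user_id,
        "file_name": file_name,
        "size_bytes": len(media_bytes),
    }
    # The upload is now durable; labeling happens later, off the upload path.
    processing_queue.put(photo_id)
    return photo_id


def fake_label(media_bytes):
    """Placeholder for ML inference; a real system would run a model here."""
    return ["dog"] if b"dog" in media_bytes else ["unlabeled"]


def drain_queue():
    """Background worker: label every queued photo and return how many ran."""
    processed = 0
    while not processing_queue.empty():
        photo_id = processing_queue.get()
        labels[photo_id] = fake_label(blob_store[photo_id])
        processing_queue.task_done()
        processed += 1
    return processed
```

The point of the sketch is the ordering: `backup` returns as soon as both stores are written, and labels only appear after a separate `drain_queue` pass, which mirrors why features show up "very shortly after upload" rather than during it.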

Original Description

How Google Photos scaled rapidly on Spanner → https://goo.gle/44HbYDL

You can download, view, or edit your photos on Google Photos from anywhere in the world, anytime. How does Google offer such a reliable service for billions of users globally? In this Architecting with Google Cloud episode, Developer Relations Engineer Priyanka Vergadia interviews colleagues Tracy Ferell, SRE Lead, and Dave Perra, Sr Software Engineering Manager, from the Google Photos team to learn how their team keeps Google Photos reliable and billions of users happy.

Chapters:
0:00 - Intro
0:52 - Meet the Google Photos team
1:59 - The life of your photo on Google Photos
3:35 - How is Google Photos always available?
5:31 - How does machine learning work with Google Photos?
8:02 - How does Google Photos allow instant sharing globally?
9:43 - How does Google Photos secure and protect photos?
12:15 - Unprecedented growth
13:01 - Wrap up

Architecting with Google Cloud playlist → https://goo.gle/ArchitectingWithGoogleCloud
Subscribe to Google Cloud Tech → https://goo.gle/GoogleCloudTech

#ArchitectingWithGoogleCloud

Google Photos' architecture relies on Spanner for scalable metadata storage and for serving its machine learning-powered features, with low latency and high reliability for over a billion users globally. The interview explains, at a high level, how asynchronous processing, automatic data replication, and sharding support that scale, and how interactive traffic is kept separate from ML and batch workloads within a single database.
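
The search path described in the interview (inferred labels stored per photo, then indexed so a user can search their own library) can be illustrated with a small per-user inverted index. This is a sketch under assumptions: `LabelIndex`, `add`, and `search` are hypothetical names, and the interview does not describe how the real downstream indexing system is built.

```python
from collections import defaultdict


class LabelIndex:
    """Hypothetical per-user inverted index: label -> set of photo ids."""

    def __init__(self):
        # user_id -> label -> {photo_id, ...}
        self._index = defaultdict(lambda: defaultdict(set))

    def add(self, user_id, photo_id, labels):
        """Record the labels inferred for one photo in one user's library."""
        for label in labels:
            self._index[user_id][label.lower()].add(photo_id)

    def search(self, user_id, *labels):
        """Return the sorted photo ids in this user's library matching ALL labels."""
        user_index = self._index.get(user_id, {})
        matches = [user_index.get(label.lower(), set()) for label in labels]
        if not matches:
            return []
        return sorted(set.intersection(*matches))
```

Keying the index by user first keeps every query scoped to a single library, which matches the interview's emphasis on queries that run "within a user's library".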

Key Takeaways
  1. Design a scalable data storage system using Spanner
  2. Implement asynchronous processing for ML-based features
  3. Configure automatic data replication and sharding for low latency
  4. Develop a system for real-time sharing and collaboration
  5. Optimize database queries using SQL features
  6. Reduce manual operations work by relying on Spanner's self-healing features, and debug services with its SQL tooling
💡 Spanner's scalability, ease of use, and self-healing nature make it an ideal choice for large-scale database management and real-time sharing applications.
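
The separation Tracy describes between interactive users and ML or batch work can be sketched as a small router. The mapping used here (interactive traffic to read-write replicas, batch traffic to read-only replicas) is an assumption made for illustration; the interview only says that read-only and read-write shards are used to keep the two use cases apart. `Replica` and `WorkloadRouter` are hypothetical names, not part of any Spanner client library.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class Replica:
    name: str
    read_only: bool


class WorkloadRouter:
    """Round-robin router that keeps batch work off the interactive replicas."""

    def __init__(self, replicas):
        read_write = [r for r in replicas if not r.read_only]
        read_only = [r for r in replicas if r.read_only]
        if not read_write or not read_only:
            raise ValueError("need at least one read-write and one read-only replica")
        self._read_write = cycle(read_write)
        self._read_only = cycle(read_only)

    def pick(self, workload):
        """Return the next replica for an 'interactive' or 'batch' request."""
        if workload == "interactive":
            return next(self._read_write)
        if workload == "batch":
            return next(self._read_only)
        raise ValueError(f"unknown workload: {workload!r}")
```

With this split, a burst of batch requests only rotates through the read-only pool, so the replicas serving interactive users see no extra load from it.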

Chapters (9)

0:00 Intro
0:52 Meet the Google Photos team
1:59 The life of your photo on Google Photos
3:35 How is Google Photos always available?
5:31 How does machine learning work with Google Photos?
8:02 How does Google Photos allow instant sharing globally?
9:43 How does Google Photos secure and protect photos?
12:15 Unprecedented growth
13:01 Wrap up