Privacy Preserving AI - Andrew Trask, OpenMined
Key Takeaways
This video covers the basics of secure and private AI techniques, including federated learning and secure multi-party computation, using tools like PySyft and PyTorch.
Full Transcript
Hello everyone, my name is Andrew, and today we're going to be talking about privacy-preserving AI. Specifically, we're going to be asking the question: is it possible to answer questions using data that we cannot see, using data that we don't have access to?

Let's start with an example. Let's say we wanted to answer the question, what do tumors look like in humans? Well, the answer to this question is pretty complex, so perhaps we'll train a classifier to identify tumors in images. Step one for this would be to acquire a training dataset. But the kind of data we need to answer this kind of question is very personal, and it can be legally complicated to buy and sell it. So we're likely to have to go to a small number of sources and purchase it, and it's likely to be very expensive. If it's very expensive, then we need to find someone to finance our project, so we'll persuade a VC to help back us. If we're going to find someone to finance our project, then that means we need to create a business plan that shows how we're going to pay them back someday. If we're going to create a business plan, then we have to create a business, and we have to find a business partner. And if we find a business partner, then we have to go spam all of our friends on LinkedIn trying to find someone in business who will help us with this project. And all this is because we want to answer the question, what do tumors look like in humans?

But if we wanted to answer a different question, what do handwritten digits look like, well, this would be super simple. We would download a dataset, download a state-of-the-art training script, and run it, and right then and there, within only a few minutes, we would have a state-of-the-art classifier with potentially superhuman ability to identify these kinds of patterns.

So why is there a difference between these two? Well, the big reason is that getting access to private data is really, really hard. There's a lot of friction involved in the process, so much that you basically have to dedicate a portion of your life just to getting access to one particular dataset in order to be able to work with it. As a result, we spend most of our time working on tasks like this, right, tasks where the data is publicly available and easy to access, and we don't spend nearly as much time working on tasks like this.

So raise your hand if you've ever trained a classifier on the MNIST dataset. I expected pretty much everybody, right. So now raise your hand if you've ever trained a classifier to predict dementia, or diabetes. Got one. What's your name? Sandy. Can I get a round of applause for Sandy, for working on diabetes? But as you can see, it's incredibly uncommon, and that's just because it's really, really hard to do. And yet, if we can state it plainly, these tasks represent our friends and family, the things that people are actually suffering from, real human problems. But the issue is that if we want to address real human issues, we have to have data about real humans going through those things, and that is exactly the kind of data that is really, really hard to get access to. So as a result, it's hard to work on these problems.

And that brings me back to the question I asked at the beginning: is it possible to answer questions using data that we cannot see? Because if the answer to this question is yes, then it means it might be possible for us to build infrastructure to make access to private data simpler, perhaps even so simple that we can simply pip install access to the world's scientific data, the same way that we install access to our deep learning frameworks. If the answer to this question is yes, it might actually be possible in the near future for you to wake up on a Saturday morning in your pajamas, roll on down to your kitchen table, flip open your laptop, and train a classifier on data that's living in, say, a hundred different hospitals, and take a classifier from 96 percent accuracy to 96.1 percent accuracy and save a hundred lives before you even have breakfast. Or take a classifier from 96 percent accuracy to 96.1 percent accuracy and help a thousand people be a little less lonely. Work on problems about people.

So in the next few minutes, we're going to talk about a tool built by a community called OpenMined. OpenMined is a community of over 5,000 volunteers who care about this problem enough to spend their nights and weekends trying to make privacy-preserving AI as easy as possible. Specifically, we're going to talk about a tool called PySyft. PySyft extends PyTorch with tools for privacy-preserving machine learning, and it's my hope that by explaining how some of these tools work, and explaining a few of the features that are either already developed or on our near-term roadmap, you'll be able to see just how easy it might become to work with private data if we have infrastructure that knows how to protect it. So let's begin.

The first tool we'll talk about is remote execution. Remote execution, simply stated, is the ability to leverage PyTorch on machines that you don't have direct access to. Let's see what this looks like. The first thing we do is we import syft and we import torch, and then we use this thing called a torch hook, which augments torch with tools for privacy-preserving machine learning. Then we initialize a reference to a remote machine, in this case one belonging to, say, a hospital data center. This allows us to actually interact with PyTorch operations and PyTorch tensors that live on this machine. So in this case, we can send a tensor to this hospital data center, and what gets returned to us is a pointer. This pointer has all the functionality that PyTorch would normally have, but when you interact with it, when you actually use the PyTorch API, instead of executing locally it forwards commands to the remote machine and returns a pointer to the result. The implication of these pointer tensors is that now we can use the normal PyTorch API, things we already know how to do, to orchestrate complex operations across multiple different remote machines. Now, finally, we have this very special command at the bottom here, .get, which requests information from the remote machine to be sent back to me. More on that in a bit.

So now we know how to do processing using PyTorch on a remote machine, and thus we can work with data that we never actually bring onto our own machines. Okay, that's pretty cool, but that opens up another question: how do we actually do good data science when we aren't allowed to see the data? Well, there are a few interesting features we can use for this too. A couple of them are search and sample data. Let's say we have what's called a grid client. A grid client is just a collection of workers. Previously we had a worker pointing to, say, a hospital data center; this might be a whole collection of hospitals. And this gives us features such as search. So let's say I wanted to do some sort of analysis on, say, diabetes. So, perhaps just like you, I can search for a given dataset. I get returned pointers to this remote data, and I get back metadata relating to these pointers that explains the schema, how it was collected, and things about the distribution that are going to help inform me and my data science project, so that I can then put together a dataset that's relevant for my problem, one that could actually be distributed across multiple different locations. And then when I want to do, say, feature engineering and put my model together, in some cases the data owner can even make available sample data, or data that was synthetically generated to be similar to the distribution that we'll be working on. These are the kinds of features that can make it possible for us to use all the same data science techniques that we normally use, on data that we don't actually get to see.

So that's all great, we can do remote feature engineering, but this still has this mysterious .get function. How do we ensure that when we ask for a tensor back from a remote machine, we aren't accidentally also getting back private information? This brings us to a third tool, differential privacy. Has anyone come across this before? Okay, cool, quite a few people, that's great. So I'm going to do a quick overview. Differential privacy, simply stated, is a field, a group of mathematical algorithms, that tries to ensure that statistical analysis does not compromise privacy.

Okay, so let's say we have this database. It's got a bunch of people in it and it's got a single column, and we're going to query this database and look at the output of this query. We're going to ask a very important question: what is the maximum amount that the output of my query, the output of my function, could change if I removed John from the database? If that is zero, then I know that the output of this function is not conditioned on John's information; he's not contributing to this query. If I could prove that for everyone in the database, that by removing them or swapping them with someone else this output wouldn't change, well, then I know that the output of my function doesn't depend on any specific individual. Now, it turns out there aren't that many functions that satisfy this very nice property. But this intuitive notion of what perfect privacy would be like, that we can query a database and get back some kind of result without divulging any private information, is very powerful, and it also lends clarity to what we do when we have a function that is not perfectly private: we add a certain amount of noise to smooth over any potential private information.

How much noise, you might ask? Let's consider an example. I have a twin sister. She works in political science, and often they want to do surveys over very taboo behavior. So they want to understand, say, how many people are committing a certain kind of crime, or maybe how many PyTorch users forget to zero out their gradients before forward propagating. Anyway, something that someone would be inclined to deceive you about if you asked. So let's say I wanted to survey everyone here. I want to ask, okay, how many people in here have jaywalked? Jaywalking is a really big deal in California, apparently. And I was worried that you were going to lie to me. So what I would do is distribute a coin to all of you and say, okay, flip this coin twice, somewhere that I can't see it. I want you to answer truthfully if your first coin flip is a heads, and if your first coin flip is a tails, I want you to answer yes or no according to the second coin flip. This means that roughly half of you would give me an honest answer, and the other half of you would give me a perfect 50/50 distribution, and I don't know which person is in which group. But the powerful thing is that, in expectation, I can recover the true distribution from the result that I get. So let's say 55% of the survey respondents said yes. Then I know that the center of the distribution is actually 60%, which got averaged with a 50/50 coin flip. Does that make sense? So I can back out and get at the aggregate statistic that I'm interested in, without actually knowing any of your private information. And the degree to which these coins are likely to be heads or tails corresponds to the degree of plausible deniability, the degree of privacy, that you have in this setting.

So that's all well and good in theory, and I'll give some more resources for how you can learn more about this at the end. But what is this actually going to look like in PySyft? Let's say we have a pointer to a remote dataset, private information, and I call .get. Whoa, a big error pops up: I'm sorry, you tried to request access to private information, you can't do that. So we have an additional form of .get, which accepts a parameter called epsilon. Epsilon is a means by which we can choose how much of our privacy budget we want to spend. See, the vision that we see happening here is that any given data science project will have a certain privacy budget, which is dependent on the level of trust and the kind of relationship that you have with the data owner. It might be zero, so you can only run algorithms that are guaranteed to leak exactly no information, or the data owner might have a higher degree of trust in an individual, and so they can do more complex queries. This mechanism makes it so that you can track private data all the way through to, say, a trained model or some output of your function, and it will automatically add the appropriate amount of noise to make sure that you stay under your privacy budget.

Cool, all right. So we've conquered a lot of challenges so far. Now we have a formal privacy budgeting mechanism, but we have a couple of outstanding challenges. Whenever I'm doing remote computation, remote analysis, let's say training a model on remote data, I'm sending my model in and doing training, but that means that my model is exposed. If it's a really valuable model, someone could take it. That's not cool. And this brings me to the last tool I want to talk about, secure multi-party computation. Secure MPC, anyone? All right, cool. [inaudible] Secure MPC is the most magical algorithm I have come across since learning about machine learning. I am super excited to tell you about it, it's really cool.

So this is close to the textbook definition, but in the context of machine learning, the implication of this definition is that multiple people can share ownership of a number. Share ownership of a number. Let's see how it works. Let's say I have the number five, and I split it into two shares, a two and a three. Okay, two plus three equals five. I'm hand-waving a little bit just to go quickly, but for the sake of this example, two plus three equals five, that's how we're going to encrypt this. And the interesting thing is, say I have two friends, Mary Ann and Bobby, and I give them these two shares. They are now the shareholders of this number. I disappear, and now this five is encrypted between the two of them. Why is it encrypted? Because neither of them can know what value is actually encrypted between them by looking at their own share; you have to look at both shares to know what the number is. And secondarily, you get shared governance, because the number can only be decrypted if all the shareholders agree to pool their shares. So it's more than just encryption, it's shared control over a digital asset.

And the really amazing part is that while it is encrypted, we can perform computation. Let's say we wanted to multiply the number by two, simple arithmetic. If each person multiplies their share by two, now we have an encrypted ten. And it turns out there's a whole host of protocols in the cryptography community that allow you to do lots of different functions while numbers are in this state, including the functions that we need for deep learning. And this brings me to the real link: models and datasets are just large collections of numbers, which we can individually encrypt, such that models and datasets can have shared ownership and shared governance.

So what does this look like in PyTorch? I have several clients, and instead of calling .send, now I have a method called .share, and I pass in a list of shareholders. What gets returned to me is a pointer, again with the normal PyTorch API, which I can then use, and it will automatically implement the cryptography under the hood. And of course, what we can do with tensors we can also do with models, so encrypted training and encrypted prediction, and this brings us lots of desirable properties useful for privacy-preserving machine learning.

And so this brings us back to the question we started with at the beginning: is it possible to answer questions using data that we cannot see? I believe that it's possible, and here are a few examples of tools I think will get us there. It's not the full list, but I hope that this opens your eyes to the potential that, in the very near future, we could be able to pip install access to the data to solve the most important problems that we face. Oh yeah, and if we do this, we can spend less time working on those problems and more time working on these, which would be great. If you'd like to learn more, check out this awesome course from Udacity, which was sponsored by Facebook. Thank you very much for your attention. Have a good night.
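The remote execution idea from the talk (send data to a machine, get a pointer back, operate through the pointer, and only pull a result home with an explicit get) can be sketched in plain Python. This is a toy under stated assumptions, not PySyft's API: `ToyWorker`, `ToyPointer`, and `send` are hypothetical names invented for illustration, and the "remote machine" is just an in-process object.

```python
import itertools


class ToyWorker:
    """Hypothetical stand-in for a remote machine; it stores objects by id."""

    def __init__(self, name):
        self.name = name
        self._store = {}
        self._ids = itertools.count()

    def put(self, value):
        obj_id = next(self._ids)
        self._store[obj_id] = value
        return obj_id

    def fetch(self, obj_id):
        # Remove the object from the worker and hand it back to the caller.
        return self._store.pop(obj_id)

    def add(self, id_a, id_b):
        # The computation runs "on the worker"; only a new id leaves it.
        a, b = self._store[id_a], self._store[id_b]
        return self.put([x + y for x, y in zip(a, b)])


class ToyPointer:
    """Hypothetical handle to a value that lives on a ToyWorker."""

    def __init__(self, worker, obj_id):
        self.worker = worker
        self.obj_id = obj_id

    def __add__(self, other):
        # Forward the operation to the worker and wrap the result id.
        new_id = self.worker.add(self.obj_id, other.obj_id)
        return ToyPointer(self.worker, new_id)

    def get(self):
        return self.worker.fetch(self.obj_id)


def send(values, worker):
    """Copy the values to the worker and return a pointer to them."""
    return ToyPointer(worker, worker.put(list(values)))


hospital = ToyWorker("hospital_datacenter")
x = send([1, 2, 3], hospital)
y = x + x          # executed on the worker; y is another pointer
result = y.get()   # explicit request to bring the value back
print(result)
```

The design point is that `y` never holds data locally; everything stays on the worker until `get` is called, which is the step the talk later guards with a privacy budget.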
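The coin-flip survey in the talk can be checked with a short simulation. Half of the respondents answer honestly and half answer at random, so the observed yes-rate is 0.5 × true + 0.5 × 0.5, and the true rate is recovered as 2 × observed − 0.5 (55% observed gives 60%). The function names, sample size, and seed below are illustrative assumptions.

```python
import random


def randomized_response(truth, rng):
    """One respondent's answer under the two-coin protocol."""
    if rng.random() < 0.5:       # first flip is heads: answer honestly
        return truth
    return rng.random() < 0.5    # first flip is tails: second flip decides


def estimate_true_rate(observed_yes_rate):
    """Invert observed = 0.5 * true + 0.5 * 0.5."""
    return 2 * observed_yes_rate - 0.5


rng = random.Random(0)           # fixed seed so the run is repeatable
true_rate = 0.60                 # assumed ground truth for the simulation
population = [rng.random() < true_rate for _ in range(100_000)]
answers = [randomized_response(t, rng) for t in population]

observed = sum(answers) / len(answers)
estimate = estimate_true_rate(observed)
print(round(observed, 3), round(estimate, 3))
```

No single answer reveals whether that respondent jaywalked, yet the aggregate estimate lands close to the true rate; making the first coin less fair trades accuracy against deniability.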
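The talk describes a get call that takes an epsilon and spends part of a privacy budget, but it does not say which noise mechanism is used. As a hedged illustration, the sketch below uses the Laplace mechanism, one standard choice for counting queries, together with a hypothetical `PrivacyBudget` tracker. None of these names come from PySyft, and the data is made up.

```python
import random


class PrivacyBudget:
    """Hypothetical tracker: refuses queries once epsilon is used up."""

    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        if epsilon <= 0 or epsilon > self.remaining:
            raise PermissionError("privacy budget exceeded")
        self.remaining -= epsilon


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(rows, predicate, epsilon, budget, rng):
    """Noisy counting query; one person changes the count by at most 1."""
    budget.spend(epsilon)
    sensitivity = 1.0
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
budget = PrivacyBudget(total_epsilon=1.0)
ages = [34, 51, 67, 45, 72, 29, 58]   # made-up records

noisy = private_count(ages, lambda a: a >= 50, 0.5, budget, rng)
print(noisy, budget.remaining)
```

The sensitivity of 1 is the answer to the talk's question of how much the output could change if one person were removed; a smaller epsilon means more noise and a slower drain on the budget.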
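The "split five into two and three" example is, as the speaker says, hand-waved. In additive secret sharing the shares are normally drawn at random modulo a large number so that a single share reveals nothing about the secret. The sketch below assumes a modulus `Q` and helper names chosen purely for illustration; it is not the protocol implementation used by any particular library.

```python
import random

Q = 2**61 - 1  # assumed modulus; all share arithmetic is done mod Q


def share(secret, n_parties, rng):
    """Split `secret` into n random-looking shares that sum to it mod Q."""
    shares = [rng.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares


def reconstruct(shares):
    """Only possible when every shareholder contributes a share."""
    return sum(shares) % Q


def multiply_by_public(shares, k):
    # Each holder scales its own share locally; no communication needed.
    return [(s * k) % Q for s in shares]


def add_shared(shares_a, shares_b):
    # Each holder adds the two shares it holds; the sums are shares of a + b.
    return [(a + b) % Q for a, b in zip(shares_a, shares_b)]


rng = random.Random(0)
five = share(5, 2, rng)               # e.g. one share each for two holders
ten = multiply_by_public(five, 2)
print(reconstruct(five), reconstruct(ten))
```

Multiplying two shared values together needs an extra protocol with communication between the parties, which is the part the talk attributes to the wider cryptography literature.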
Original Description
Learn the basics of secure and private AI techniques, including federated learning and secure multi-party computation. In this talk, Andrew Trask of OpenMined highlights the importance of privacy preserving machine learning, and how to use privacy-focused tools like PySyft.