Statistical Preconditioning for Distributed Optimization | JRC Workshop 2021
Skills:
ML Maths Basics90%LLM Foundations80%Supervised Learning80%LLM Engineering70%Fine-tuning LLMs60%
Key Takeaways
The video discusses statistical preconditioning for distributed optimization, covering topics such as mirror descent, Bregman divergence, and relative condition number, with applications in machine learning and AI.
Full Transcript
hello everyone this is adrian hendricks and today i'll present you our work statistical preconditioning for distributed optimization this is joint work with ling chao and sebastian beck from microsoft research and francis back and lauren masuriere who are my phd advisors at inria paris so in this presentation we'll consider a distributed empirical risk minimization problem of the following form so the objective capital f can be written as an average of local objectives fg where fj is the local function of node j and you can think of it for instance as the loss function on the local data set of node g so this arrays is for instance when the data set is too big to be held on only one machine which happens a lot in online advertising or for image data sets which have just huge data sets and this is also the case when the data is inherently distributed and should not be for instance think of medical data held by hospitals or just because in general it's quite expensive to move or shuffle data so we adopt a centralized communication model um so in this model we have a server that has the parameter xk and then the server broadcast xk to all the workers the workers they locally compute their local gradients uh premium f5 of xk and then they send it back to the server the server aggregates all the local gradients uses them to compute the four gradients and then updates the parameter xk with this so this is a standard parallel grid in a sense and each iteration requires back and forth communication between the server and all workers so we'll take the following high level approach um our goal is to design a first order centralized algorithm with a low communication complexity so we want the we want it to be as few communication rounds as possible in order to achieve this i will assume that the server locally holds a function fn which approximates the global objective in some sense that we'll see later and the idea is to use this local function fn to change the iterations compared to grainy sense and thus reduce the iteration complexity and how we do this is using a mirror descent and appropriate pregnant life objects so our presentation will go as follows we first use the framework of red smoothness and strong convexity to formalize this notion of how well fn approximates f and show that this is captured by a new quantity the relative condition number kappa f of phi then we use concentration inequalities to give sharp bounds on kappa for the file depending on the size of the server's datasets and finally we'll briefly discuss an accelerated plane so the brain center iterations as you have seen um can be written in the following way so xt plus one is obtained uh by xt minus step size times improvement this can be written in an alternative form and so it's a minimization problem in which we wish to minimize as the dot product between our new parameters and the gradients under the condition that um the new parameter isn't too far from the current point x t so we penalize this problem uh by euclidean distance but actually we could use other uh kind of distances so not necessarily euclidean and if you plug in a bregman live vernon's term here then you obtain new dissents and the dragon divergence relative to f to phi between x and x t is given by the following equation and the important thing is for instance that's when phi is the squared norm then this reduces to greater sense because the gradient divergence reduces to the euclidean distance so what are the convergence properties of mirror distance well is if capital f is sigma f of 5 strongly convex and uh lfo5 smooth relatively to function 5 and so this relative regularity is just that uh the hessian of capital f is lower bounded by sigma times the fashion of phi and upper bounded by l times the hashing of uh so instead of having your hessian which is low down and up bounded just by the identity it's by another matrix and the hashem have another function and so in this case um the mirrored sign converges in kappa f over phi log 1 over epsilon iterations to minimize capital f at precision epsilon where kappa is this relative condition number so that smoothness divided by red strong complexity so compared to gradient sense uh what's interesting is that the kappa f in the bound is replaced by a kappa f alpha and this is interesting because kappa f of i can be much smaller so now the main question that we ask ourselves is uh what's the best phi for a distributed use case and what's the corresponding rate of condition number and the second question is how to accelerate this to obtain a square root kappa for five seconds while following the daily algorithm introduced by xiaomi retal we choose the function phi as fm so the local objective of the server plus a small quadratic term in the quadratic case we show that the relative condition number uh can actually be of order the usual condition number divided by n where n is the size of the server's datasets so this is huge because it means that even if you have a rather small local data set of the server then you can already achieve very good speed up and this just scales very well in the sample size at also in the case of generalized linear models such as logistic regression it's much harder because you don't concentrate matrices you need a concentration uniform concentration of hatchings yet you can still obtain that the relative condition number is of kappa f of square root n which is improvable to kappa f over and in some cases so in some cases you recover actually the convergence prices from your quadratic case what's important here is that our bounds improve the existing ones that are screwed and factor as in the cooler case and they hold outside of the pure product case and we show that the quadratic performance can actually be recorded in some but in quite large number of cases so now i'll just skip over the details of acceleration uh because i don't have time in this presentation and just present you quickly some experiments so this experiment these experiments were done the kdd2010 datasets which has 8 million samples in dimension 20 million we showed we chose the logistic regression objective with a very small organization parameter 13 minus seven and we compare three algorithms so uh day which is uh basically mirror descents uh with five chosen as i showed you a spag which is our accelerated version of thing and uh heavy bolding which is uh so spag is encashed lines and heavy bolding [Music] got dotted line which is another acceleration and uh so we benchmark this algorithm with three three different sizes for the local data sets in orange you have a thousand samples and login in the server in blue you have ten thousand samples and in red you have a hundred thousand samples so even though a hundred thousand samples looks big it's just one percent of the total injection so this really isn't much and in this case you see that um you can converge in a roughly 15 iteration to very very high precisions with accelerated their descent regardless of the method i mean we're slightly faster but it's not that fast but really uh the key thing is that when you have a huge local data set then you go fast and then even when the local data set isn't so big uh so for the blue curve you have only um you have a very small account date slope of 10 to the four so it's 1 000 times smaller than the big data sets and yet thanks to acceleration we achieve a very fast convergence uh to the optimum and in particular a much faster convergence than without the statistical condition and we see that uh when you have only 1000 samples then you still achieve good performances but of course the algorithm becomes kind of so so our contributions are the following and we presented the principal convergence theory based on mere descent we show improved balance on the relative condition number and we have a theory for a synthetic acceleration but we showed that it's actually quite effective in practice uh so there are some questions and one is the tightness of the concentration results and the other one is whether it's possible to prove non-asymptotic acceleration in some specific settings because it's actually impossible to prove it in general that's a negative result so i'm looking forward to hearing questions thank you for listening
Original Description
Artificial Intelligence (AI)
20 May 2021
Speaker: Hadrien Hendrikx, INRIA
(collaboration with Francis Bach, Laurent Massoulié, INRIA and Sébastien Bubeck, Microsoft)
This virtual event brought together the PhD students and postdocs working on collaborative research engagements with Microsoft via the Swiss Joint Research Center, Mixed Reality & AI Zurich Lab, Mixed Reality & AI Cambridge Lab, Inria Joint Center, their academic and Microsoft supervisors as well as the wider research community. The event continued in the tradition of the annual Swiss JRC Workshops. PhD students and postdocs presented project updates and discussed their research with their supervisors and other attendants. In addition, Microsoft speakers provided updates on relevant Microsoft projects and initiatives. There were four event sessions according to research themes: Computer Vision, Systems, and AI
Learn more about the Joint Research Center Workshop 2021: https://www.microsoft.com/en-us/research/event/joint-research-centre-workshop-2021/
Watch on YouTube ↗
(saves to browser)
Sign in to unlock AI tutor explanation · ⚡30
Playlist
Uploads from Microsoft Research · Microsoft Research · 0 of 60
← Previous
Next →
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
Frontiers in ML: Learning from Limited Labeled Data: Challenges and Opportunities for NLP
Microsoft Research
Frontiers in Machine Learning: Climate Impact of Machine Learning
Microsoft Research
Frontiers in Machine Learning: Security and Machine Learning
Microsoft Research
Hope Speech and Help Speech: Surfacing Positivity Amidst Hate
Microsoft Research
Early Indicators of the Effect of the Global Shift to Remote Work on People with Disabilities
Microsoft Research
Remote Work and Well-Being
Microsoft Research
Challenges and Gratitude of Software Developers During COVID-19 Working From Home
Microsoft Research
Towards a Practical Virtual Office for Mobile Knowledge Workers
Microsoft Research
Impact of COVID-19 crisis on the future of work in India
Microsoft Research
Empowering and Supporting Remote Software Development Team Members through a Culture of Allyship
Microsoft Research
How Work From Home Affects Collaboration: Information Workers in a Natural Experiment During COVID19
Microsoft Research
Phong Surface: Efficient 3D Model Fitting using Lifted Optimization
Microsoft Research
Managing Tasks Across the Work-Life Boundary: Opportunities, Challenges, and Directions
Microsoft Research
Microsoft Urban Futures Summer Workshop | Data Driven Urban Transformation [Day 1]
Microsoft Research
Microsoft Urban Futures Summer Workshop | Sensors and Data [Day 2]
Microsoft Research
Microsoft Urban Futures Summer Workshop | Policy and Social Impact [Day 3]
Microsoft Research
Directions in ML: Algorithmic foundations of neural architecture search
Microsoft Research
MineRL Competition 2020
Microsoft Research
Can we make better software by using ML and AI techniques? With Chandra Maddila and Chetan Bansal
Microsoft Research
From Paper to Product
Microsoft Research
SkinnerDB: Regret Bounded Query Evaluation using RL
Microsoft Research
From SqueezeNet to SqueezeBERT: Developing Efficient Deep Neural Networks
Microsoft Research
Programming with Proofs for High-assurance Software
Microsoft Research
Platform for Situated Intelligence Overview
Microsoft Research
Directional Sources & Listeners in Interactive Sound Propagation using Reciprocal Wave Field Coding
Microsoft Research
Galactic Bell Star Music Demo
Microsoft Research
Importing Animations in Microsoft Expressive Pixels (9 of 9)
Microsoft Research
Welcome to Microsoft Expressive Pixels (1 of 9)
Microsoft Research
Getting Started with Microsoft Expressive Pixels (2 of 9)
Microsoft Research
Creating an Image in Microsoft Expressive Pixels (3 of 9)
Microsoft Research
Creating Animations in Microsoft Expressive Pixels (4 of 9)
Microsoft Research
Managing Animation Galleries in Microsoft Expressive Pixels (5 of 9)
Microsoft Research
Creating Fragments in Microsoft Expressive Pixels (6 of 9)
Microsoft Research
Using Layers in Microsoft Expressive Pixels (7 of 9)
Microsoft Research
Exporting Animations with Microsoft Expressive Pixels (8 of 9)
Microsoft Research
What Kind of Computation is Human Cognition? A Brief History of Thought (Episode 2/2)
Microsoft Research
What Kind of Computation is Human Cognition? A Brief History of Thought (Episode 1/2)
Microsoft Research
Planeverb: Interactive sound propagation for dynamic scenes using 2D wave simulation
Microsoft Research
Making cryptography accessible, efficient, and scalable with Dr. Divya Gupta and Dr. Rahul Sharma
Microsoft Research
Beyond the mega-data center: networking multi-data center regions (SIGCOMM 2020 Talk)
Microsoft Research
Optics for the cloud – Light at the end of the tunnel? (SIGCOMM 2020 Workshop)
Microsoft Research
Beyond the mega-data center: networking multi-data center regions (SIGCOMM 2020 short talk)
Microsoft Research
Sirius: A Flat Datacenter Network with Nanosecond Optical Switching (SIGCOMM 2020 short talk)
Microsoft Research
Novel Image Captioning
Microsoft Research
Forest Sound Scene Simulation and Bird Localization with Distributed Microphone Arrays
Microsoft Research
Decoding Music Attention from “EEG headphones”: a User-friendly Auditory Brain-computer Interface
Microsoft Research
How does holographic storage work?
Microsoft Research
The physics of hologram formation in iron doped lithium niobate
Microsoft Research
Introduction to coax: A Modular RL Package
Microsoft Research
Directions in ML: "Neural architecture search: Coming of age"
Microsoft Research
Microsoft Research AI Breakthroughs 2020: 20 minute research talks + Q&A panel
Microsoft Research
Fireside Chat with Johannes Gehrke during Microsoft Research AI Breakthroughs 2020
Microsoft Research
Fireside Chat with Susan Dumais during Microsoft Research AI Breakthroughs 2020
Microsoft Research
Microsoft Research AI Breakthroughs 2020: 20 minute research talks, Q&A panel, and event wrap-up
Microsoft Research
Clinical Research with FHIR
Microsoft Research
Soundscape Street Preview
Microsoft Research
Tilt-Responsive Techniques for Digital Drawing Boards
Microsoft Research
SurfaceFleet: Exploring Distributed Interactions Unbounded from Device, Application, User, and Time
Microsoft Research
Haptic PIVOT: On-Demand Handhelds in VR
Microsoft Research
SurfaceFleet Supplemental Video Demonstration (UIST 2020)
Microsoft Research
More on: ML Maths Basics
View skill →Related Reads
📰
📰
📰
📰
Is the AI bubble about to burst? A data scientist’s honest take
Medium · AI
Is the AI bubble about to burst? A data scientist’s honest take
Medium · Machine Learning
Is the AI bubble about to burst? A data scientist’s honest take
Medium · Data Science
Is the AI bubble about to burst? A data scientist’s honest take
Medium · Startup
🎓
Tutor Explanation
DeepCamp AI