Statistical Preconditioning for Distributed Optimization | JRC Workshop 2021

Microsoft Research · Intermediate ·📰 AI News & Updates ·5y ago

Key Takeaways

The video discusses statistical preconditioning for distributed optimization, covering topics such as mirror descent, Bregman divergence, and relative condition number, with applications in machine learning and AI.

Full Transcript

hello everyone this is adrian hendricks and today i'll present you our work statistical preconditioning for distributed optimization this is joint work with ling chao and sebastian beck from microsoft research and francis back and lauren masuriere who are my phd advisors at inria paris so in this presentation we'll consider a distributed empirical risk minimization problem of the following form so the objective capital f can be written as an average of local objectives fg where fj is the local function of node j and you can think of it for instance as the loss function on the local data set of node g so this arrays is for instance when the data set is too big to be held on only one machine which happens a lot in online advertising or for image data sets which have just huge data sets and this is also the case when the data is inherently distributed and should not be for instance think of medical data held by hospitals or just because in general it's quite expensive to move or shuffle data so we adopt a centralized communication model um so in this model we have a server that has the parameter xk and then the server broadcast xk to all the workers the workers they locally compute their local gradients uh premium f5 of xk and then they send it back to the server the server aggregates all the local gradients uses them to compute the four gradients and then updates the parameter xk with this so this is a standard parallel grid in a sense and each iteration requires back and forth communication between the server and all workers so we'll take the following high level approach um our goal is to design a first order centralized algorithm with a low communication complexity so we want the we want it to be as few communication rounds as possible in order to achieve this i will assume that the server locally holds a function fn which approximates the global objective in some sense that we'll see later and the idea is to use this local function fn to change the iterations compared to grainy sense and thus reduce the iteration complexity and how we do this is using a mirror descent and appropriate pregnant life objects so our presentation will go as follows we first use the framework of red smoothness and strong convexity to formalize this notion of how well fn approximates f and show that this is captured by a new quantity the relative condition number kappa f of phi then we use concentration inequalities to give sharp bounds on kappa for the file depending on the size of the server's datasets and finally we'll briefly discuss an accelerated plane so the brain center iterations as you have seen um can be written in the following way so xt plus one is obtained uh by xt minus step size times improvement this can be written in an alternative form and so it's a minimization problem in which we wish to minimize as the dot product between our new parameters and the gradients under the condition that um the new parameter isn't too far from the current point x t so we penalize this problem uh by euclidean distance but actually we could use other uh kind of distances so not necessarily euclidean and if you plug in a bregman live vernon's term here then you obtain new dissents and the dragon divergence relative to f to phi between x and x t is given by the following equation and the important thing is for instance that's when phi is the squared norm then this reduces to greater sense because the gradient divergence reduces to the euclidean distance so what are the convergence properties of mirror distance well is if capital f is sigma f of 5 strongly convex and uh lfo5 smooth relatively to function 5 and so this relative regularity is just that uh the hessian of capital f is lower bounded by sigma times the fashion of phi and upper bounded by l times the hashing of uh so instead of having your hessian which is low down and up bounded just by the identity it's by another matrix and the hashem have another function and so in this case um the mirrored sign converges in kappa f over phi log 1 over epsilon iterations to minimize capital f at precision epsilon where kappa is this relative condition number so that smoothness divided by red strong complexity so compared to gradient sense uh what's interesting is that the kappa f in the bound is replaced by a kappa f alpha and this is interesting because kappa f of i can be much smaller so now the main question that we ask ourselves is uh what's the best phi for a distributed use case and what's the corresponding rate of condition number and the second question is how to accelerate this to obtain a square root kappa for five seconds while following the daily algorithm introduced by xiaomi retal we choose the function phi as fm so the local objective of the server plus a small quadratic term in the quadratic case we show that the relative condition number uh can actually be of order the usual condition number divided by n where n is the size of the server's datasets so this is huge because it means that even if you have a rather small local data set of the server then you can already achieve very good speed up and this just scales very well in the sample size at also in the case of generalized linear models such as logistic regression it's much harder because you don't concentrate matrices you need a concentration uniform concentration of hatchings yet you can still obtain that the relative condition number is of kappa f of square root n which is improvable to kappa f over and in some cases so in some cases you recover actually the convergence prices from your quadratic case what's important here is that our bounds improve the existing ones that are screwed and factor as in the cooler case and they hold outside of the pure product case and we show that the quadratic performance can actually be recorded in some but in quite large number of cases so now i'll just skip over the details of acceleration uh because i don't have time in this presentation and just present you quickly some experiments so this experiment these experiments were done the kdd2010 datasets which has 8 million samples in dimension 20 million we showed we chose the logistic regression objective with a very small organization parameter 13 minus seven and we compare three algorithms so uh day which is uh basically mirror descents uh with five chosen as i showed you a spag which is our accelerated version of thing and uh heavy bolding which is uh so spag is encashed lines and heavy bolding [Music] got dotted line which is another acceleration and uh so we benchmark this algorithm with three three different sizes for the local data sets in orange you have a thousand samples and login in the server in blue you have ten thousand samples and in red you have a hundred thousand samples so even though a hundred thousand samples looks big it's just one percent of the total injection so this really isn't much and in this case you see that um you can converge in a roughly 15 iteration to very very high precisions with accelerated their descent regardless of the method i mean we're slightly faster but it's not that fast but really uh the key thing is that when you have a huge local data set then you go fast and then even when the local data set isn't so big uh so for the blue curve you have only um you have a very small account date slope of 10 to the four so it's 1 000 times smaller than the big data sets and yet thanks to acceleration we achieve a very fast convergence uh to the optimum and in particular a much faster convergence than without the statistical condition and we see that uh when you have only 1000 samples then you still achieve good performances but of course the algorithm becomes kind of so so our contributions are the following and we presented the principal convergence theory based on mere descent we show improved balance on the relative condition number and we have a theory for a synthetic acceleration but we showed that it's actually quite effective in practice uh so there are some questions and one is the tightness of the concentration results and the other one is whether it's possible to prove non-asymptotic acceleration in some specific settings because it's actually impossible to prove it in general that's a negative result so i'm looking forward to hearing questions thank you for listening

Original Description

Artificial Intelligence (AI) 20 May 2021 Speaker: Hadrien Hendrikx, INRIA (collaboration with Francis Bach, Laurent Massoulié, INRIA and Sébastien Bubeck, Microsoft) This virtual event brought together the PhD students and postdocs working on collaborative research engagements with Microsoft via the Swiss Joint Research Center, Mixed Reality & AI Zurich Lab, Mixed Reality & AI Cambridge Lab, Inria Joint Center, their academic and Microsoft supervisors as well as the wider research community. The event continued in the tradition of the annual Swiss JRC Workshops. PhD students and postdocs presented project updates and discussed their research with their supervisors and other attendants. In addition, Microsoft speakers provided updates on relevant Microsoft projects and initiatives. There were four event sessions according to research themes: Computer Vision, Systems, and AI Learn more about the Joint Research Center Workshop 2021: https://www.microsoft.com/en-us/research/event/joint-research-centre-workshop-2021/
Watch on YouTube ↗ (saves to browser)
Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Microsoft Research · Microsoft Research · 0 of 60

← Previous Next →
1 Frontiers in ML: Learning from Limited Labeled Data: Challenges and Opportunities for NLP
Frontiers in ML: Learning from Limited Labeled Data: Challenges and Opportunities for NLP
Microsoft Research
2 Frontiers in Machine Learning: Climate Impact of Machine Learning
Frontiers in Machine Learning: Climate Impact of Machine Learning
Microsoft Research
3 Frontiers in Machine Learning: Security and Machine Learning
Frontiers in Machine Learning: Security and Machine Learning
Microsoft Research
4 Hope Speech and Help Speech: Surfacing Positivity Amidst Hate
Hope Speech and Help Speech: Surfacing Positivity Amidst Hate
Microsoft Research
5 Early Indicators of the Effect of the Global Shift to Remote Work on People with Disabilities
Early Indicators of the Effect of the Global Shift to Remote Work on People with Disabilities
Microsoft Research
6 Remote Work and Well-Being
Remote Work and Well-Being
Microsoft Research
7 Challenges and Gratitude of Software Developers During COVID-19 Working From Home
Challenges and Gratitude of Software Developers During COVID-19 Working From Home
Microsoft Research
8 Towards a Practical Virtual Office for Mobile Knowledge Workers
Towards a Practical Virtual Office for Mobile Knowledge Workers
Microsoft Research
9 Impact of COVID-19 crisis on the future of work in India
Impact of COVID-19 crisis on the future of work in India
Microsoft Research
10 Empowering and Supporting Remote Software Development Team Members through a Culture of Allyship
Empowering and Supporting Remote Software Development Team Members through a Culture of Allyship
Microsoft Research
11 How Work From Home Affects Collaboration: Information Workers in a Natural Experiment During COVID19
How Work From Home Affects Collaboration: Information Workers in a Natural Experiment During COVID19
Microsoft Research
12 Phong Surface: Efficient 3D Model Fitting using Lifted Optimization
Phong Surface: Efficient 3D Model Fitting using Lifted Optimization
Microsoft Research
13 Managing Tasks Across the Work-Life Boundary: Opportunities, Challenges, and Directions
Managing Tasks Across the Work-Life Boundary: Opportunities, Challenges, and Directions
Microsoft Research
14 Microsoft Urban Futures Summer Workshop | Data Driven Urban Transformation [Day 1]
Microsoft Urban Futures Summer Workshop | Data Driven Urban Transformation [Day 1]
Microsoft Research
15 Microsoft Urban Futures Summer Workshop | Sensors and Data [Day 2]
Microsoft Urban Futures Summer Workshop | Sensors and Data [Day 2]
Microsoft Research
16 Microsoft Urban Futures Summer Workshop | Policy and Social Impact [Day 3]
Microsoft Urban Futures Summer Workshop | Policy and Social Impact [Day 3]
Microsoft Research
17 Directions in ML: Algorithmic foundations of neural architecture search
Directions in ML: Algorithmic foundations of neural architecture search
Microsoft Research
18 MineRL Competition 2020
MineRL Competition 2020
Microsoft Research
19 Can we make better software by using ML and AI techniques? With Chandra Maddila and Chetan Bansal
Can we make better software by using ML and AI techniques? With Chandra Maddila and Chetan Bansal
Microsoft Research
20 From Paper to Product
From Paper to Product
Microsoft Research
21 SkinnerDB: Regret Bounded Query Evaluation using RL
SkinnerDB: Regret Bounded Query Evaluation using RL
Microsoft Research
22 From SqueezeNet to SqueezeBERT: Developing Efficient Deep Neural Networks
From SqueezeNet to SqueezeBERT: Developing Efficient Deep Neural Networks
Microsoft Research
23 Programming with Proofs for High-assurance Software
Programming with Proofs for High-assurance Software
Microsoft Research
24 Platform for Situated Intelligence Overview
Platform for Situated Intelligence Overview
Microsoft Research
25 Directional Sources & Listeners in Interactive Sound Propagation using Reciprocal Wave Field Coding
Directional Sources & Listeners in Interactive Sound Propagation using Reciprocal Wave Field Coding
Microsoft Research
26 Galactic Bell Star Music Demo
Galactic Bell Star Music Demo
Microsoft Research
27 Importing Animations in Microsoft Expressive Pixels (9 of 9)
Importing Animations in Microsoft Expressive Pixels (9 of 9)
Microsoft Research
28 Welcome to Microsoft Expressive Pixels (1 of 9)
Welcome to Microsoft Expressive Pixels (1 of 9)
Microsoft Research
29 Getting Started with Microsoft Expressive Pixels (2 of 9)
Getting Started with Microsoft Expressive Pixels (2 of 9)
Microsoft Research
30 Creating an Image in Microsoft Expressive Pixels (3 of 9)
Creating an Image in Microsoft Expressive Pixels (3 of 9)
Microsoft Research
31 Creating Animations in Microsoft Expressive Pixels (4 of 9)
Creating Animations in Microsoft Expressive Pixels (4 of 9)
Microsoft Research
32 Managing Animation Galleries in Microsoft Expressive Pixels (5 of 9)
Managing Animation Galleries in Microsoft Expressive Pixels (5 of 9)
Microsoft Research
33 Creating Fragments in Microsoft Expressive Pixels (6 of 9)
Creating Fragments in Microsoft Expressive Pixels (6 of 9)
Microsoft Research
34 Using Layers in Microsoft Expressive Pixels (7 of 9)
Using Layers in Microsoft Expressive Pixels (7 of 9)
Microsoft Research
35 Exporting Animations with Microsoft Expressive Pixels (8 of 9)
Exporting Animations with Microsoft Expressive Pixels (8 of 9)
Microsoft Research
36 What Kind of Computation is Human Cognition? A Brief History of Thought (Episode 2/2)
What Kind of Computation is Human Cognition? A Brief History of Thought (Episode 2/2)
Microsoft Research
37 What Kind of Computation is Human Cognition? A Brief History of Thought (Episode 1/2)
What Kind of Computation is Human Cognition? A Brief History of Thought (Episode 1/2)
Microsoft Research
38 Planeverb: Interactive sound propagation for dynamic scenes using 2D wave simulation
Planeverb: Interactive sound propagation for dynamic scenes using 2D wave simulation
Microsoft Research
39 Making cryptography accessible, efficient, and scalable with Dr. Divya Gupta and Dr. Rahul Sharma
Making cryptography accessible, efficient, and scalable with Dr. Divya Gupta and Dr. Rahul Sharma
Microsoft Research
40 Beyond the mega-data center: networking multi-data center regions (SIGCOMM 2020 Talk)
Beyond the mega-data center: networking multi-data center regions (SIGCOMM 2020 Talk)
Microsoft Research
41 Optics for the cloud – Light at the end of the tunnel? (SIGCOMM 2020 Workshop)
Optics for the cloud – Light at the end of the tunnel? (SIGCOMM 2020 Workshop)
Microsoft Research
42 Beyond the mega-data center: networking multi-data center regions (SIGCOMM 2020 short talk)
Beyond the mega-data center: networking multi-data center regions (SIGCOMM 2020 short talk)
Microsoft Research
43 Sirius: A Flat Datacenter Network with Nanosecond Optical Switching (SIGCOMM 2020 short talk)
Sirius: A Flat Datacenter Network with Nanosecond Optical Switching (SIGCOMM 2020 short talk)
Microsoft Research
44 Novel Image Captioning
Novel Image Captioning
Microsoft Research
45 Forest Sound Scene Simulation and Bird Localization with Distributed Microphone Arrays
Forest Sound Scene Simulation and Bird Localization with Distributed Microphone Arrays
Microsoft Research
46 Decoding Music Attention from “EEG headphones”: a User-friendly Auditory Brain-computer Interface
Decoding Music Attention from “EEG headphones”: a User-friendly Auditory Brain-computer Interface
Microsoft Research
47 How does holographic storage work?
How does holographic storage work?
Microsoft Research
48 The physics of hologram formation in iron doped lithium niobate
The physics of hologram formation in iron doped lithium niobate
Microsoft Research
49 Introduction to coax: A Modular RL Package
Introduction to coax: A Modular RL Package
Microsoft Research
50 Directions in ML: "Neural architecture search: Coming of age"
Directions in ML: "Neural architecture search: Coming of age"
Microsoft Research
51 Microsoft Research AI Breakthroughs 2020: 20 minute research talks + Q&A panel
Microsoft Research AI Breakthroughs 2020: 20 minute research talks + Q&A panel
Microsoft Research
52 Fireside Chat with Johannes Gehrke during Microsoft Research AI Breakthroughs 2020
Fireside Chat with Johannes Gehrke during Microsoft Research AI Breakthroughs 2020
Microsoft Research
53 Fireside Chat with Susan Dumais during Microsoft Research AI Breakthroughs 2020
Fireside Chat with Susan Dumais during Microsoft Research AI Breakthroughs 2020
Microsoft Research
54 Microsoft Research AI Breakthroughs 2020: 20 minute research talks, Q&A panel, and event wrap-up
Microsoft Research AI Breakthroughs 2020: 20 minute research talks, Q&A panel, and event wrap-up
Microsoft Research
55 Clinical Research with FHIR
Clinical Research with FHIR
Microsoft Research
56 Soundscape Street Preview
Soundscape Street Preview
Microsoft Research
57 Tilt-Responsive Techniques for Digital Drawing Boards
Tilt-Responsive Techniques for Digital Drawing Boards
Microsoft Research
58 SurfaceFleet: Exploring Distributed Interactions Unbounded from Device, Application, User, and Time
SurfaceFleet: Exploring Distributed Interactions Unbounded from Device, Application, User, and Time
Microsoft Research
59 Haptic PIVOT: On-Demand Handhelds in VR
Haptic PIVOT: On-Demand Handhelds in VR
Microsoft Research
60 SurfaceFleet Supplemental Video Demonstration (UIST 2020)
SurfaceFleet Supplemental Video Demonstration (UIST 2020)
Microsoft Research

This video teaches the application of statistical preconditioning to distributed optimization problems, covering key concepts such as mirror descent, Bregman divergence, and relative condition number. The speaker discusses the benefits of statistical preconditioning, including improved convergence rates and quadratic performance. By watching this video, viewers can gain a deeper understanding of distributed optimization and statistical preconditioning, and learn how to apply these concepts to re

Key Takeaways
  1. Use mirror descent and Bregman divergence to optimize distributed optimization problems
  2. Minimize the dot product between new parameters and gradients
  3. Penalize the problem by Euclidean distance
  4. Use concentration inequalities for sharp bounds on kappa f
  5. Evaluate the effectiveness of statistical preconditioning in various scenarios
💡 Statistical preconditioning can achieve fast convergence to the optimum in distributed optimization even with small data sets, and can improve the relative condition number.

Related Reads

📰
Is the AI bubble about to burst? A data scientist’s honest take
A data scientist shares their honest take on whether the AI bubble is about to burst, providing an informed perspective on the technology's potential and limitations
Medium · AI
📰
Is the AI bubble about to burst? A data scientist’s honest take
A data scientist shares their honest take on whether the AI bubble is about to burst, providing a grounded perspective on the technology
Medium · Machine Learning
📰
Is the AI bubble about to burst? A data scientist’s honest take
A data scientist shares their honest take on whether the AI bubble is about to burst, providing a grounded perspective on the technology
Medium · Data Science
📰
Is the AI bubble about to burst? A data scientist’s honest take
A data scientist shares their honest take on whether the AI bubble is about to burst, providing a grounded perspective on the technology
Medium · Startup
Up next
Tackling Malaria in Africa with Technology at the Huawei ICT Competition
Huawei
Watch →