Statistical Preconditioning for Distributed Optimization | JRC Workshop 2021

Microsoft Research · Intermediate ·📰 AI News & Updates ·5y ago

Skills: ML Maths Basics90%LLM Foundations80%Supervised Learning80%LLM Engineering70%Fine-tuning LLMs60%

Key Takeaways

The video discusses statistical preconditioning for distributed optimization, covering topics such as mirror descent, Bregman divergence, and relative condition number, with applications in machine learning and AI.

Full Transcript

hello everyone this is adrian hendricks and today i'll present you our work statistical preconditioning for distributed optimization this is joint work with ling chao and sebastian beck from microsoft research and francis back and lauren masuriere who are my phd advisors at inria paris so in this presentation we'll consider a distributed empirical risk minimization problem of the following form so the objective capital f can be written as an average of local objectives fg where fj is the local function of node j and you can think of it for instance as the loss function on the local data set of node g so this arrays is for instance when the data set is too big to be held on only one machine which happens a lot in online advertising or for image data sets which have just huge data sets and this is also the case when the data is inherently distributed and should not be for instance think of medical data held by hospitals or just because in general it's quite expensive to move or shuffle data so we adopt a centralized communication model um so in this model we have a server that has the parameter xk and then the server broadcast xk to all the workers the workers they locally compute their local gradients uh premium f5 of xk and then they send it back to the server the server aggregates all the local gradients uses them to compute the four gradients and then updates the parameter xk with this so this is a standard parallel grid in a sense and each iteration requires back and forth communication between the server and all workers so we'll take the following high level approach um our goal is to design a first order centralized algorithm with a low communication complexity so we want the we want it to be as few communication rounds as possible in order to achieve this i will assume that the server locally holds a function fn which approximates the global objective in some sense that we'll see later and the idea is to use this local function fn to change the iterations compared to grainy sense and thus reduce the iteration complexity and how we do this is using a mirror descent and appropriate pregnant life objects so our presentation will go as follows we first use the framework of red smoothness and strong convexity to formalize this notion of how well fn approximates f and show that this is captured by a new quantity the relative condition number kappa f of phi then we use concentration inequalities to give sharp bounds on kappa for the file depending on the size of the server's datasets and finally we'll briefly discuss an accelerated plane so the brain center iterations as you have seen um can be written in the following way so xt plus one is obtained uh by xt minus step size times improvement this can be written in an alternative form and so it's a minimization problem in which we wish to minimize as the dot product between our new parameters and the gradients under the condition that um the new parameter isn't too far from the current point x t so we penalize this problem uh by euclidean distance but actually we could use other uh kind of distances so not necessarily euclidean and if you plug in a bregman live vernon's term here then you obtain new dissents and the dragon divergence relative to f to phi between x and x t is given by the following equation and the important thing is for instance that's when phi is the squared norm then this reduces to greater sense because the gradient divergence reduces to the euclidean distance so what are the convergence properties of mirror distance well is if capital f is sigma f of 5 strongly convex and uh lfo5 smooth relatively to function 5 and so this relative regularity is just that uh the hessian of capital f is lower bounded by sigma times the fashion of phi and upper bounded by l times the hashing of uh so instead of having your hessian which is low down and up bounded just by the identity it's by another matrix and the hashem have another function and so in this case um the mirrored sign converges in kappa f over phi log 1 over epsilon iterations to minimize capital f at precision epsilon where kappa is this relative condition number so that smoothness divided by red strong complexity so compared to gradient sense uh what's interesting is that the kappa f in the bound is replaced by a kappa f alpha and this is interesting because kappa f of i can be much smaller so now the main question that we ask ourselves is uh what's the best phi for a distributed use case and what's the corresponding rate of condition number and the second question is how to accelerate this to obtain a square root kappa for five seconds while following the daily algorithm introduced by xiaomi retal we choose the function phi as fm so the local objective of the server plus a small quadratic term in the quadratic case we show that the relative condition number uh can actually be of order the usual condition number divided by n where n is the size of the server's datasets so this is huge because it means that even if you have a rather small local data set of the server then you can already achieve very good speed up and this just scales very well in the sample size at also in the case of generalized linear models such as logistic regression it's much harder because you don't concentrate matrices you need a concentration uniform concentration of hatchings yet you can still obtain that the relative condition number is of kappa f of square root n which is improvable to kappa f over and in some cases so in some cases you recover actually the convergence prices from your quadratic case what's important here is that our bounds improve the existing ones that are screwed and factor as in the cooler case and they hold outside of the pure product case and we show that the quadratic performance can actually be recorded in some but in quite large number of cases so now i'll just skip over the details of acceleration uh because i don't have time in this presentation and just present you quickly some experiments so this experiment these experiments were done the kdd2010 datasets which has 8 million samples in dimension 20 million we showed we chose the logistic regression objective with a very small organization parameter 13 minus seven and we compare three algorithms so uh day which is uh basically mirror descents uh with five chosen as i showed you a spag which is our accelerated version of thing and uh heavy bolding which is uh so spag is encashed lines and heavy bolding [Music] got dotted line which is another acceleration and uh so we benchmark this algorithm with three three different sizes for the local data sets in orange you have a thousand samples and login in the server in blue you have ten thousand samples and in red you have a hundred thousand samples so even though a hundred thousand samples looks big it's just one percent of the total injection so this really isn't much and in this case you see that um you can converge in a roughly 15 iteration to very very high precisions with accelerated their descent regardless of the method i mean we're slightly faster but it's not that fast but really uh the key thing is that when you have a huge local data set then you go fast and then even when the local data set isn't so big uh so for the blue curve you have only um you have a very small account date slope of 10 to the four so it's 1 000 times smaller than the big data sets and yet thanks to acceleration we achieve a very fast convergence uh to the optimum and in particular a much faster convergence than without the statistical condition and we see that uh when you have only 1000 samples then you still achieve good performances but of course the algorithm becomes kind of so so our contributions are the following and we presented the principal convergence theory based on mere descent we show improved balance on the relative condition number and we have a theory for a synthetic acceleration but we showed that it's actually quite effective in practice uh so there are some questions and one is the tightness of the concentration results and the other one is whether it's possible to prove non-asymptotic acceleration in some specific settings because it's actually impossible to prove it in general that's a negative result so i'm looking forward to hearing questions thank you for listening

Original Description

Artificial Intelligence (AI) 20 May 2021 Speaker: Hadrien Hendrikx, INRIA (collaboration with Francis Bach, Laurent Massoulié, INRIA and Sébastien Bubeck, Microsoft) This virtual event brought together the PhD students and postdocs working on collaborative research engagements with Microsoft via the Swiss Joint Research Center, Mixed Reality & AI Zurich Lab, Mixed Reality & AI Cambridge Lab, Inria Joint Center, their academic and Microsoft supervisors as well as the wider research community. The event continued in the tradition of the annual Swiss JRC Workshops. PhD students and postdocs presented project updates and discussed their research with their supervisors and other attendants. In addition, Microsoft speakers provided updates on relevant Microsoft projects and initiatives. There were four event sessions according to research themes: Computer Vision, Systems, and AI Learn more about the Joint Research Center Workshop 2021: https://www.microsoft.com/en-us/research/event/joint-research-centre-workshop-2021/

Watch on YouTube ↗ (saves to browser)

Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Microsoft Research · Microsoft Research · 0 of 60

← Previous Next →

Frontiers in ML: Learning from Limited Labeled Data: Challenges and Opportunities for NLP

Frontiers in ML: Learning from Limited Labeled Data: Challenges and Opportunities for NLP

Microsoft Research

Frontiers in Machine Learning: Climate Impact of Machine Learning

Frontiers in Machine Learning: Climate Impact of Machine Learning

Microsoft Research

Frontiers in Machine Learning: Security and Machine Learning

Frontiers in Machine Learning: Security and Machine Learning

Microsoft Research

Hope Speech and Help Speech: Surfacing Positivity Amidst Hate

Hope Speech and Help Speech: Surfacing Positivity Amidst Hate

Microsoft Research

Early Indicators of the Effect of the Global Shift to Remote Work on People with Disabilities

Early Indicators of the Effect of the Global Shift to Remote Work on People with Disabilities

Microsoft Research

Remote Work and Well-Being

Remote Work and Well-Being

Microsoft Research

Challenges and Gratitude of Software Developers During COVID-19 Working From Home

Challenges and Gratitude of Software Developers During COVID-19 Working From Home

Microsoft Research

Towards a Practical Virtual Office for Mobile Knowledge Workers

Towards a Practical Virtual Office for Mobile Knowledge Workers

Microsoft Research

Impact of COVID-19 crisis on the future of work in India

Impact of COVID-19 crisis on the future of work in India

Microsoft Research

Empowering and Supporting Remote Software Development Team Members through a Culture of Allyship

Empowering and Supporting Remote Software Development Team Members through a Culture of Allyship

Microsoft Research

How Work From Home Affects Collaboration: Information Workers in a Natural Experiment During COVID19

How Work From Home Affects Collaboration: Information Workers in a Natural Experiment During COVID19

Microsoft Research

Phong Surface: Efficient 3D Model Fitting using Lifted Optimization

Phong Surface: Efficient 3D Model Fitting using Lifted Optimization

Microsoft Research

Managing Tasks Across the Work-Life Boundary: Opportunities, Challenges, and Directions

Managing Tasks Across the Work-Life Boundary: Opportunities, Challenges, and Directions

Microsoft Research

Microsoft Urban Futures Summer Workshop | Data Driven Urban Transformation [Day 1]

Microsoft Urban Futures Summer Workshop | Data Driven Urban Transformation [Day 1]

Microsoft Research

Microsoft Urban Futures Summer Workshop | Sensors and Data [Day 2]

Microsoft Urban Futures Summer Workshop | Sensors and Data [Day 2]

Microsoft Research

Microsoft Urban Futures Summer Workshop | Policy and Social Impact [Day 3]

Microsoft Urban Futures Summer Workshop | Policy and Social Impact [Day 3]

Microsoft Research

Directions in ML: Algorithmic foundations of neural architecture search

Directions in ML: Algorithmic foundations of neural architecture search

Microsoft Research

MineRL Competition 2020

MineRL Competition 2020

Microsoft Research

Can we make better software by using ML and AI techniques? With Chandra Maddila and Chetan Bansal

Can we make better software by using ML and AI techniques? With Chandra Maddila and Chetan Bansal

Microsoft Research

From Paper to Product

From Paper to Product

Microsoft Research

SkinnerDB: Regret Bounded Query Evaluation using RL

SkinnerDB: Regret Bounded Query Evaluation using RL

Microsoft Research

From SqueezeNet to SqueezeBERT: Developing Efficient Deep Neural Networks

From SqueezeNet to SqueezeBERT: Developing Efficient Deep Neural Networks

Microsoft Research

Programming with Proofs for High-assurance Software

Programming with Proofs for High-assurance Software

Microsoft Research

Platform for Situated Intelligence Overview

Platform for Situated Intelligence Overview

Microsoft Research

Directional Sources & Listeners in Interactive Sound Propagation using Reciprocal Wave Field Coding

Directional Sources & Listeners in Interactive Sound Propagation using Reciprocal Wave Field Coding

Microsoft Research

Galactic Bell Star Music Demo

Galactic Bell Star Music Demo

Microsoft Research

Importing Animations in Microsoft Expressive Pixels (9 of 9)

Importing Animations in Microsoft Expressive Pixels (9 of 9)

Microsoft Research

Welcome to Microsoft Expressive Pixels (1 of 9)

Welcome to Microsoft Expressive Pixels (1 of 9)

Microsoft Research

Getting Started with Microsoft Expressive Pixels (2 of 9)

Getting Started with Microsoft Expressive Pixels (2 of 9)

Microsoft Research

Creating an Image in Microsoft Expressive Pixels (3 of 9)

Creating an Image in Microsoft Expressive Pixels (3 of 9)

Microsoft Research

Creating Animations in Microsoft Expressive Pixels (4 of 9)

Creating Animations in Microsoft Expressive Pixels (4 of 9)

Microsoft Research

Managing Animation Galleries in Microsoft Expressive Pixels (5 of 9)

Managing Animation Galleries in Microsoft Expressive Pixels (5 of 9)

Microsoft Research

Creating Fragments in Microsoft Expressive Pixels (6 of 9)

Creating Fragments in Microsoft Expressive Pixels (6 of 9)

Microsoft Research

Using Layers in Microsoft Expressive Pixels (7 of 9)

Using Layers in Microsoft Expressive Pixels (7 of 9)

Microsoft Research

Exporting Animations with Microsoft Expressive Pixels (8 of 9)

Exporting Animations with Microsoft Expressive Pixels (8 of 9)

Microsoft Research

What Kind of Computation is Human Cognition? A Brief History of Thought (Episode 2/2)

What Kind of Computation is Human Cognition? A Brief History of Thought (Episode 2/2)

Microsoft Research

What Kind of Computation is Human Cognition? A Brief History of Thought (Episode 1/2)

What Kind of Computation is Human Cognition? A Brief History of Thought (Episode 1/2)

Microsoft Research

Planeverb: Interactive sound propagation for dynamic scenes using 2D wave simulation

Planeverb: Interactive sound propagation for dynamic scenes using 2D wave simulation

Microsoft Research

Making cryptography accessible, efficient, and scalable with Dr. Divya Gupta and Dr. Rahul Sharma

Making cryptography accessible, efficient, and scalable with Dr. Divya Gupta and Dr. Rahul Sharma

Microsoft Research

Beyond the mega-data center: networking multi-data center regions (SIGCOMM 2020 Talk)

Beyond the mega-data center: networking multi-data center regions (SIGCOMM 2020 Talk)

Microsoft Research

Optics for the cloud – Light at the end of the tunnel? (SIGCOMM 2020 Workshop)

Optics for the cloud – Light at the end of the tunnel? (SIGCOMM 2020 Workshop)

Microsoft Research

Beyond the mega-data center: networking multi-data center regions (SIGCOMM 2020 short talk)

Beyond the mega-data center: networking multi-data center regions (SIGCOMM 2020 short talk)

Microsoft Research

Sirius: A Flat Datacenter Network with Nanosecond Optical Switching (SIGCOMM 2020 short talk)

Sirius: A Flat Datacenter Network with Nanosecond Optical Switching (SIGCOMM 2020 short talk)

Microsoft Research

Novel Image Captioning

Novel Image Captioning

Microsoft Research

Forest Sound Scene Simulation and Bird Localization with Distributed Microphone Arrays

Forest Sound Scene Simulation and Bird Localization with Distributed Microphone Arrays

Microsoft Research

Decoding Music Attention from “EEG headphones”: a User-friendly Auditory Brain-computer Interface

Decoding Music Attention from “EEG headphones”: a User-friendly Auditory Brain-computer Interface

Microsoft Research

How does holographic storage work?

How does holographic storage work?

Microsoft Research

The physics of hologram formation in iron doped lithium niobate

The physics of hologram formation in iron doped lithium niobate

Microsoft Research

Introduction to coax: A Modular RL Package

Introduction to coax: A Modular RL Package

Microsoft Research

Directions in ML: "Neural architecture search: Coming of age"

Directions in ML: "Neural architecture search: Coming of age"

Microsoft Research

Microsoft Research AI Breakthroughs 2020: 20 minute research talks + Q&A panel

Microsoft Research AI Breakthroughs 2020: 20 minute research talks + Q&A panel

Microsoft Research

Fireside Chat with Johannes Gehrke during Microsoft Research AI Breakthroughs 2020

Fireside Chat with Johannes Gehrke during Microsoft Research AI Breakthroughs 2020

Microsoft Research

Fireside Chat with Susan Dumais during Microsoft Research AI Breakthroughs 2020

Fireside Chat with Susan Dumais during Microsoft Research AI Breakthroughs 2020

Microsoft Research

Microsoft Research AI Breakthroughs 2020: 20 minute research talks, Q&A panel, and event wrap-up

Microsoft Research AI Breakthroughs 2020: 20 minute research talks, Q&A panel, and event wrap-up

Microsoft Research

Clinical Research with FHIR

Clinical Research with FHIR

Microsoft Research

Soundscape Street Preview

Soundscape Street Preview

Microsoft Research

Tilt-Responsive Techniques for Digital Drawing Boards

Tilt-Responsive Techniques for Digital Drawing Boards

Microsoft Research

SurfaceFleet: Exploring Distributed Interactions Unbounded from Device, Application, User, and Time

SurfaceFleet: Exploring Distributed Interactions Unbounded from Device, Application, User, and Time

Microsoft Research

Haptic PIVOT: On-Demand Handhelds in VR

Haptic PIVOT: On-Demand Handhelds in VR

Microsoft Research

SurfaceFleet Supplemental Video Demonstration (UIST 2020)

SurfaceFleet Supplemental Video Demonstration (UIST 2020)

Microsoft Research

This video teaches the application of statistical preconditioning to distributed optimization problems, covering key concepts such as mirror descent, Bregman divergence, and relative condition number. The speaker discusses the benefits of statistical preconditioning, including improved convergence rates and quadratic performance. By watching this video, viewers can gain a deeper understanding of distributed optimization and statistical preconditioning, and learn how to apply these concepts to re

Key Takeaways

Use mirror descent and Bregman divergence to optimize distributed optimization problems
Minimize the dot product between new parameters and gradients
Penalize the problem by Euclidean distance
Use concentration inequalities for sharp bounds on kappa f
Evaluate the effectiveness of statistical preconditioning in various scenarios

💡 Statistical preconditioning can achieve fast convergence to the optimum in distributed optimization even with small data sets, and can improve the relative condition number.

🔒 Pro feature: Ask AI to explain this lesson →

More on: ML Maths Basics

View skill →

Important Steps I Have Followed To Improve My Data Science Skills- Sharing My Experience

Important Steps I Have Followed To Improve My Data Science Skills- Sharing My Experience

Learn Python FAST for Beginners 🚀#coding #conditionals #loops #functions

Learn Python FAST for Beginners 🚀#coding #conditionals #loops #functions

ChethanAIChronicles

“Hello, world” from scratch on a 6502 — Part 1

“Hello, world” from scratch on a 6502 — Part 1

PCA (Principal Component Analysis) in Python - Machine Learning From Scratch 11 - Python Tutorial

PCA (Principal Component Analysis) in Python - Machine Learning From Scratch 11 - Python Tutorial

ROC and AUC in R

ROC and AUC in R

StatQuest with Josh Starmer

Data Science Fundamentals: Data Cleaning in Python

Data Science Fundamentals: Data Cleaning in Python

Related Reads

Is the AI bubble about to burst? A data scientist’s honest take

A data scientist shares their honest take on whether the AI bubble is about to burst, providing an informed perspective on the technology's potential and limitations

Is the AI bubble about to burst? A data scientist’s honest take

A data scientist shares their honest take on whether the AI bubble is about to burst, providing a grounded perspective on the technology

Medium · Machine Learning

Is the AI bubble about to burst? A data scientist’s honest take

A data scientist shares their honest take on whether the AI bubble is about to burst, providing a grounded perspective on the technology

Medium · Data Science

Is the AI bubble about to burst? A data scientist’s honest take

A data scientist shares their honest take on whether the AI bubble is about to burst, providing a grounded perspective on the technology

Medium · Startup

Tackling Malaria in Africa with Technology at the Huawei ICT Competition