PyTorch at Dolby Labs - Vivek Kumar, Dolby Labs


Key Takeaways

Dolby Labs uses PyTorch for deep learning in audio processing, addressing challenges such as high dimensionality and multi-scale temporal dependency, and achieving breakthroughs in speech coding and voice conversion. The approaches discussed are spectrogram-based representations and audio-specific networks, including models like WaveNet, SampleRNN, and WaveRNN.

Full Transcript

[Music]

Hi, my name is Vivek and I lead the AI team at Dolby. In the last ten years we have had great success in applying deep learning to audio, but I'm more excited about the fact that this is just the beginning of what's possible. I plan to go over some of the challenges of using deep learning for audio, recent breakthroughs, and some applications we are working on. I hope that after the presentation you'll find yourself curious about audio AI.

Before I go into the rest of my presentation, I wanted to acknowledge how helpful PyTorch has been in our journey. My team loves it because it's easy to use, dynamic graphs make it easy to iterate over architectures, and the support is excellent. In the rare case we found a bug, the patch we provided was merged in days. More personally, my framework of choice used to be Torch, but when PyTorch came along I ended up learning Python just so that I could use PyTorch. I'm really glad I did that. I also wanted to give a quick shout-out to the SpeechBrain project, which Dolby is sponsoring. The team at Mila is developing a toolkit which would simplify doing speech research on top of PyTorch. Check out their website for details.

Here's a brief history of Dolby's innovation in the audio space. For over 50 years we have created solutions which enhance the audio experience, starting with noise reduction in the 60s to creating technologies like Dolby Digital Plus and Dolby Atmos, which are now standard for high-quality audio. There are over 11 billion devices with Dolby audio. Anytime you're listening to high-quality audio, you're likely using Dolby's technology. And in the last few years, as deep learning is fundamentally changing how audio processing is done, we are combining our audio expertise to create new state-of-the-art technologies.

Talking about challenges: a significant strength of deep learning is to work with raw samples without any handcrafted features, but this gets very challenging with audio. The first difficulty is dimensions. Consider a 64 by 64
pixel image. It contains a lot of information: you can identify the celebrity, guess their age and their ethnicity. But the equivalent number of bytes of uncompressed audio is just enough for one word.

Secondly, audio has structure at multiple time scales, ranging from milliseconds to minutes. Each sample of audio is dependent on the sample preceding it, but on a larger time scale it is also dependent on the note being played or the phoneme being spoken. Modeling all these temporal dependencies becomes challenging.

Thirdly, perception. Which of these sound different? In audio, perception matters a lot. Even though these waveforms look very different, they sound exactly the same. In most deep learning applications, L1 or L2 losses are usually good enough, but they're very brittle when it comes to audio. Things like phase shifts, alignment errors, or clock drift make these measures completely break down.

So to deal with these challenges there are two basic approaches. One is to use a spectrogram-based representation, so that audio is transformed into an image-like representation and we can use image-inspired networks. The other option is to use networks designed specifically for audio, which is what I'm gonna focus on in the next few slides.

Three years ago there was a breakthrough in speech generation, or audio generation. Two autoregressive models were developed which generated audio on a sample-by-sample basis. These models use slightly different approaches: WaveNet used dilated convolutions, whereas SampleRNN from Mila used a multi-rate RNN. But what's important here is that both these architectures were designed specifically for audio and handled the high dimensionality and the multi-level temporal dependency of audio. More recently we have had models like WaveRNN and WaveGlow, which also generate audio on a sample-by-sample basis. All these models were able to achieve a naturalness which was significantly better than all the prior approaches. In fact, these approaches were so powerful they led to a
breakthrough in speech coding, and by speech coding I mean speech compression. In the last two years both Google and Dolby have published work that drastically improves speech coding. While Google's focus has been on low bitrate, our focus has been on high-quality audio.

Describing what we do in audio coding is always challenging, so I'm borrowing an analogy that our partners at Netflix used to describe video coding. She is Marie Kondo, the author of The Life-Changing Magic of Tidying Up. She has a decluttering show on Netflix, and the approach she uses for decluttering is to pick up each item and discard everything which does not give joy. After you have discarded most of your possessions, she has a great method of folding everything into squares so that they can be efficiently packed. We do something similar in speech coding. We analyze the speech to identify what is essential, discarding everything else. Then we pack these bits in the way which is the most efficient, and on the decoder side we unpack the bits and reconstruct the speech. This way of encoding and decoding has been used for decades, but at really low bitrates, when we have discarded a lot of information, it's hard to synthesize speech which is high quality. But now, with deep learning, we have powerful generative models which can generate high-quality speech which is natural sounding, giving your joy back.

Getting a bit deeper: the first layer of SampleRNN is an MLP, which is then connected to a stack of GRU RNNs running at different time resolutions. The lowest layer runs at sample resolution, whereas the topmost layer runs at a 10 millisecond, or 160 sample, resolution. The idea is that these RNNs focus on different levels of abstraction, phoneme identity on the top and fine details on the bottom, and this is how the model is able to manage the multi-level temporal dependency of audio. Without conditioning, SampleRNN babbles: it produces sounds which vaguely sound like speech but do not make any sense.
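
The tier timing described above can be sketched in a few lines of plain Python. This is an illustration of the clock schedule only, not Dolby's or Mila's implementation: the 16 kHz sample rate is inferred from "10 millisecond or 160 sample", the middle tier period of 16 samples is a hypothetical value added to show a stack of more than two tiers, and the function names are made up for this example.

```python
# Sketch of the clock schedule in a multi-rate RNN stack (illustrative only).
SAMPLE_RATE_HZ = 16_000  # assumption: 160 samples per 10 ms implies 16 kHz

# Samples per state update for each tier, bottom to top. The talk gives 1 and
# 160; the middle value is a hypothetical intermediate tier.
TIER_PERIODS = (1, 16, 160)


def tiers_updating_at(sample_index, periods=TIER_PERIODS):
    """Return the tiers (0 = bottom) that compute a new state at this sample."""
    return [tier for tier, period in enumerate(periods) if sample_index % period == 0]


def updates_per_second(periods=TIER_PERIODS, sample_rate_hz=SAMPLE_RATE_HZ):
    """How many times per second each tier advances its hidden state."""
    return [sample_rate_hz // period for period in periods]


print(updates_per_second())    # [16000, 1000, 100]
print(tiers_updating_at(0))    # [0, 1, 2]: every tier ticks on a frame boundary
print(tiers_updating_at(17))   # [0]: only the sample-level tier ticks
```

The top tier sees 160 times fewer steps than the bottom one, which is one way to read the claim that upper tiers track slow structure such as phoneme identity while the bottom tier handles fine detail.
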
To control SampleRNN, we condition it using quantized vocoder parameters from the bitstream. The bitstream is generated using an internal vocoder which is able to capture the essence of speech at really low bitrates. If you're interested in learning more, we have a poster; please check it out.

Here are the listening test results. AMR wideband is the current codec which is being used in our cell phones. SILK is the current state-of-the-art codec, which at a lower bitrate is able to generate a quality better than AMR wideband. With our solution, SampleRNN, even at 6.4 kilobits per second we were able to achieve a quality which was comparable to or better than SILK at 16 kilobits per second. Just to give you an idea of how significant this is, the last breakthrough in speech coding happened over 30 years ago, when CELP came out. CELP reduced the bitrate by approximately 20 to 30 percent. This is a 2.5 times improvement. This work, and similar work done by Google, is the biggest step function speech coding has ever seen.

Now let's talk about a completely different application: voice conversion. Voice conversion is a technique where we can make somebody's speech sound like that of a target speaker. The way we achieved it was by using an architecture similar to audio coding, but instead of conditioning it on codec parameters, we conditioned it on content and target speaker embeddings. These target speaker embeddings end up learning the style of the target speaker, like how they pronounce their phonemes, their fundamental frequency, their accents. Our quality was much better than conventional voice conversion techniques, and the results were published at Interspeech 2018.

Let me show you a quick demo. The first audio is the source speaker, from which we derive the content. The next is the target, whose style we are trying to emulate. And finally there is the synthesized speech, which should sound like the target
speech. So, the input speech: "Those who hold the property think so too, and so far it is fortunate." The target: "His flatteries delude, and his professions of affection gratify you." The synthesized speech: "Those who hold the property think so too, and so far it is fortunate." Amazing, isn't it? We are very excited about the potential here.

So hopefully this has provided some context on using deep learning for audio, some challenges, and some recent developments. And thank you, PyTorch, for being awesome partners along the way. Hopefully I have inspired some of you to be more curious and excited about the work happening in this area. Personally, I'm really excited by the progress the community has made, but I'm more amazed by the fact that this is just the beginning, and it's up to us to define where this technology takes us. Thank you.

If you are interested in learning more, I will be hanging out next to our poster. Most of my team will be there as well. Also, feel free to contact me on Twitter.

[Music] [Applause]
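
The perception point in the talk, that waveform-domain L1 or L2 losses break down under a phase shift, can be illustrated with a small standard-library sketch. It compares two copies of the same tone that differ only in phase, once sample by sample and once through DFT magnitudes. The magnitude comparison is a stand-in chosen for this example, not a loss the talk names, and the sample rate, tone frequency, and function names are all assumptions.

```python
import cmath
import math

SAMPLE_RATE_HZ = 16_000  # assumed sample rate
N = 160                  # 10 ms of audio at the assumed rate
FREQ_HZ = 400.0          # hypothetical test tone; exactly 4 cycles fit in N samples


def tone(phase):
    """A sine tone with the given starting phase, in radians."""
    return [math.sin(2 * math.pi * FREQ_HZ * n / SAMPLE_RATE_HZ + phase) for n in range(N)]


def l1_distance(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def dft_magnitudes(signal):
    """Magnitude of each DFT bin up to Nyquist (naive O(N^2) transform)."""
    n = len(signal)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]


reference = tone(0.0)
shifted = tone(math.pi / 2)  # same tone, shifted by a quarter cycle

waveform_error = l1_distance(reference, shifted)
spectral_error = l1_distance(dft_magnitudes(reference), dft_magnitudes(shifted))

print(f"L1 on waveform:       {waveform_error:.3f}")  # large, close to 0.9
print(f"L1 on DFT magnitudes: {spectral_error:.2e}")  # floating-point noise only
```

The two tones would sound the same, yet the sample-wise error is nearly as large as the signal itself, while the magnitude comparison sees no difference.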

Original Description

Hear how Dolby Labs is using PyTorch to develop deep learning for audio, and learn about the challenges that audio AI presents and the breakthroughs and applications they’ve built at Dolby to push the field forward.


This video discusses Dolby Labs' use of PyTorch for deep learning in audio processing, including challenges, solutions, and breakthroughs in speech coding and voice conversion. Viewers can learn about the applications of deep learning in audio and the techniques used to overcome these challenges. The video also covers spectrogram-based representations, audio-specific networks, and models like WaveNet, SampleRNN, and WaveRNN.

Key Takeaways

Understand the challenges of audio processing
Learn about spectrogram-based representation and audio-specific networks
Study the WaveRNN and SampleRNN models
Explore applications of deep learning in audio
Implement deep learning pipelines for audio processing

💡 Deep learning can be used to improve audio processing by addressing challenges such as high dimensionality and temporal dependency, and achieving breakthroughs in speech coding and voice conversion.
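
The high-dimensionality point can be made concrete with the talk's own comparison between a 64 by 64 pixel image and the same number of bytes of uncompressed audio. The following is a back-of-the-envelope sketch. The talk does not state the pixel format or audio format, so 8-bit RGB and 16 kHz, 16-bit mono PCM are assumptions.

```python
# Back-of-the-envelope version of the talk's image-versus-audio comparison.
IMAGE_WIDTH = IMAGE_HEIGHT = 64
BYTES_PER_PIXEL = 3        # assumption: 8-bit RGB
SAMPLE_RATE_HZ = 16_000    # assumption: wideband speech
BYTES_PER_SAMPLE = 2       # assumption: 16-bit mono PCM

image_bytes = IMAGE_WIDTH * IMAGE_HEIGHT * BYTES_PER_PIXEL
audio_bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE
audio_seconds = image_bytes / audio_bytes_per_second

print(f"{image_bytes} bytes of image = {audio_seconds:.3f} s of audio")
# prints "12288 bytes of image = 0.384 s of audio"
```

Under these assumptions the image's byte budget buys well under half a second of speech, which is roughly the length of a single short word.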
