PyTorch at Dolby Labs - Vivek Kumar, Dolby Labs
Key Takeaways
Dolby Labs uses PyTorch for deep learning in audio processing, addressing challenges such as high dimensionality and multi-scale temporal dependency, and achieving breakthroughs in speech coding and voice conversion. The solutions involve spectrogram-based representations, audio-specific networks, and models like WaveRNN and SampleRNN.
Full Transcript
[Music] Hi, my name is Vivek and I lead the AI team at Dolby. In the last ten years we have had great success in applying deep learning to audio, but I'm more excited about the fact that this is just the beginning of what's possible. I plan to go over some of the challenges of using deep learning for audio, recent breakthroughs, and some applications we are working on. I hope that after the presentation you'll find yourself curious about audio AI.

Before I go into the rest of my presentation, I wanted to acknowledge how helpful PyTorch has been in our journey. My team loves it because it's easy to use, dynamic graphs make it easy to iterate over architectures, and the support is excellent. In the rare case we found a bug, the patch we provided was merged in days. More personally, my framework of choice used to be Torch, but when PyTorch came along I ended up learning Python just so that I could use PyTorch. I'm really glad I did that. I also wanted to give a quick shout-out to the SpeechBrain project, which Dolby is sponsoring. The team at Mila is developing a toolkit which would simplify doing speech research on top of PyTorch. Check out their website for details.

Here's a brief history of Dolby's innovation in the audio space. For over 50 years we have created solutions which enhance the audio experience, starting with noise reduction in the 60s, to creating technologies like Dolby Digital Plus and Dolby Atmos, which are now standard for high-quality audio. There are over 11 billion devices with Dolby audio. Anytime you're listening to high-quality audio, you're likely using Dolby's technology. And in the last few years, as deep learning is fundamentally changing how audio processing is done, we are combining our audio expertise to create new state-of-the-art technologies.

Talking about challenges: a significant strength of deep learning is to work with raw samples without any handcrafted features, but this gets very challenging with audio. The first difficulty is dimensions. Consider a 64 by 64
pixel image. It contains a lot of information: you can identify the celebrity, guess their age and their ethnicity. But the equivalent bytes of uncompressed audio is just enough for one word. Secondly, audio has structure at multiple time scales, ranging from milliseconds to minutes. Each sample of audio is dependent on the sample preceding it, but on a larger time scale it is also dependent on the note being played or the phoneme being spoken. Modeling all these temporal dependencies becomes challenging. Thirdly, perception: which of these sound different? In audio, perception matters a lot. Even though these waveforms look very different, they sound exactly the same. In most deep learning applications, L1 or L2 losses are usually good enough, but they're very brittle when it comes to audio. Things like phase shifts, alignment errors, or clock drift make these measures completely break down.

So to deal with these challenges there are two basic approaches. One is to use a spectrogram-based representation, so that audio is transformed into an image-like representation and we can use image-inspired networks. The other option is to use networks designed specifically for audio, which is what I'm going to focus on in the next few slides.

Three years ago there was a breakthrough in speech generation, or audio generation. Two autoregressive models were developed which generated audio on a sample-by-sample basis. These models use slightly different approaches: WaveNet used dilated convolutions, whereas SampleRNN from Mila used a multi-rate RNN. But what's important here is that both these architectures were designed specifically for audio and handled the high dimensionality and the multi-level temporal dependency of audio. More recently we have had models like WaveRNN and WaveGlow, which also generate audio on a sample-by-sample basis. All these models were able to achieve a naturalness which was significantly better than all the prior approaches. In fact, these approaches were so powerful they led to a
breakthrough in speech coding, and by speech coding I mean speech compression. In the last two years both Google and Dolby have published works that drastically improve speech coding. While Google's focus has been on low bitrate, our focus has been on high-quality audio.

Describing what we do in audio coding is always challenging, so I'm borrowing an analogy that our partners at Netflix used to describe video coding. She is Marie Kondo, the author of The Life-Changing Magic of Tidying Up. She has a decluttering show on Netflix, and the approach she uses for decluttering is to pick up each item and discard everything which does not give joy. After you have discarded most of your possessions, she has a great method of folding everything into squares so that they can be efficiently packed. We do something similar in speech coding. We analyze to identify what is essential, discarding everything else. Then we pack these bits in a way which is the most efficient, and on the decoder side we unpack the bits and reconstruct the speech. This way of encoding and decoding has been used for decades, but at really low bitrates, when we have discarded a lot of information, it's hard to synthesize speech which is high quality. But now, with deep learning, we have powerful generative models which can generate high-quality speech which is natural sounding, giving your joy back.

Getting a bit deeper: the first layer of SampleRNN is an MLP, which is then connected to a stack of GRU RNNs running at different time resolutions. The lowest layer is running at sample resolution, whereas the topmost layer is running at a 10 millisecond, or 160 sample, resolution. The idea is that these RNNs focus on different levels of abstraction: phoneme identity on the top, fine details on the bottom. This is the way it is able to manage the multi-level temporal dependency of audio. Without conditioning, SampleRNN babbles, producing sounds which vaguely sound like speech but do not make any sense.
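The tier timing described above can be made concrete with a little arithmetic. The sketch below is only an illustration, not Dolby's implementation: the 16 kHz sample rate is inferred from the talk's "10 millisecond or 160 sample" figure, and the middle tier's frame size of 16 samples is a hypothetical value chosen for the example, since the talk names only the top and bottom tiers.

```python
# Hypothetical tier layout: only the 160-sample top tier and the
# sample-level bottom tier come from the talk; the 16-sample middle
# tier is an assumption made for illustration.
FRAME_SIZES = [160, 16, 1]  # samples per update, top tier first


def sample_rate_from_frame(frame_samples, frame_ms):
    """Sample rate in Hz implied by a frame of frame_samples lasting frame_ms."""
    return frame_samples * 1000 // frame_ms


def tiers_updating_at(sample_index, frame_sizes=FRAME_SIZES):
    """Return the frame sizes of the tiers that start a new frame at this sample."""
    return [size for size in frame_sizes if sample_index % size == 0]


def updates_per_second(sample_rate, frame_sizes=FRAME_SIZES):
    """How many times per second each tier updates its state."""
    return {size: sample_rate // size for size in frame_sizes}


sample_rate = sample_rate_from_frame(160, 10)
print(sample_rate)
print(updates_per_second(sample_rate))
print(tiers_updating_at(0), tiers_updating_at(16), tiers_updating_at(17))
```

Under these assumptions the top tier updates only 100 times per second, so it can follow slowly changing structure such as phoneme identity, while the bottom tier runs once for every sample.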
To control SampleRNN, we condition it using quantized vocoder parameters from the bitstream. The bitstream is generated using an internal vocoder which is able to capture the essence of speech at really low bitrates. If you're interested in learning more, we have a poster. Please check it out.

Here are the listening test results. AMR Wideband is the current codec which is being used in our cell phones. SILK is the current state-of-the-art codec, which at a lower bitrate is able to generate a quality better than AMR Wideband. With our solution, SampleRNN, even at 6.4 kilobits per second we were able to achieve a quality which was comparable to or better than SILK at 16 kilobits per second. Just to give you an idea of how significant this is: the last breakthrough which happened in speech coding was over 30 years ago, when CELP came out. CELP reduced the bitrate by approximately 20 to 30 percent. This is a 2.5 times improvement. This work, and similar work done by Google, is the biggest step function speech coding has ever seen.

Now let's talk about a completely different application: voice conversion. Voice conversion is a technique where we can make somebody's speech sound like that of a target speaker. The way we achieved it was by using an architecture similar to audio coding, but instead of conditioning it on codec parameters, we conditioned it on content and target speaker embeddings. These target speaker embeddings end up learning the style of the target speaker, like how they pronounce their phonemes, their fundamental frequency, their accents. Our quality was much better than conventional voice conversion techniques, and the results were published at Interspeech 2018.

Let me show you a quick demo. The first audio is a source speaker, from which we derive the content. The next is a target whose style we are trying to emulate. And finally there is the synthesized speech, which should sound like the target
speech. So, the input speech: "Those who hold the property think so too, and so far it is fortunate." The target: "His flatteries delude and his professions of affection gratify you." The synthesized speech: "Those who hold the property think so too, and so far it is fortunate." Amazing, isn't it? We are very excited about the potential here.

So hopefully this has provided some context on using deep learning for audio, some challenges, and some recent developments. And thank you, PyTorch, for being awesome partners along the way. Hopefully I have inspired some of you to be more curious and excited about the work happening in this area. Personally, I'm really excited by the progress the community has made, but I'm more amazed by the fact that this is just the beginning, and it's up to us to define where this technology takes us. Thank you. If you are interested in learning more, I will be hanging out next to our poster. Most of my team will be there as well. Also, feel free to contact me on Twitter. [Music] [Applause]
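The point made in the talk about sample-level losses being brittle for audio can be checked numerically. The sketch below is an illustration under assumed values (a 440 Hz tone sampled at 16 kHz, neither taken from the talk): it compares a sine wave with a copy shifted in phase by 180 degrees, which a listener would generally hear as the same steady tone, and computes the L2 (mean squared) error between the two.

```python
import math

SAMPLE_RATE = 16000  # Hz, assumed for illustration
FREQ = 440.0         # Hz, assumed test tone


def sine(freq, phase, n_samples, sample_rate=SAMPLE_RATE):
    """Generate n_samples of a unit-amplitude sine with the given phase offset."""
    return [
        math.sin(2 * math.pi * freq * n / sample_rate + phase)
        for n in range(n_samples)
    ]


def mse(a, b):
    """Mean squared error between two equal-length sample lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


reference = sine(FREQ, 0.0, SAMPLE_RATE)   # one second of audio
shifted = sine(FREQ, math.pi, SAMPLE_RATE)  # same tone, phase shifted by 180 degrees

signal_power = mse(reference, [0.0] * len(reference))
error = mse(reference, shifted)
print(signal_power, error)
```

The error comes out at roughly four times the signal's own power even though only the phase changed, which is the kind of breakdown the talk attributes to phase shifts and alignment errors.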
Original Description
Hear how Dolby Labs is using PyTorch to develop deep learning for audio, and learn about the challenges that audio AI presents and the breakthroughs and applications they’ve built at Dolby to push the field forward.