Getting Started With Torchaudio | PyTorch Tutorial

AssemblyAI · Beginner ·🧠 Large Language Models ·4y ago

Key Takeaways

This PyTorch tutorial covers the basics of Torchaudio, including loading and saving audio data, applying transformations, and working with audio datasets. The tutorial demonstrates how to use Torchaudio to load and manipulate audio files, apply effects and filtering, and extract audio features.

Full Transcript

Hi everyone. In this video we learn how to work with torchaudio: how to load audio data, do basic transformations with it, and save the data again. We also learn how to do resampling, how to do data augmentation by applying some audio effects, and how to do feature extraction. So let's get started.

First, let's have a look at how we install torchaudio. We can do it the same way we install torch and torchvision: we can either say pip install torchaudio or conda install torchaudio. One thing I want to mention is that right now torchaudio is not supported on M1 Macs, but for this we can just use a Google Colab like I'm using today, and in there it works just fine. I also want to mention that they have pretty good official documentation. The code that I'm showing today is inspired by the official docs, and I will put the link in the description if you want to check it out for even more information. We also have a GitHub repository where we upload most of the code from our YouTube videos, so I will upload the Colab from today in there as well.

Now let's have a look at how we can work with torchaudio. First we import torch and torchaudio, and then we print the versions to see if everything is working fine. In our case everything is installed correctly. Then we also import a few other helper modules. We're going to work with the requests module to download some audio files, and here I have different URLs to audio files that are uploaded to S3 buckets. What I'm doing here is creating the assets folder, and then I have a helper function that calls requests.get(url) and writes the content of the response into a file. This basically downloads the files and stores them locally. If we run this and then have a look at our folder structure in our Colab, then here we find this assets folder, and in
there you should find four different files: for example, we have an MP3 file and three WAV files. So this is working.

Now let's have a look at how we can query audio metadata. We can do this either on downloaded files or on file-like objects. For example, we can use the requests module to get the URL and then get the file data by saying response.raw. This is not yet downloaded, but we can still get the metadata: we say torchaudio.info and pass in this file-like object, so the raw data, and then we might have to specify the format. In this case we have to say this is in the WAV format. Now let's run this and have a look. We can see it fetched this many bytes, and here we see the metadata. We have an AudioMetaData object, and we can see the sample rate, the number of frames, the number of channels, the bits per sample, and also the encoding method. We can do the same on downloaded files. This is the path to a downloaded file, and for this we can simply say torchaudio.info, and we don't need to specify the format. If we run this, we should see the same information as before, so this is working fine.

Then let's have a look at how we load audio files. For this we say torchaudio.load and pass in the file name, and this returns two things that we can unpack: a waveform object and a sample rate. The sample rate is just an integer, and the waveform object is a PyTorch tensor whose values are normalized to the range -1 to +1. It has this size: the first number is the number of channels, and then we have the number of frames. An audio file can also have multiple channels; you will see this later. So now we have loaded this, and we get the waveform and the sample rate. Then here I have a little helper function to play the
audio file. For this we use IPython.display, from which we import Audio and display. We convert our tensor to a NumPy array, and, like I said, waveform.shape is in the form number of channels and number of frames, so we can unpack this, and then we simply display the audio. If we run this, we see this little button to play the audio, so now I can hit play. [Audio plays: "I had that curiosity beside me at this moment"] So this works.

Then let's also plot our data. Here I define two helper functions, plot_waveform and plot_specgram, where the second one plots the spectrogram, and for this we use matplotlib. I'm not going to go into detail here, but you find the code on GitHub. We need to extract the number of channels and the number of frames, then we calculate our time axis by dividing by our sample rate. For the waveform we can simply use axis.plot, and for the spectrogram we can use axis.specgram, so matplotlib makes it pretty easy to plot this. If we run this, we define these two functions. Now let's call the plot_waveform function, and this is what the waveform looks like: here you can see our audio signal. Let's also run the plot_specgram function, and here we see the spectrogram.

Now let's have a look at how we save data again. For this we say torchaudio.save. We need to specify a path or file name, then we pass in the tensor object, so our waveform, and the sample rate. Then we could use optional parameters: for example, we could specify the encoding and the bits per sample, which by default is 32, and we could also change the format. For example, we could put in MP3 here and also change the file name to .mp3, and then it will convert it to another file format. If we run this and have a look at our folder, we now have our new saved file in here. Then let's quickly test this by loading our saved
file again. Here we load the metadata by saying torchaudio.info, then we say torchaudio.load, and then we can play the loaded file again. Here we see the metadata: bits per sample is now only 16, and the encoding is also what we specified. If we run this, it should still sound similar. [Audio plays: "I had that curiosity beside me at this moment"] So this works.

Now let's have a look at how we resample audio data. We want to resample an audio waveform from one frequency to another, and for this we can use two different approaches: torchaudio.transforms or torchaudio.functional. Here we import both of them, then we load our example file again, and we have the original waveform and the original sample rate. I think by default this is 44,100. Then let's define our new resample rate; in this case we want 32,000. The first approach uses the transform. Here we create a resampler object, so we say T.Resample, pass in the original sample rate and the new one, and also give it the data type. After having created this resampler we need to call it, so we call resampler with the waveform, and this gives us the resampled waveform. The second approach is to use F.resample directly, again with the original waveform, the original sample rate, and the new sample rate. functional.resample calculates everything on the fly, while transforms.Resample precomputes a few things, so it could be faster if you want to resample multiple waveforms with the same configuration. Otherwise both are just fine.

Then let's play all the resampled files as well. Let's run this and listen to how it sounds. This is the original one, then here we have the first resampled one [Music] and the second one. I think they all sound pretty similar, so in order to find the differences we would have to actually have a look at the data, but
basically this works.

Now let's also have a look at some different parameters that we could use for the resampling. One thing we could pass in is the low-pass filter width. Here, the larger the value, the sharper and more precise our filter is, but it's also more computationally expensive. The default filter width is six, and like I said, we could increase this for better results. Then we could also use the rolloff value. A lower rolloff reduces the amount of aliasing, but it will also reduce some of the higher frequencies. Here the default is 0.99, so for example we could use a lower value and say 0.80. We could also use a different window function. Here this is the default value, but you can play around with different ones, and for the exact details I can point you to the official documentation, where you can find more information about these parameters.

Then let's have a look at different audio data augmentation methods. We look at how we apply effects and filtering, how we add background noise, and how we apply a different codec. For applying effects we use the torchaudio.sox_effects module, and here we have two different options: we can apply the effects to a file, or we can apply the effects to a tensor. The way it works is that we define different effects as a list, and then we call torchaudio.sox_effects.apply_effects_file. Here we give it a path, so the file name, and then we pass in the effects. To get a list of all the different possible effects, you can again have a look at the documentation, or you could use torchaudio.sox_effects.effect_names, and if you run this, you see all the different effects that you can apply. In our case we want to apply a remix, then a low pass, and we also need to give it the rate. Here we define this helper function get_sample that simply applies these effects to a file and
then loads the file and returns the waveform and the sample rate. Here we get the sample from our example WAV file, and then we also want to plot this. As a second option, we apply effects to a tensor. Again we apply different effects: for example, we reduce the speed and we also apply a reverberation, which gives a dramatic feeling. Now let's run this and have a look at the different waveforms. This is the original one and this is the second one. Since we applied the reverberation, we now have two channels. Like I said earlier, a waveform object can have more than one channel, and in our case it now has two. It is also a little bit longer: here we have a little bit more than three seconds because we reduced the speed. Now let's play the audio and listen to the difference. This is the original one, and this is the one with the effects applied. [Music] I think now we should clearly hear a difference.

Then let's have a look at how we can add background noise. We can do this manually with some scaling and addition operations on our tensor. First we load a speech sample, and then we also load a sample with some background noise. Both of them are already downloaded in our assets folder. We load these two samples, and now we basically want to have them in the same shape, so here we take the same number of frames as in our speech sample, and now we have the same number of frames in our noise sample. Then we plot the noise data and have a look at it. Then we calculate the speech power and the noise power by calculating the norm of order two, and this is basically the power of our signal. Then we test different signal-to-noise ratios; here you can define different values, and the higher the number, the more clearly you should understand the signal. What we do here is calculate a scale factor, and we do this by calculating the signal to
noise ratio times the noise power divided by the speech power. Then we can create our noisy speech data by applying this operation to our original tensor: we say scale times the speech tensor plus the noise, so we add noise to the tensor, and then we divide it by two. Here again we want to plot the signal and also play it. So this is how we add the background noise, by applying scaling and then adding the noise. Now let's run this. Like I said, we also plot the different signals, and here we have the different signal-to-noise ratios and the different plots. For example, let's listen to this one. This has the signal-to-noise ratio 20, and this is the audio file, so let's play it. [Audio plays: "I had that curiosity beside me at this moment"] Now we should clearly hear the background noise but should still be able to hear the original speech. You can play around here with different values for the signal-to-noise ratio and hear how it sounds.

Now let's have a look at how we apply different codecs. If we scroll down, here we can apply different codecs, and this is also pretty simple. For this we use functional.apply_codec, and we specify the original waveform and the sample rate. Here we can define different codecs as a dictionary. For example, you could use different formats, and for some formats you could also set the encoding and the bits per sample. Then you simply apply F.apply_codec. We can run this and again listen to the different applied codecs. [Audio plays: "I had that curiosity beside me at this moment"] I think that's also pretty straightforward.

Now let's have a look at how we can extract audio features. In this case I want to show you how we can extract the spectrogram. To get the frequency makeup of an audio signal as it varies with time, you can use
Spectrogram, and for this we can simply use transforms, so T.Spectrogram. Here again we can use different parameters, and for this I again recommend that you check out the official documentation. We create this object and then we have to apply it, so we apply the spectrogram to our waveform, and then we have it. Here I also defined another helper function to plot this. Now let's run this and see how it looks. This is what the spectrogram looks like, and these are all the features I want to show you for now.

One more thing I want to mention is that torchaudio also has a datasets module. With this you can very easily download some popular audio datasets, for example the YESNO dataset. For this you simply call torchaudio.datasets and then the name, then you can specify a file name and specify download=True. This will download the dataset and also load it in memory. Then we can, for example, access the first sample, which again gives us a waveform, the sample rate, and also the label. Now it's downloaded, so let's play the first example. Now we can work with this dataset. Pretty cool.

That's all I wanted to show you for today. I hope you enjoyed this tutorial, and if so, then please hit the like button and consider subscribing to our channel. Also, if you want to test the AssemblyAI API, then don't forget to grab your free API token using the link in the description below. I hope to see you next time. Bye!
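
The background-noise step in the transcript comes down to a little arithmetic, so a small dependency-free sketch may help. It is not the notebook's code: it uses the conventional decibel definition of signal-to-noise ratio (10 * log10 of speech power over noise power) and scales the noise rather than the speech, which may differ from the scale factor computed in the video. The names power and mix_at_snr are hypothetical, and plain Python lists of synthetic samples stand in for the waveform tensors.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so speech power / noise power matches `snr_db`, then add it to `speech`.

    Hypothetical helper for illustration; not part of torchaudio.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same number of frames")
    target_ratio = 10 ** (snr_db / 10)  # decibels -> linear power ratio
    gain = math.sqrt(power(speech) / (power(noise) * target_ratio))
    scaled_noise = [gain * n for n in noise]
    mixed = [s + n for s, n in zip(speech, scaled_noise)]
    return mixed, scaled_noise


# Synthetic stand-ins (assumptions): a one-second 440 Hz tone at 16 kHz as the
# "speech" signal and uniform random samples as the "noise" signal.
sample_rate = 16000
speech = [0.5 * math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(sample_rate)]

for snr_db in (20, 10, 3):
    mixed, scaled_noise = mix_at_snr(speech, noise, snr_db)
    achieved = 10 * math.log10(power(speech) / power(scaled_noise))
    print(f"target {snr_db} dB, achieved {achieved:.2f} dB")
```

The video's version additionally divides the sum by two; here the sum is left unscaled, so mixed values can fall outside the -1 to +1 range and may need rescaling before saving or playback.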

Original Description

In this PyTorch tutorial we learn how to get started with Torchaudio and work with audio data.

Get your Free Token for AssemblyAI Speech-To-Text API 👇
https://www.assemblyai.com/?utm_source=youtube&utm_medium=referral&utm_campaign=yt_pat_4

We learn about:
- Load/Save audio data
- Transformations
- Resampling
- Data Augmentation
- Feature Extraction
- Torchaudio Datasets

Docs: https://pytorch.org/audio/stable/index.html
Code: https://github.com/AssemblyAI/youtube-tutorials

Timestamps:
00:00 Introduction
00:19 Load/Save/Transform Audio
08:16 Resampling
11:43 Data Augmentation
18:16 Feature Extraction
19:14 Torchaudio datasets

This tutorial teaches the basics of Torchaudio: loading and saving audio data, resampling and other transformations, applying effects and filtering, extracting audio features, and working with audio datasets.

Key Takeaways
  1. Install Torchaudio using pip or conda
  2. Download audio files using requests and store them locally
  3. Load audio files with torchaudio.load, which returns a waveform tensor and a sample rate
  4. Query audio metadata with torchaudio.info
  5. Apply transformations to audio data using torchaudio.transforms or torchaudio.functional
  6. Resample audio data with transforms.Resample or functional.resample
  7. Apply effects and filtering using the torchaudio.sox_effects module
  8. Extract audio features such as a spectrogram with transforms.Spectrogram
💡 Torchaudio provides a simple and efficient way to load, manipulate, and transform audio data, making it a powerful tool for audio processing and feature extraction.

Related AI Lessons

Claude AI vs ChatGPT: Which One Is Actually Better in 2026?
Compare Claude AI and ChatGPT based on real-world usage and benchmarking to determine which one is better in 2026
Medium · AI
Claude AI vs ChatGPT: Which One Is Actually Better in 2026?
Compare Claude AI and ChatGPT to determine which AI model is better for your needs in 2026
Medium · Programming
IntelliBooks: Classic RAG vs Graph RAG vs Agentic RAG – Choosing the Right AI Retrieval Architecture for Enterprise AI
Learn to choose the right AI retrieval architecture for enterprise AI between Classic RAG, Graph RAG, and Agentic RAG
Dev.to AI
Fluid, natural voice translation with Gemini 3.5 Live Translate
Learn about Gemini 3.5 Live Translate, a new voice translation technology that enables fluid and natural conversations across languages
Dev.to AI

Chapters (6)

0:00 Introduction
0:19 Load/Save/Transform Audio
8:16 Resampling
11:43 Data Augmentation
18:16 Feature Extraction
19:14 Torchaudio datasets