Pytorch Torchtext Tutorial 2: Built in Datasets with Example

Aladdin Persson · Beginner ·🧬 Deep Learning ·6y ago

Key Takeaways

This video demonstrates how to use PyTorch's Torchtext library to load and utilize built-in datasets, specifically the Multi30k dataset for machine translation tasks. The tutorial covers importing necessary libraries, loading the dataset, defining tokenizers, building vocabularies, and creating iterators for training, validation, and testing.

Full Transcript

[Music]

What's going on guys, hope you're doing awesome, and welcome back for another PyTorch video. In this video I want to show you an example of how to use the inbuilt datasets from torchtext. As we can see on the screen, we have datasets in a few different categories, from sentiment analysis to question classification and machine translation, etc., so it's really a great tool to get started in these different areas of NLP. I'm not going to go through how to load all of these different datasets; it's quite similar from just seeing one example. The one we're going to go through is the Multi30k dataset, which is under machine translation, where we want to translate from English to German.

As usual we have to do our imports. We're going to import spacy, then from torchtext.datasets import Multi30k, which is the dataset we're going to use, and then from torchtext.data import Field and BucketIterator.

Next, I'm going to copy this in real quick. We're going to do spacy_eng = spacy.load("en"), and as I copied in here, to download it you would do python -m spacy download en. Of course you would have to have spaCy first, so pip install spacy. Then similarly for the German one, which is under "de".

Then we want to have our tokenizers. We're going to define tokenize_eng, and this is going to be some repetition from the last video. We just return tok.text for tok in spacy_eng.tokenizer(text). We're going to copy this because we're going to use the same thing for the German one, with spacy_ger instead. So that's the tokenizers.

Then we're going to do english = Field(sequential=True, use_vocab=True, tokenize=tokenize_eng, lower=True). Then we do pretty much the same thing for German: we use the same preprocessing, but with the other tokenizer.

What we want to do now is use the Multi30k dataset. We're going to do train_data, validation_data, test_data, and we get all of those from Multi30k.splits. Then we pass exts as a tuple, ".de" and then ".en". The extensions tell us the source language and the target language. The first one is German, so that would be the source language, and we want to translate that into English. When we write the fields we need to match that, so this would be german, english.

Then we want to build our vocabulary. We're going to do english.build_vocab(train_data), and then let's do, I don't know, max_size=10000 and min_freq, I don't know, let's do 2. Then let's copy this and do german.build_vocab.

The last thing to do is get our iterators. We're going to do train_iterator, validation_iterator, test_iterator = BucketIterator.splits, and then we pass train_data, validation_data, test_data. I believe I didn't mention this in the last video when we used BucketIterator, but it's very important that you match these: the first one in this tuple goes to the train iterator, the validation data, which is the second, goes to the validation iterator, and so on. Then we define a batch size of, I don't know, 64, and device equal to "cuda".

Then we do for batch in train_iterator, print(batch), and let's see how it looks. Let's make this a little bit larger. Here we see the different batches. We have .src if you want specifically the German numericalized sentences, and the same for the English ones; you would just do batch.src or batch.trg. But here we get the overview of the batch. We have 24 by 64, where 64 is the number of examples, and we see here that the other one is 23, so the translation is of roughly equal length, which makes sense. But as we see here, it's not always the same length; sometimes the translation can be longer. And of course all of them are padded as well. Maybe one of these translations is only ten in length, but the longest one was 28, so everything else has to be padded to be exactly 28 in length.

What I would also like to show is that we can do english.vocab.stoi, so string to index. We can send in a word and get its index from the vocabulary. Let's say we do "the"; that would be the fifth index. If we do something like "I" then we get, yeah, 954. We can also map back: we can do english.vocab.itos, index to string, and let's say we do 5, we get back "the". You can use this if you would like to map back. You could send a batch into a sequence-to-sequence network and get some output, and you would like to know what it is actually saying. You can go through each index, map it back to the word, and then you would have the translated sentence. So just a minor detail that might be interesting to know.

In the next video I'm going to show you a true example of how you would do this in a more real project. More specifically, let's say you have two large text files: how would you go from just having those text files to having a training, validation, and test set in a format where you can send it to torchtext? So check that out if you're interested in seeing more of a real example of this. With that said, thank you so much for watching this video. If you have any questions, leave them in the comment section, and I hope to see you in the next one.

[Music]
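The string-to-index and index-to-string lookups described at the end of the transcript can be illustrated without torchtext installed. The block below is a minimal pure-Python sketch, not torchtext's implementation. It assumes the conventions of the legacy vocabulary: special tokens first (with `<unk>` at index 0 and `<pad>` at index 1), then tokens ordered from most to least frequent, with `max_size` and `min_freq` limiting what is kept. The helper names `build_vocab`, `numericalize`, and `decode` and the toy corpus are made up for illustration.

```python
from collections import Counter

# Assumed convention: special tokens occupy the first indices.
SPECIALS = ["<unk>", "<pad>"]


def build_vocab(tokenized_sentences, max_size=None, min_freq=1):
    """Return (stoi, itos) holding the most frequent tokens."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Most frequent first; ties are broken alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(SPECIALS)
    for token, freq in ordered:
        if freq < min_freq:
            break  # everything after this point is rarer still
        if max_size is not None and len(itos) - len(SPECIALS) >= max_size:
            break  # max_size counts regular tokens only, not specials
        itos.append(token)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos


def numericalize(tokens, stoi):
    """Map tokens to indices; unknown tokens fall back to <unk>."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]


def decode(indices, itos):
    """Map indices back to tokens, e.g. to read a model's output."""
    return [itos[i] for i in indices]


corpus = [
    "a man is walking the dog".split(),
    "the dog is running".split(),
    "a woman is reading the paper".split(),
]
stoi, itos = build_vocab(corpus, max_size=10000, min_freq=2)

ids = numericalize("the cat is running".split(), stoi)
print(ids)                # [3, 0, 2, 0]
print(decode(ids, itos))  # ['the', '<unk>', 'is', '<unk>']
```

With `min_freq=2`, words seen only once ("running") are treated exactly like words never seen ("cat"): both map to `<unk>`, so decoding cannot recover them.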

Original Description

In this video I show you how to use and load the inbuilt datasets that are available for us through torchtext. In the example I show an example of machine translation using Multi30k dataset.

Resources I used to learn about torchtext:
https://torchtext.readthedocs.io/en/latest/
https://anie.me/On-Torchtext/
https://github.com/bentrevett
https://towardsdatascience.com/how-to-use-torchtext-for-neural-machine-translation-plus-hack-to-make-it-5x-faster-77f3884d95
https://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/

GitHub Repository: https://github.com/aladdinpersson/Machine-Learning-Collection


This video teaches how to use Torchtext to load and prepare the Multi30k dataset for machine translation tasks, covering key concepts such as tokenization, vocabulary building, and iterator creation. The tutorial provides a hands-on example of how to utilize PyTorch's libraries for NLP tasks.
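The iterator creation mentioned here has two visible effects in the video: sentences of similar length end up in the same batch, and every sentence in a batch is padded to the longest one, giving shapes such as 24 by 64. The block below is a simplified pure-Python sketch of that idea, not the actual BucketIterator algorithm. The helper names `bucket_batches` and `pad_batch`, the padding index of 1, and the toy index sequences are assumptions made for illustration.

```python
import random

PAD_IDX = 1  # assumed index of the <pad> token


def bucket_batches(examples, batch_size, seed=0):
    """Group sequences of similar length, then shuffle the batch order."""
    ordered = sorted(examples, key=len)
    batches = [
        ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)
    ]
    random.Random(seed).shuffle(batches)
    return batches


def pad_batch(batch, pad_idx=PAD_IDX):
    """Pad to the longest sequence and return rows of shape [seq_len][batch]."""
    max_len = max(len(seq) for seq in batch)
    padded = [seq + [pad_idx] * (max_len - len(seq)) for seq in batch]
    # Transpose so each row is one time step across all examples.
    return [list(step) for step in zip(*padded)]


# Toy "numericalized" sentences of different lengths.
sentences = [[5, 8, 2], [4], [7, 7, 9, 3, 6], [2, 2], [9, 4, 4, 4], [6, 5, 3]]

for batch in bucket_batches(sentences, batch_size=2):
    grid = pad_batch(batch)
    print(f"{len(grid)} x {len(grid[0])}: {grid}")
```

Grouping by length first keeps the amount of padding per batch small, which is the reason to prefer a bucketing iterator over plain random batching for variable-length text.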

Key Takeaways
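  1. Import spaCy and, from torchtext, Multi30k, Field, and BucketIterator
  2. Load the spaCy models and define tokenizers for the source (German) and target (English) languages
  3. Define a Field for each language using its tokenizer
  4. Load the Multi30k training, validation, and test splits
  5. Build vocabularies for both languages from the training data
  6. Create iterators for training, validation, and testing with a batch size and device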
💡 The ability to utilize pre-built datasets and libraries such as Torchtext can significantly streamline the process of building and training NLP models, allowing developers to focus on higher-level tasks.
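The steps above can be sketched as a single function. This is a reconstruction from the spoken description, not the author's exact file. It assumes the legacy torchtext API (`Field`, `BucketIterator`, `Multi30k.splits`), which lives in `torchtext.data` up to version 0.8, moved to `torchtext.legacy` in 0.9 to 0.11, and is not available in later releases. The function name `build_multi30k_iterators` and its parameters are hypothetical. The video loads spaCy with the shortcuts "en" and "de"; the sketch defaults to the full model names `en_core_web_sm` and `de_core_news_sm`, since spaCy 3 no longer accepts the shortcuts.

```python
def build_multi30k_iterators(batch_size=64, device="cuda",
                             english_model="en_core_web_sm",
                             german_model="de_core_news_sm"):
    """Sketch of the tutorial's pipeline; requires legacy torchtext and spaCy."""
    # Third-party imports stay inside the function so that defining it
    # does not require spaCy or torchtext to be installed.
    import spacy
    try:
        # torchtext 0.9 to 0.11 keep the old API under torchtext.legacy
        from torchtext.legacy.data import Field, BucketIterator
        from torchtext.legacy.datasets import Multi30k
    except ImportError:
        # torchtext 0.8 and earlier, as used in the video
        from torchtext.data import Field, BucketIterator
        from torchtext.datasets import Multi30k

    spacy_eng = spacy.load(english_model)
    spacy_ger = spacy.load(german_model)

    def tokenize_eng(text):
        return [tok.text for tok in spacy_eng.tokenizer(text)]

    def tokenize_ger(text):
        return [tok.text for tok in spacy_ger.tokenizer(text)]

    english = Field(sequential=True, use_vocab=True,
                    tokenize=tokenize_eng, lower=True)
    german = Field(sequential=True, use_vocab=True,
                   tokenize=tokenize_ger, lower=True)

    # exts names the (source, target) file extensions;
    # fields must be given in the same order.
    train_data, valid_data, test_data = Multi30k.splits(
        exts=(".de", ".en"), fields=(german, english)
    )

    # Vocabularies are built from the training split only.
    english.build_vocab(train_data, max_size=10000, min_freq=2)
    german.build_vocab(train_data, max_size=10000, min_freq=2)

    # The order of the returned iterators matches the order of the datasets.
    train_iter, valid_iter, test_iter = BucketIterator.splits(
        (train_data, valid_data, test_data),
        batch_size=batch_size, device=device,
    )
    return (train_iter, valid_iter, test_iter), (german, english)


# Usage (needs the packages, the spaCy models, and the dataset download):
# (train_iter, _, _), (german, english) = build_multi30k_iterators(device="cpu")
# batch = next(iter(train_iter))
# print(batch.src.shape, batch.trg.shape)  # each is [seq_len, batch_size]
# print(english.vocab.stoi["the"], english.vocab.itos[5])
```

The function is only defined here, not called, because running it downloads the dataset and needs the spaCy models installed. Passing `device="cpu"` is a reasonable choice on machines without a GPU.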
