Pytorch Torchtext Tutorial 2: Built in Datasets with Example
Key Takeaways
This video demonstrates how to use PyTorch's Torchtext library to load and utilize built-in datasets, specifically the Multi30k dataset for machine translation tasks. The tutorial covers importing necessary libraries, loading the dataset, defining tokenizers, building vocabularies, and creating iterators for training, validation, and testing.
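The vocabulary-building step in this summary can be sketched without torchtext. What follows is a minimal plain-Python illustration under stated assumptions, not torchtext's own implementation: `tokenize`, `build_vocab`, `SPECIALS` and the sample sentences are hypothetical names and data introduced here, and a lowercase whitespace split stands in for the spaCy tokenizers used in the video. The `max_size` and `min_freq` arguments mirror the two options the video passes when building the vocabulary.

```python
from collections import Counter

# Assumption: unknown and padding tokens take the first two indices.
SPECIALS = ["<unk>", "<pad>"]


def tokenize(text):
    """Stand-in tokenizer: lowercase, then split on whitespace."""
    return text.lower().split()


def build_vocab(token_lists, max_size=None, min_freq=1):
    """Return (itos, stoi) for the most frequent tokens seen at least min_freq times."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by descending frequency, then alphabetically, so the order is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [tok for tok, freq in ranked if freq >= min_freq]
    if max_size is not None:
        kept = kept[:max_size]
    itos = SPECIALS + kept                      # index -> string
    stoi = {tok: idx for idx, tok in enumerate(itos)}  # string -> index
    return itos, stoi


# Hypothetical toy corpus, only for illustration.
sentences = [
    "A man walks the dog",
    "The dog sees a cat",
    "The cat sleeps",
]
itos, stoi = build_vocab(
    [tokenize(s) for s in sentences], max_size=10_000, min_freq=2
)
print(itos)  # ['<unk>', '<pad>', 'the', 'a', 'cat', 'dog']
```

With `min_freq=2`, words that appear only once (such as "man" or "sleeps" here) are left out of the vocabulary and would later be mapped to the unknown token.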
Full Transcript
[Music] What's going on guys, hope you're doing awesome, and welcome back for another PyTorch video. In this video I want to show you an example of how to use the inbuilt datasets from torchtext. As we can see on the screen, we have datasets in a few different categories, from sentiment analysis to question classification and machine translation, etc., so it's really a great tool to get started in these different areas of NLP. I'm not gonna go through how to load all of these different datasets; it's quite similar once you've seen one example. The one we're gonna go through is the Multi30k dataset, which is under machine translation, where we want to translate from English to German.

As usual we have to do our imports. We're gonna import spacy, then from torchtext.datasets import Multi30k, which is the dataset we're gonna use, and then from torchtext.data import Field and BucketIterator.

Next, I'm gonna copy this in real quick. We're gonna do spacy_eng = spacy.load("en"), and as I copied in here, to download it you would do python -m spacy download en. Of course you would have to have spaCy first, so pip install spacy. Then similarly for the German one, which is under de.

Then we want to have our tokenizers, so we're gonna define tokenize_eng, and this is gonna be, I guess, some repetition from the last video. We're just gonna return [tok.text for tok in spacy_eng.tokenizer(text)]. We're gonna copy this, because we're gonna use the same thing for the German one, with spacy_ger instead. Okay, so that's the tokenizers.

Then we're gonna do english = Field(sequential=True, use_vocab=True, tokenize=tokenize_eng, lower=True). We're gonna do pretty much the same thing for German: copy this and use the same preprocessing, but with the other tokenizer.

What we want to do now is use the Multi30k dataset. We're gonna do train_data, validation_data and test_data, and we get all of those from Multi30k.splits. Then we pass exts, which is a tuple of ".de" and ".en". The extensions tell us the source language and the target language, so the first one, German, is the source language, and we want to translate that into English. When we write the fields we need to match that order, so fields is (german, english).

Then we want to build our vocabulary, so we're gonna do english.build_vocab(train_data), and then let's do, I don't know, max_size=10000 and min_freq, I don't know, let's say 2. Then let's copy this and do german.build_vocab as well.

All right, the last thing to do is get our iterators. We're gonna do train_iterator, validation_iterator and test_iterator from BucketIterator.splits, and we pass in (train_data, validation_data, test_data). I believe I didn't mention this in the last video, where we also used BucketIterator, but it's very important that you match the order: the first one in this tuple goes to the train iterator, the validation data, which is the second, goes to the validation iterator, and so on. Then we're going to define batch_size to be, I don't know, 64, and device is equal to "cuda".

So then we're gonna do for batch in train_iterator, print(batch), and let's see how it looks. Let me make this a little bit larger. Here we see the different batches. We have .src if you want the German numericalized sentences specifically, and the same for the English ones: you would just do batch.src or batch.trg. But here we get the overview of the batch. We have 24 times 64, where 64 is the number of examples, and we see here that it's 23, so the translation is of sort of equal length, which makes sense, right? But as we see here it's not always the same length; sometimes the translation can be longer. And of course all of them are padded as well: maybe one of these translations is only ten in length, but the longest one was 28, so everything else has to be padded to be exactly 28 in length.

What I would like to show next is english.vocab.stoi, so string to index. We can send in a word and get its index from the vocabulary. Let's say we do "the": that would be the fifth index. If we do something like I then we get, yeah, 954. We can also map back: we can do english.vocab.itos, index to string, and let's say we do 5, we get back "the". So you can use this if you would like to map back. You could send a sentence into a sequence-to-sequence network and get some output, and you would like to know, well, what is this actually saying? You can go through each of the indices, map them back to words, and then you would have the translated sentence. So just a minor detail that might be interesting to know.

In the next video I'm gonna show you a true example of how you would do this in a more real project. More specifically, let's say you have two large text files: how would you go from just having those text files to having a training, validation and test set in a format where you can send it to torchtext? So check that out if you're interested in seeing, I guess, a more real example of this. With that said, thank you so much for watching this video. If you have any questions, leave them in the comment section, and I hope to see you in the next one. [Music]
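The padding and index-mapping ideas near the end of the transcript can be sketched in plain Python. This is a rough illustration under stated assumptions rather than what BucketIterator or the Field vocabulary actually do internally: the toy vocabulary and the helper names `numericalize`, `pad_batch` and `decode` are hypothetical, and the batch is stored one sentence per row, whereas the batch printed in the video has the sequence length as its first dimension (24 by 64).

```python
PAD, UNK = "<pad>", "<unk>"

# Hypothetical toy vocabulary; a real one would come from build_vocab on the training data.
itos = [UNK, PAD, "the", "a", "cat", "dog", "sleeps"]  # index -> string
stoi = {tok: idx for idx, tok in enumerate(itos)}      # string -> index


def numericalize(tokens):
    """Map tokens to indices, falling back to the unknown index."""
    return [stoi.get(tok, stoi[UNK]) for tok in tokens]


def pad_batch(sequences):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [stoi[PAD]] * (longest - len(seq)) for seq in sequences]


def decode(indices):
    """Map indices back to tokens, dropping padding."""
    return [itos[i] for i in indices if i != stoi[PAD]]


sentences = ["the cat sleeps", "a dog", "the bird"]
batch = pad_batch([numericalize(s.split()) for s in sentences])

print(batch)             # [[2, 4, 6], [3, 5, 1], [2, 0, 1]]
print(decode(batch[1]))  # ['a', 'dog']
print(decode(batch[2]))  # ['the', '<unk>']
```

Mapping a model's output indices back through `itos` in this way is how you would read a predicted translation as words; "bird" comes back as the unknown token because it is not in the toy vocabulary.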
Original Description
In this video I show you how to load and use the inbuilt datasets that are available through torchtext. The example is machine translation using the Multi30k dataset.
Resources I used to learn about torchtext:
https://torchtext.readthedocs.io/en/latest/
https://anie.me/On-Torchtext/
https://github.com/bentrevett
https://towardsdatascience.com/how-to-use-torchtext-for-neural-machine-translation-plus-hack-to-make-it-5x-faster-77f3884d95
https://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/
❤️ Support the channel ❤️
https://www.youtube.com/channel/UCkzW5JSFwvKRjXABI-UTAkQ/join
Paid Courses I recommend for learning (affiliate links, no extra cost for you):
⭐ Machine Learning Specialization https://bit.ly/3hjTBBt
⭐ Deep Learning Specialization https://bit.ly/3YcUkoI
📘 MLOps Specialization http://bit.ly/3wibaWy
📘 GAN Specialization https://bit.ly/3FmnZDl
📘 NLP Specialization http://bit.ly/3GXoQuP
✨ Free Resources that are great:
NLP: https://web.stanford.edu/class/cs224n/
CV: http://cs231n.stanford.edu/
Deployment: https://fullstackdeeplearning.com/
FastAI: https://www.fast.ai/
💻 My Deep Learning Setup and Recording Setup:
https://www.amazon.com/shop/aladdinpersson
GitHub Repository:
https://github.com/aladdinpersson/Machine-Learning-Collection
✅ One-Time Donations:
Paypal: https://bit.ly/3buoRYH
▶️ You Can Connect with me on:
Twitter - https://twitter.com/aladdinpersson
LinkedIn - https://www.linkedin.com/in/aladdin-persson-a95384153/
Github - https://github.com/aladdinpersson