The Purpose of Train-Test Split in Machine Learning | How to Correctly Split Data?
Key Takeaways
Train-test split in machine learning using datasets of features and corresponding prices, emphasizing the importance of splitting data into train and test sets for model evaluation.
Full Transcript
in this video we will cover one of the most important Concepts in machine learning called train test split let's say you have a data set of sold houses collected during the past 5 years consisting of their features and corresponding prices you decide to use all the data for training now when the model is ready you have to understand how good are its predictions how the goal of machine learning models is to learn patterns from a set of data and generalize well to new unseen data since you used all your data for training you don't have observations to test your model on for that reason in most machine learning projects the data set is firstly split into train and test sets the ratio of splitting depends on the size of the available data set and a specific scenario you always wish to have more training data on the other hand you should always allocate enough data for testing your model ideally you want your trained data distribution to be the same as the test data distribution this means that if the the model adequately learns the train patterns it would provide good results for the test set also when you have enough data you can achieve this by randomly splitting the data into train and test sets which ensures that both subsets are representative of the overall data set for little data train test split can be problematic because we may not have enough samples in the splits covering all the necessary examples of the problem domain for instance you have five Villas four huts and one apartment it it does not matter how you split this data you are going to face a problem with the one apartment you have note that you should never look into the distribution of the test set this means that the train test split should be done at the very beginning of any analysis you never want to leak information from the test set to the model you should configure all the hyperparameters train your final model and only at that stage use the test data for evaluation now you may ask how to configure the hyper parameters if you don't know how it performs Ms on unseen data this is done using the validation set a topic we will talk about in the upcoming videos if you want to learn more about artificial intelligence subscribe to our channel to be aware of the new videos press the like button and let's discuss AI in the comments section
Original Description
🔥 In this video we discuss one of the most important concepts in machine learning called: train-test split. Through clear visualizations, we explain the significance of splitting the data into train and test sets and how separate subsets should be properly used. Test data is used for final model evaluation, meaning that it should not be used in any other stage. Validation set is used for model selection and configuration, which is a topic we will talk about in the upcoming videos.
🔍 Key points covered:
0:00 - Introduction.
0:07 - What if you use all data for training?
0:35 - Why data splits are used?
0:41 - The ratio of test/train split.
0:47 - The intuition for splitting the data.
1:05 - Randomly splitting.
1:16 - Little data can be problematic!
1:35 - At what stage to do the train-test split?
1:55 - What about the validation set?
2:07 - Subscribe to us!
🔔 Don't forget to like, subscribe, and hit the bell icon to stay updated with our latest videos!
🤖 Note that we use synthetic generations, such as AI-generated images and voices, to enhance the appeal and engagement of our content.
🌐 If you have any questions or topics you want us to cover, leave a comment below. Additionally, share with your thoughts about the content, how do you think we can make them better? Thanks for watching!
Watch on YouTube ↗
(saves to browser)
Sign in to unlock AI tutor explanation · ⚡30
Playlist
Uploads from AI For Beginners · AI For Beginners · 10 of 32
1
2
3
4
5
6
7
8
9
▶
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
Artificial Intelligence Explained In Simple Words | What Is AI? | Explained On A Real World Example!
AI For Beginners
AI vs. ML vs. DL vs. DS - Difference Explained | On Real World Examples | AI For Beginners
AI For Beginners
Types Of Machine Learning Algorithms | Explained On Real World Examples | ML For Beginners
AI For Beginners
Best AI Music Generator | Music Generation Tool for FREE | MusicGen developed by Meta AI
AI For Beginners
The Ultimate Guide To Supervised Learning | Explained On Binary Classification Example | Part 1
AI For Beginners
The Ultimate Guide To Supervised Learning | Classification And Regression | Part 2
AI For Beginners
Linear Regression Explained | A Beginner's Guide To Regression | The Basics You Need to Know!
AI For Beginners
Assumptions Of Linear Regression | What To Do If The Assumptions Do Not Hold? | Part 1
AI For Beginners
Checking The Assumptions Of Linear Regression | Statistical And Visual Methods | Part 2
AI For Beginners
The Purpose of Train-Test Split in Machine Learning | How to Correctly Split Data?
AI For Beginners
The Role of Validation Sets in Model Training | Train-Test-Validation Splits | Clearly explained!
AI For Beginners
Overfitting and Underfitting | Bias and Variance Tradeoff in Machine Learning | Clearly Explained!
AI For Beginners
Gradient Descent Explained | How Do ML and DL Models Learn? | Simple Explanation!
AI For Beginners
Main Types of Gradient Descent | Batch, Stochastic and Mini-Batch Explained! | Which One to Choose?
AI For Beginners
The Role of Loss Functions | Most Common Loss Functions in Machine Learning | Explained!
AI For Beginners
How to Evaluate Your ML Models Effectively? | Evaluation Metrics in Machine Learning!
AI For Beginners
8 Best Tips For Cleaning Your Data | Data Cleaning | Machine Learning, Data Preparation.
AI For Beginners
Numerical vs. Categorical Data | Represent Your Dataset Correctly!
AI For Beginners
3 Main Types of Missing Data | Do THIS Before Handling Missing Values!
AI For Beginners
7 PROVEN Strategies To Become An AI Engineer (2025 Updated)
AI For Beginners
Easiest Guide to K-Fold Cross Validation | Explained in 2 Minutes!
AI For Beginners
Normalization and Standardization | Why to Scale the Features? | ML Basics
AI For Beginners
The Ultimate Guide to Hyperparameter Tuning | Grid Search vs. Randomized Search
AI For Beginners
How is Artificial Intelligence different from Traditional Programming?
AI For Beginners
All Machine Learning Models Clearly Explained!
AI For Beginners
6 Mistakes to Avoid When Learning Machine Learning in 2025
AI For Beginners
Best Practices for Effective Data Visualization In Machine Learning!
AI For Beginners
Central Limit Theorem Intuition Explained Like You're 5!
AI For Beginners
Which Door Would You Choose? | Monty Hall Problem Explained!
AI For Beginners
All Machine Learning Concepts Explained in 18 Minutes!
AI For Beginners
What’s the Probability That Two Randomly Drawn Chords in a Circle Intersect?
AI For Beginners
Causation vs Correlation | The Most Confused Concept in Data Science
AI For Beginners
More on: ML Pipelines
View skill →Related Reads
📰
📰
📰
📰
I Built an AI System That Does in 30 Seconds What Takes a Human 10 Minutes; Here’s What Nobody…
Medium · Machine Learning
What Is MLIR and Why Does It Exist?
Dev.to · Fedor Nikolaev
Why Choosing the Right Machine Learning Development Company Matters More Than the AI Model
Medium · Machine Learning
Data privacy in AI training: federated learning, differential privacy, and synthetic data
Dev.to AI
Chapters (10)
Introduction.
0:07
What if you use all data for training?
0:35
Why data splits are used?
0:41
The ratio of test/train split.
0:47
The intuition for splitting the data.
1:05
Randomly splitting.
1:16
Little data can be problematic!
1:35
At what stage to do the train-test split?
1:55
What about the validation set?
2:07
Subscribe to us!
🎓
Tutor Explanation
DeepCamp AI