The Purpose of Train-Test Split in Machine Learning | How to Correctly Split Data?

AI For Beginners · Beginner ·📐 ML Fundamentals ·2y ago

Skills: ML Pipelines80%Supervised Learning60%

Key Takeaways

Train-test split in machine learning using datasets of features and corresponding prices, emphasizing the importance of splitting data into train and test sets for model evaluation.

Full Transcript

in this video we will cover one of the most important Concepts in machine learning called train test split let's say you have a data set of sold houses collected during the past 5 years consisting of their features and corresponding prices you decide to use all the data for training now when the model is ready you have to understand how good are its predictions how the goal of machine learning models is to learn patterns from a set of data and generalize well to new unseen data since you used all your data for training you don't have observations to test your model on for that reason in most machine learning projects the data set is firstly split into train and test sets the ratio of splitting depends on the size of the available data set and a specific scenario you always wish to have more training data on the other hand you should always allocate enough data for testing your model ideally you want your trained data distribution to be the same as the test data distribution this means that if the the model adequately learns the train patterns it would provide good results for the test set also when you have enough data you can achieve this by randomly splitting the data into train and test sets which ensures that both subsets are representative of the overall data set for little data train test split can be problematic because we may not have enough samples in the splits covering all the necessary examples of the problem domain for instance you have five Villas four huts and one apartment it it does not matter how you split this data you are going to face a problem with the one apartment you have note that you should never look into the distribution of the test set this means that the train test split should be done at the very beginning of any analysis you never want to leak information from the test set to the model you should configure all the hyperparameters train your final model and only at that stage use the test data for evaluation now you may ask how to configure the hyper parameters if you don't know how it performs Ms on unseen data this is done using the validation set a topic we will talk about in the upcoming videos if you want to learn more about artificial intelligence subscribe to our channel to be aware of the new videos press the like button and let's discuss AI in the comments section

Original Description

🔥 In this video we discuss one of the most important concepts in machine learning called: train-test split. Through clear visualizations, we explain the significance of splitting the data into train and test sets and how separate subsets should be properly used. Test data is used for final model evaluation, meaning that it should not be used in any other stage. Validation set is used for model selection and configuration, which is a topic we will talk about in the upcoming videos. 🔍 Key points covered: 0:00 - Introduction. 0:07 - What if you use all data for training? 0:35 - Why data splits are used? 0:41 - The ratio of test/train split. 0:47 - The intuition for splitting the data. 1:05 - Randomly splitting. 1:16 - Little data can be problematic! 1:35 - At what stage to do the train-test split? 1:55 - What about the validation set? 2:07 - Subscribe to us! 🔔 Don't forget to like, subscribe, and hit the bell icon to stay updated with our latest videos! 🤖 Note that we use synthetic generations, such as AI-generated images and voices, to enhance the appeal and engagement of our content. 🌐 If you have any questions or topics you want us to cover, leave a comment below. Additionally, share with your thoughts about the content, how do you think we can make them better? Thanks for watching!

Watch on YouTube ↗ (saves to browser)

Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from AI For Beginners · AI For Beginners · 10 of 32

← Previous Next →

Artificial Intelligence Explained In Simple Words | What Is AI? | Explained On A Real World Example!

Artificial Intelligence Explained In Simple Words | What Is AI? | Explained On A Real World Example!

AI For Beginners

AI vs. ML vs. DL vs. DS - Difference Explained | On Real World Examples | AI For Beginners

AI vs. ML vs. DL vs. DS - Difference Explained | On Real World Examples | AI For Beginners

AI For Beginners

Types Of Machine Learning Algorithms | Explained On Real World Examples | ML For Beginners

Types Of Machine Learning Algorithms | Explained On Real World Examples | ML For Beginners

AI For Beginners

Best AI Music Generator | Music Generation Tool for FREE | MusicGen developed by Meta AI

Best AI Music Generator | Music Generation Tool for FREE | MusicGen developed by Meta AI

AI For Beginners

The Ultimate Guide To Supervised Learning | Explained On Binary Classification Example | Part 1

The Ultimate Guide To Supervised Learning | Explained On Binary Classification Example | Part 1

AI For Beginners

The Ultimate Guide To Supervised Learning | Classification And Regression | Part 2

The Ultimate Guide To Supervised Learning | Classification And Regression | Part 2

AI For Beginners

Linear Regression Explained | A Beginner's Guide To Regression | The Basics You Need to Know!

Linear Regression Explained | A Beginner's Guide To Regression | The Basics You Need to Know!

AI For Beginners

Assumptions Of Linear Regression | What To Do If The Assumptions Do Not Hold? | Part 1

Assumptions Of Linear Regression | What To Do If The Assumptions Do Not Hold? | Part 1

AI For Beginners

Checking The Assumptions Of Linear Regression | Statistical And Visual Methods | Part 2

Checking The Assumptions Of Linear Regression | Statistical And Visual Methods | Part 2

AI For Beginners

The Purpose of Train-Test Split in Machine Learning | How to Correctly Split Data?

The Purpose of Train-Test Split in Machine Learning | How to Correctly Split Data?

AI For Beginners

The Role of Validation Sets in Model Training | Train-Test-Validation Splits | Clearly explained!

The Role of Validation Sets in Model Training | Train-Test-Validation Splits | Clearly explained!

AI For Beginners

Overfitting and Underfitting | Bias and Variance Tradeoff in Machine Learning | Clearly Explained!

Overfitting and Underfitting | Bias and Variance Tradeoff in Machine Learning | Clearly Explained!

AI For Beginners

Gradient Descent Explained | How Do ML and DL Models Learn? | Simple Explanation!

Gradient Descent Explained | How Do ML and DL Models Learn? | Simple Explanation!

AI For Beginners

Main Types of Gradient Descent | Batch, Stochastic and Mini-Batch Explained! | Which One to Choose?

Main Types of Gradient Descent | Batch, Stochastic and Mini-Batch Explained! | Which One to Choose?

AI For Beginners

The Role of Loss Functions | Most Common Loss Functions in Machine Learning | Explained!

The Role of Loss Functions | Most Common Loss Functions in Machine Learning | Explained!

AI For Beginners

How to Evaluate Your ML Models Effectively? | Evaluation Metrics in Machine Learning!

How to Evaluate Your ML Models Effectively? | Evaluation Metrics in Machine Learning!

AI For Beginners

8 Best Tips For Cleaning Your Data | Data Cleaning | Machine Learning, Data Preparation.

8 Best Tips For Cleaning Your Data | Data Cleaning | Machine Learning, Data Preparation.

AI For Beginners

Numerical vs. Categorical Data | Represent Your Dataset Correctly!

Numerical vs. Categorical Data | Represent Your Dataset Correctly!

AI For Beginners

3 Main Types of Missing Data | Do THIS Before Handling Missing Values!

3 Main Types of Missing Data | Do THIS Before Handling Missing Values!

AI For Beginners

7 PROVEN Strategies To Become An AI Engineer (2025 Updated)

7 PROVEN Strategies To Become An AI Engineer (2025 Updated)

AI For Beginners

Easiest Guide to K-Fold Cross Validation | Explained in 2 Minutes!

Easiest Guide to K-Fold Cross Validation | Explained in 2 Minutes!

AI For Beginners

Normalization and Standardization | Why to Scale the Features? | ML Basics

Normalization and Standardization | Why to Scale the Features? | ML Basics

AI For Beginners

The Ultimate Guide to Hyperparameter Tuning | Grid Search vs. Randomized Search

The Ultimate Guide to Hyperparameter Tuning | Grid Search vs. Randomized Search

AI For Beginners

How is Artificial Intelligence different from Traditional Programming?

How is Artificial Intelligence different from Traditional Programming?

AI For Beginners

All Machine Learning Models Clearly Explained!

All Machine Learning Models Clearly Explained!

AI For Beginners

6 Mistakes to Avoid When Learning Machine Learning in 2025

6 Mistakes to Avoid When Learning Machine Learning in 2025

AI For Beginners

Best Practices for Effective Data Visualization In Machine Learning!

Best Practices for Effective Data Visualization In Machine Learning!

AI For Beginners

Central Limit Theorem Intuition Explained Like You're 5!

Central Limit Theorem Intuition Explained Like You're 5!

AI For Beginners

Which Door Would You Choose? | Monty Hall Problem Explained!

Which Door Would You Choose? | Monty Hall Problem Explained!

AI For Beginners

All Machine Learning Concepts Explained in 18 Minutes!

All Machine Learning Concepts Explained in 18 Minutes!

AI For Beginners

What’s the Probability That Two Randomly Drawn Chords in a Circle Intersect?

What’s the Probability That Two Randomly Drawn Chords in a Circle Intersect?

AI For Beginners

Causation vs Correlation | The Most Confused Concept in Data Science

Causation vs Correlation | The Most Confused Concept in Data Science

AI For Beginners

The train-test split is a crucial step in machine learning that involves dividing a dataset into training and testing sets to evaluate a model's performance. This video explains the importance of splitting data and how to do it correctly.

Key Takeaways

Collect a dataset of features and corresponding prices
Split the data into train and test sets
Configure hyperparameters using a validation set
Train a model on the training data
Evaluate the model's performance on the test data

💡 The train-test split should be done at the beginning of any analysis to prevent leaking information from the test set to the model.

🔒 Pro feature: Ask AI to explain this lesson →

More on: ML Pipelines

View skill →

Building a Dog Breed Identifier App from scratch - DogNet

Building a Dog Breed Identifier App from scratch - DogNet

Aladdin Persson

Complete Dockers For Data Science Tutorial In One Shot

Complete Dockers For Data Science Tutorial In One Shot

Part 6 | Deploy ML Model on Kubernetes | Auto-Scaling with HPA and Monitoring with Prometheus

Part 6 | Deploy ML Model on Kubernetes | Auto-Scaling with HPA and Monitoring with Prometheus

Abonia Sojasingarayar

Vertex Pipelines: Qwik Start

Coding a Multimodal (Vision) Language Model from scratch in PyTorch with full explanation

Coding a Multimodal (Vision) Language Model from scratch in PyTorch with full explanation

Automate R scripts with GitHub Actions: Deploy a model

Related Reads

CLM Series/Chapter-7: Prediction Without Experience Is Blind

Learn how prediction without experience can be limited in data science and how to improve predictive models with experience and data

Medium · Data Science

Analisis Komparasi Performa Arsitektur AlexNet dan VGG-16 pada Klasifikasi Dataset CIFAR-10…

Compare the performance of AlexNet and VGG-16 architectures on the CIFAR-10 dataset using CPU, and learn how to implement and evaluate these models

Medium · Data Science

Analisis Komparasi Performa Arsitektur AlexNet dan VGG-16 pada Klasifikasi Dataset CIFAR-10…

Compare the performance of AlexNet and VGG-16 architectures on CIFAR-10 dataset using CPU, and learn how to implement them in Python

Medium · Python

Converting Python Code to JSON

Learn to convert Python code to JSON format and vice versa to enhance data interchange and storage capabilities

Medium · Python

Chapters (10)

Introduction.

0:07 What if you use all data for training?

0:35 Why data splits are used?

0:41 The ratio of test/train split.

0:47 The intuition for splitting the data.

1:05 Randomly splitting.

1:16 Little data can be problematic!

1:35 At what stage to do the train-test split?

1:55 What about the validation set?

2:07 Subscribe to us!

1. Overview of Artificial Intelligence | What is AI? Fundamental Concepts & Complete History of AI

Professor Rahul Jain