Splitting Datasets into Training and Test Sets in Machine Learning
Introduction
Splitting a dataset into training and test sets is an important step when developing and evaluating models in Python machine learning and deep learning. A correct split lets us evaluate the model's performance accurately. This article explains how to split datasets for training and testing in Python machine learning, with clear examples and step-by-step instructions.
Suppose we have 100,000 samples and we feed all 100,000 of them into a machine learning model for training. Where, then, do we test the model? How do we know how well the model performs on data it has never seen, and how do we solve this problem?
In this case, dividing the dataset into a training set and a test set lets us evaluate how well the model predicts data it has never seen before, which is an essential tool for measuring the model's real effectiveness.
In this case, we divide the 100,000-sample dataset into two parts:
Training set: used to train the machine learning model so that it can learn and adjust its parameters to fit the data. This part holds most of the data, for example 70% or 80%.
Testing set: used to test the model we have trained, in order to evaluate how well it predicts data it has never seen before. This part holds the smaller share, for example 20% or 30% (see the code sketch after this list).
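For illustration, here is a minimal sketch of an 80/20 split using scikit-learn's train_test_split; the arrays X and y are hypothetical placeholders standing in for the 100,000-sample dataset:

```python
from sklearn.model_selection import train_test_split
import numpy as np

# Hypothetical data: 100,000 samples with 10 features each.
X = np.random.rand(100_000, 10)
y = np.random.randint(0, 2, size=100_000)

# Hold out 20% for testing; the remaining 80% is used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape)  # (80000, 10)
print(X_test.shape)   # (20000, 10)
```

Fixing random_state makes the shuffle, and therefore the split, reproducible between runs.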
When we use these two partitions to train, test, and tune hyperparameters until we have selected the most accurate model, we hope to end up with a model that can handle data it has never seen before. But here we run into a new problem.
The problem is that evaluating against the test set many times biases our model selection toward that test set: it is as if the chosen model has memorized the test set. This brings back the same generalization problem, because we can no longer guarantee that our model can handle data it has never seen before.
Therefore, we should divide the data into three parts instead: a training set, a validation set, and a test set. For example, 70,000 samples for the training set, 15,000 for the validation set, and 15,000 for the test set.
1. The training set is used to train the model; it is the data the model learns from.
2. The validation set is used to check metrics after each update, to see how the model is doing and which model performs best once training is complete.
3. The test set is used, after the best model has been chosen, to measure how the model performs on data it has never seen before.
We use these three parts to train, validate, tune hyperparameters, and so on, selecting the model that performs best on the validation set. Finally, we evaluate it on the test set, which represents real-world data, to see how the model will perform in practice once we release the system, as sketched below.
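Here is a minimal sketch of the 70/15/15 split described above, again with hypothetical data; since train_test_split only produces two parts at a time, we call it twice:

```python
from sklearn.model_selection import train_test_split
import numpy as np

# Hypothetical data: 100,000 samples.
X = np.random.rand(100_000, 10)
y = np.random.randint(0, 2, size=100_000)

# First split: hold out 15,000 samples as the final test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=15_000, random_state=42
)

# Second split: hold out another 15,000 from the remaining 85,000
# as the validation set; the 70,000 left over form the training set.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=15_000, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 70000 15000 15000
```

Only the training set is used for fitting; the validation set guides model selection, and the test set is touched once at the very end.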
The data we use should be similar to the data the model will meet in real life. Otherwise we face a distribution mismatch, where the model is trained on one kind of data but meets another kind in actual use: for example, the training data is French but the data encountered in production is Asian, and so on.
Before splitting, the data should be shuffled, so that all of the subsets have a similar data distribution and none of them is skewed.
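As a sketch of this point: train_test_split shuffles by default, and for classification problems the stratify argument (an extra option not mentioned above, included here as an assumption) keeps the class proportions similar in every subset:

```python
from sklearn.model_selection import train_test_split
import numpy as np

# Hypothetical binary classification data.
X = np.random.rand(1_000, 5)
y = np.random.randint(0, 2, size=1_000)

# shuffle=True is the default; stratify=y makes the class balance
# in the train and test subsets mirror the balance in y.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=0
)

print(y_train.mean(), y_test.mean())  # similar class proportions
```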
In fact, there are many ways to split data, and one popular method is cross-validation, which we will describe below.
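As a small preview, here is a minimal sketch of 5-fold cross-validation using scikit-learn's cross_val_score; the LogisticRegression model and the data are hypothetical choices for illustration:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import numpy as np

# Hypothetical data and model.
X = np.random.rand(1_000, 5)
y = np.random.randint(0, 2, size=1_000)
model = LogisticRegression()

# The data is split into 5 folds; each fold takes one turn as the
# validation set while the model is trained on the other 4 folds.
scores = cross_val_score(model, X, y, cv=5)
print(scores)         # one score per fold
print(scores.mean())  # average score across folds
```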
Thank you for reading ❤. If there is any mistake in this article, the author apologizes in advance.