- Logistic regression
- Decision tree
- Neural network
- Or many others
The validation dataset is not used during training. Both the training and validation datasets have a feature matrix and a target vector y. The model is fitted on the training data and then used to predict the target values for the validation feature matrix. Finally, the predicted values (probabilities) are compared with the actual y values of the validation set.
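A minimal sketch of this fit-predict-compare step, assuming a binary target and scikit-learn; the small arrays below are placeholders for your own training and validation data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical data: a few examples with 2 features and a binary label.
X_train = np.array([[0.1, 1.0], [0.4, 0.2], [0.9, 0.8], [0.3, 0.5]])
y_train = np.array([0, 0, 1, 1])
X_val = np.array([[0.2, 0.9], [0.8, 0.7]])
y_val = np.array([0, 1])

# Fit the model on the training data only.
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict probabilities for the validation feature matrix...
y_pred = model.predict_proba(X_val)[:, 1]

# ...and compare them with the actual validation targets.
print("validation AUC:", roc_auc_score(y_val, y_pred))
```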
Multiple comparisons problem (MCP): because all of these models are probabilistic, one model can look good on the validation data simply by chance.
The test set helps to guard against the MCP. The best model is selected using the training and validation datasets, while the test dataset is used to confirm that the proposed best model really is the best, i.e. that its good validation performance was not just luck.
- Split the dataset into training, validation, and test sets, e.g. 60%, 20%, and 20% respectively
- Train the models
- Evaluate the models
- Select the best model
- Apply the best model to the test dataset
- Compare the performance metrics on the validation and test sets (see the code sketch after this list)
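One possible implementation of these steps, assuming scikit-learn; the dataset here is synthetic, the two candidate models are just examples, and the 60/20/20 split is made with two calls to `train_test_split`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1000, random_state=1)

# Step 1: split into 60% train, 20% validation, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=1),
}

# Steps 2-3: train each candidate model and evaluate it on the validation set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict_proba(X_val)[:, 1]
    scores[name] = roc_auc_score(y_val, y_pred)
    print(f"{name}: validation AUC = {scores[name]:.3f}")

# Step 4: select the best model by validation score.
best_name = max(scores, key=scores.get)
best_model = models[best_name]

# Steps 5-6: apply the best model to the test set and compare the metrics.
y_test_pred = best_model.predict_proba(X_test)[:, 1]
print(f"best model: {best_name}")
print(f"test AUC = {roc_auc_score(y_test, y_test_pred):.3f} "
      f"vs validation AUC = {scores[best_name]:.3f}")
```

If the test metric is close to the validation metric, the chosen model's performance is likely genuine rather than a lucky draw on the validation set.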
Note that it is possible to reuse the validation data: after selecting the best model (step 4), the training and validation datasets can be combined into a single training dataset for the chosen model before evaluating it on the test set.
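A small sketch of this reuse step, continuing from the variables in the sketch above (the combined arrays and the refit are the only new parts):

```python
import numpy as np

# Combine the training and validation data into one training set.
X_full_train = np.concatenate([X_train, X_val])
y_full_train = np.concatenate([y_train, y_val])

# Refit the chosen model on the combined data, then evaluate it once on the test set.
best_model.fit(X_full_train, y_full_train)
y_test_pred = best_model.predict_proba(X_test)[:, 1]
```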
The notes are written by the community. If you see an error here, please create a PR with a fix.