Is data splitting good for you?

Suppose you have some data and you want to build a model to predict future observations. Data splitting means dividing your data into two, not necessarily equal, parts. One part is used to build the model and the other part to evaluate it in some way. There are two related, but distinct, reasons why people do this. It’s well known that if you use the same data both to build the model and to test its performance, you’ll be overoptimistic about how well your model will do in predicting new data. Analysts have various tricks to avoid overconfidence, such as cross-validation, but these are not perfect. Furthermore, if you need to prove to someone else how well you’ve done, they will be sceptical of such tricks. The gold standard is to reserve part of your data as a test set and to select and fit your model on the remaining part, the training set. You use this model to predict the observations in the test set. Because the test set has been held back, it’s (almost) like having fresh data. The performance on this test set will be a realistic assessment of how the model will perform on future data. But you lose something in the data splitting: the training data is smaller than the full data, so the model you select and fit won’t be as good. If you don’t need to prove how good your model is, this form of data splitting is a bad idea. It’s as if a customer orders a new car and asks that the seller drive it around for 10,000 miles to prove there’s nothing wrong with it. The customer will receive the assurance that the car is not a lemon, but it won’t be as good as getting a brand new car in perfect condition.
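To make the mechanics concrete, here is a minimal sketch of the train/test version of data splitting in Python with scikit-learn. The simulated data, the linear model and the 70/30 split fraction are illustrative choices of mine, not anything prescribed by the method.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Simulated data standing in for the real problem (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

# Hold back a test set; the model never sees it while being fit.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

model = LinearRegression().fit(X_train, y_train)

# Test-set error is a realistic estimate of performance on future
# data; training-set error is typically overoptimistic.
print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("test  MSE:", mean_squared_error(y_test, model.predict(X_test)))
```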

By the way, if you find your model doesn’t do as well as you’d hoped on the test set, you might be tempted to go back and change the model. But the performance of this new model cannot be judged cleanly with the test set, because you’ve borrowed some information from the test set to help build the model. You only get one shot at using the test set.

There’s a second reason why you might use data splitting. The typical model building process involves choosing from a wide range of potential models. Usually there is substantial model uncertainty about the choice, but you take your best pick. You then use the data to estimate the parameters of your chosen model. Statistical methods are good at assessing the parametric uncertainty in your estimates, but they don’t reflect the model uncertainty at all. This is a big reason why statistical models tend to be overconfident about future predictions. That’s where data splitting can help. You use the first part of your data to choose the model and the second part to estimate the parameters of your chosen model. This results in more realistic estimates of the uncertainty in predictions. Again you pay a price for the data splitting: you use less data to choose the model and less data to fit it. Your point predictions will not be as good as if you used the full data for everything, but your statements of the uncertainty in your predictions will be better (and probably wider).
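Here is a sketch of this second use of splitting, again on simulated data. The first part of the data is used only to choose among candidate predictor subsets (by AIC, one possible criterion), and the second part is used to fit the chosen model and produce prediction intervals. The particular model-choice criterion and the statsmodels-based fitting are my own illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm
from itertools import combinations

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=n)

half = n // 2
X1, y1 = X[:half], y[:half]   # part 1: used to choose the model
X2, y2 = X[half:], y[half:]   # part 2: used to estimate parameters

# Choose a subset of predictors on part 1 only (here by AIC).
best_aic, best_subset = np.inf, None
for k in range(1, 5):
    for subset in combinations(range(4), k):
        fit = sm.OLS(y1, sm.add_constant(X1[:, subset])).fit()
        if fit.aic < best_aic:
            best_aic, best_subset = fit.aic, subset

# Fit the chosen model on part 2. The inference here does not reuse
# the data that drove the model choice, so the reported intervals
# better reflect the real uncertainty.
final = sm.OLS(y2, sm.add_constant(X2[:, best_subset])).fit()
xnew = sm.add_constant(X[:1, best_subset], has_constant="add")
print(final.get_prediction(xnew).summary_frame(alpha=0.05))
```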

So judging the value of data splitting in this context means we have to attach some value to uncertainty assessment as well as to point prediction. Scoring rules, which reward both accurate point predictions and honest statements of uncertainty, are a good way to do this. All this and more is discussed in my paper Does data splitting improve prediction?, also available on arXiv. Although it depends on the circumstances, I show that this form of data splitting can improve prediction.
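For a flavour of what scoring means in practice, here is a small sketch using the logarithmic score, one common proper scoring rule, under the assumption of Gaussian predictive distributions. These are my illustrative choices; see the paper for the scoring approach actually used there.

```python
import numpy as np
from scipy.stats import norm

def log_score(y_obs, mean, sd):
    """Average log density of the observations under the Gaussian
    predictive distributions; higher is better."""
    return np.mean(norm.logpdf(y_obs, loc=mean, scale=sd))

y_obs = np.array([1.1, -0.3, 2.2])   # realised future observations
mean = np.array([1.0, 0.0, 2.0])     # predictive means

# An overconfident forecaster (too-narrow predictive distributions)
# is penalised relative to one whose stated uncertainty is realistic.
print("realistic sd = 0.3     :", log_score(y_obs, mean, 0.3))
print("overconfident sd = 0.05:", log_score(y_obs, mean, 0.05))
```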

Julian Faraway
Professor of Statistics at the University of Bath