When small data beats big data
Is more data always better? No, not always. Nicole Augustin and I have published a paper entitled When Small Data beats Big Data. The main points are:
- Quality beats quantity. A high-quality small dataset is often more informative than a larger but biased dataset. Performance is a tradeoff between bias and variance: as the sample size increases, the variance decreases but the bias remains. You don’t need a huge amount of data to achieve an acceptably small variance, so a small dataset with no bias, obtained by careful sampling or a controlled experiment, will beat the big garbage dump of a dataset. David beats Goliath with a well-aimed shot.
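The bias-variance point can be sketched with a minimal simulation. All the numbers here (a true mean of 10, noise SD of 2, a bias of 1 in the big sample) are made-up illustrative values, not anything from the paper:

```python
import random

random.seed(1)

TRUE_MEAN = 10.0  # the quantity we are trying to estimate
SD = 2.0          # individual-level noise

def mse_of_mean(n, bias, reps=200):
    """Mean squared error of the sample mean over repeated samples."""
    total = 0.0
    for _ in range(reps):
        sample = [random.gauss(TRUE_MEAN + bias, SD) for _ in range(n)]
        estimate = sum(sample) / n
        total += (estimate - TRUE_MEAN) ** 2
    return total / reps

# Small, carefully sampled (unbiased) dataset: MSE is about SD^2/n = 0.04.
small_unbiased = mse_of_mean(n=100, bias=0.0)
# Much bigger but biased dataset: the variance all but vanishes,
# yet the squared bias (1.0) remains, so the MSE is far worse.
big_biased = mse_of_mean(n=10_000, bias=1.0)

print(f"MSE, n=100, no bias:    {small_unbiased:.3f}")
print(f"MSE, n=10000, bias=1:   {big_biased:.3f}")
```

The 100-point unbiased sample wins by a wide margin: past a modest sample size, shrinking the variance further cannot compensate for a bias that never goes away.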
- Cost. There is no free lunch. More data costs money - what did you think those power studies were for? But it’s not just the acquisition cost of data. Some procedures are computationally expensive, and the cost increases faster than linearly with data size. If you need your results now, you might do better with less data. Other costs of data are not financial. People value privacy, so we should assign a cost to invading it - and avoid using more data than necessary.
- Inference works better on small data. Statisticians have spent years developing methods for inference on relatively small datasets. Unfortunately, most of these methods don’t work well with big data because the inference becomes unbelievably sharp. The reason is that most statistical methods do not allow for model uncertainty or unknown sampling biases. Machine learners do no better, as their methods often fail to tackle the uncertainty problem at all. Until we learn how to express uncertainty in big data models, we might be better off sticking with small data.
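Here is what "unbelievably sharp" looks like in a sketch. Suppose there is an unmodelled sampling bias (the bias of 0.5, true mean of 10 and SD of 2 below are invented for illustration). A standard confidence interval narrows like 1/sqrt(n), so with big data it confidently excludes the truth:

```python
import math
import random

random.seed(2)

TRUE_MEAN = 10.0
BIAS = 0.5   # unmodelled sampling bias, ignored by the interval below
SD = 2.0

def ci_for_mean(n):
    """Naive 95% confidence interval for the mean from one biased sample."""
    sample = [random.gauss(TRUE_MEAN + BIAS, SD) for _ in range(n)]
    mean = sum(sample) / n
    half = 1.96 * SD / math.sqrt(n)  # shrinks with n; the bias does not
    return mean - half, mean + half

for n in (100, 10_000, 1_000_000):
    lo, hi = ci_for_mean(n)
    print(f"n={n:>9}: CI = ({lo:.3f}, {hi:.3f}), "
          f"covers true mean: {lo <= TRUE_MEAN <= hi}")
```

At n = 1,000,000 the interval is a few thousandths wide, centred near 10.5, and misses the true value of 10 entirely: precisely wrong rather than roughly right.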
- Aggregation. Sometimes we have the option of reducing a big individual-level dataset to a smaller grouped dataset. Information may be lost in the aggregation, but it can still be beneficial: it reduces variation, allows simpler models and eases privacy concerns.
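A small sketch of the idea, using hypothetical individual-level income records (the regions and figures are invented):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical individual-level records: (region, income) pairs.
individuals = [
    ("north", 31_000), ("north", 29_500), ("north", 33_200),
    ("south", 42_100), ("south", 38_700),
    ("west",  27_800), ("west",  30_400), ("west",  28_900),
]

# Aggregate to one row per region: the dataset is smaller, the group
# means vary less than the individuals, and no single person's income
# appears in the result.
by_region = defaultdict(list)
for region, income in individuals:
    by_region[region].append(income)

aggregated = {region: (len(vals), mean(vals)) for region, vals in by_region.items()}
for region, (n, avg) in sorted(aggregated.items()):
    print(f"{region}: n={n}, mean income = {avg:.0f}")
```

We lose the within-region detail, but the grouped table may be all a regional-level model needs, and it is far less sensitive to release.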
- Teaching. Students now need to learn both big and small data methods. But where to start? Small data is easier to work with: it’s much simpler for students to understand both the principles and the details of the computation without the technical overhead of big data. Students will need to learn big data methods sooner rather than later, but it’s best to come to them with a good understanding of the ideas of uncertainty.