A more moderate view of Excel
When I heard that the UK government had lost 16,000 covid-19 cases, meaning that the contacts of infected persons were not traced and informed, I got angry. The problem was attributed to an Excel “glitch”, so I posted a rather caustic tweet about how Excel should not be used for data analysis. Twitter is a quickfire medium and not the place for nuance. Here’s my more moderate view on the matter:
In a long career, I’ve spent a lot of time cleaning data delivered to me in Excel format. Although it is possible to maintain data cleanly in Excel, this is often not the case. It often takes a lot of time and frustration to restore the data into a form suitable for statistical analysis. Although one could blame the producers of the data, Excel gives them more than enough freedom to mess things up. In particular, there’s no clear boundary in Excel between data and analysis - it’s all mixed together in a single spreadsheet. It’s this experience that makes me angry when I hear about Excel data “glitches”. Please use almost any other format.
Excel is installed on a large proportion of computers and a very large number of people have some experience of using it. It is very versatile software. It’s like a Swiss Army knife, performing a wide range of tasks, often in an adequate manner, without excelling (hah!) at any particular task. It’s been so successful because it is so useful. Unfortunately, people sometimes fail to recognise when Excel is not the right tool. They are familiar with Excel and are either unaware of a more appropriate software tool or unwilling to learn one.
My tweet about the missing covid cases was inaccurate. I said that Excel is not good for (statistical) data analysis when I should have said that Excel is not good for database management. Both are true, but in this instance the error was caused by using Excel in the data handling pipeline. This was a database problem, not a statistics problem.
It’s all too easy to point the finger at people making errors (how could you be so stupid? etc.). But all of us make errors because we are human. Our defence against these errors is to use systems that are designed to catch them. In the missing covid cases example, Excel could have been used successfully, and some people might say that the users and not Excel were at fault. But leaving it to the user to get everything right is the principal weakness of Excel in comparison to purpose-built statistical software or database management software. These latter software systems make it much easier to build in the error checking, the auditing and the reproducibility that are essential for the minimisation of human error.
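To make "build in the error checking" a little more concrete, here is a minimal Python sketch of one such check: counting records before and after a transfer step and refusing to continue if any have gone missing. It is only an illustration under assumed conditions, not a description of the actual pipeline involved. The names `checked_transfer`, `truncating_step` and `RowCountMismatch`, and the sample data, are all hypothetical.

```python
import csv
import io


class RowCountMismatch(Exception):
    """Raised when a pipeline step drops or duplicates records."""


def count_csv_rows(text):
    """Count data rows in CSV text, excluding the header row."""
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # skip the header row
    return sum(1 for _ in reader)


def checked_transfer(source_text, transfer):
    """Run a transfer step and fail loudly if the row count changes."""
    expected = count_csv_rows(source_text)
    result_text = transfer(source_text)
    actual = count_csv_rows(result_text)
    if actual != expected:
        raise RowCountMismatch(f"expected {expected} rows, got {actual}")
    return result_text


def truncating_step(text, max_rows=3):
    """Hypothetical faulty step: silently keeps only the first max_rows records."""
    lines = text.splitlines(keepends=True)
    return "".join(lines[: max_rows + 1])  # header plus max_rows data rows


# Made-up sample data: a header and five records.
source = "id,result\n1,pos\n2,neg\n3,pos\n4,neg\n5,pos\n"

# A step that preserves every record passes the check.
checked_transfer(source, lambda t: t)

# A step that silently drops records is caught instead of going unnoticed.
try:
    checked_transfer(source, truncating_step)
except RowCountMismatch as err:
    print("Transfer rejected:", err)
```

The point of the design is that the check lives in the pipeline itself and runs every time, rather than depending on someone remembering to compare totals by eye.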