As a statistician, I see huge issues with the way science is done in the era of big data

Kai Zhang:

What is causing this big problem? There are many contributing factors. As a statistician, I see huge issues with the way science is done in the era of big data. The reproducibility crisis is driven in part by invalid statistical analyses that are from data-driven hypotheses – the opposite of how things are traditionally done.

Scientific method

In a classical experiment, the statistician and scientist first together frame a hypothesis. Then scientists conduct experiments to collect data, which are subsequently analyzed by statisticians.

A famous example of this process is the “lady tasting tea” story. Back in the 1920s, at a party of academics, a woman claimed to be able to tell the difference in flavor if the tea or milk was added first in a cup. Statistician Ronald Fisher doubted that she had any such talent. He hypothesized that, out of eight cups of tea, prepared such that four cups had milk added first and the other four cups had tea added first, the number of correct guesses would follow a probability model called the hypergeometric distribution.

Such an experiment was done with eight cups of tea sent to the lady in a random order – and, according to legend, she categorized all eight correctly. This was strong evidence against Fisher’s hypothesis. The chances that the lady had achieved all correct answers through random guessing was an extremely low 1.4 percent.

That process – hypothesize, then gather data, then analyze – is rare in the big data era. Today’s technology can collect huge amounts of data, on the order of 2.5 exabytes a day.