Jan 8, 2018 3:41:56 PM
The Data Scientific Method

The Oxford English Dictionary defines the scientific method as "a method or procedure that has characterized natural science since the 17th century, consisting in systematic observation, measurement, and experiment, and the formulation, testing, and modification of hypotheses." With more scientists today than ever, the scientific method is alive and well, and generating more data than ever. This explosion of data has brought about the field of data science and an associated plethora of analytics tools. Controversially, some have claimed, as in this Wired magazine article, that data science is so powerful that it has made the scientific method obsolete. Google's founding philosophy is that “we don't know why this page is better than that one. If the statistics of incoming links say it is, that's good enough.” The implication is that with enough data, people will no longer need to know why something happens; it just does, and that’s good enough. Is it, really?

As a scientist by training, working at a software company focused on data, I’ve got money on both horses, and my answer is no. Actually, more like NO! There is still a big difference between correlation and causation and it is our job, as scientists, to pursue causation – how and why things happen. “Show me the relationships and correlations between these data” is not the same as “What drug(s), in what doses, will cure this disease in this patient.”

Take this talk about Thiopurine Therapy as an example. When I first saw this presentation, it blew my mind. Here was a therapy that was toxic for some people, and we didn't know why. The result was that the entire class of Thiopurine therapies was unsafe and couldn't be used, even though some people seemed to respond very, very well. Data science revealed correlations between toxicity and genetic mutation that allowed researchers to understand causation: people who had the mutations weren't producing enough of the TPMT enzyme to move Thiopurine-based drugs through their cells efficiently. The answer was to classify patients based on their genetic capability to produce TPMT, and to adjust their dosage accordingly. It's not enough to find correlations. A correlation might be an "answer," but it raises new questions. What is causing that? And then, once we understand the cause, what do we actually do?

This is a lot like the data-information-knowledge spectrum. The data show us that there is a correlation between toxicity and mutation affecting TPMT. The information comes from understanding that TPMT moves certain molecules through the cells, and without it the resulting molecule buildup is toxic. The knowledge lies in figuring out that we can adjust the dosage based on that information to achieve a beneficial effect for everyone without any side effects.

Of course, rigorously and repeatedly analyzing relationships and correlations between multiple different data sets may well eventually yield the answer, but the scientific method is critical. Steve Miller recently explored the difficulty of teasing out the difference between predictive analytics and data science, and found that the key differentiator seems to be that experimentation (human intervention) is required in the tasks we call data science. Science needs data science in order to find answers, but data science needs science to have the impact the world expects from Big Data.

Enter the data scientific method. You start with an idea or a question, you do some experimentation and analysis and hopefully this gives you the answer you were looking for. This usually means that you start with a lot of data, such as experimental observations or initially unconnected data sources, and you gradually narrow your focus as you conduct your analysis. Idea – Test – Conclusion. Unfortunately, as most scientists will attest, the end point is rarely 'the answer', but is at once both a conclusion and a new question. The new question inevitably requires your focus to broaden out again to include new data sources and additional analytical techniques, until focusing in on the next conclusion. Rinse and Repeat. This cognitive narrowing and broadening repeats itself, like an hourglass (a.k.a. egg-timer) on its side until 'the answer' is hopefully finally reached:

[Figure: the cognitive hourglass of the data scientific method]

And at each stage new data, analysis, and visualization needs arise:

[Figure: the cognitive hourglass of the data scientific method, with the data, analysis, and visualization needs at each stage]

To work together well, scientists and data scientists moving through this cognitive hourglass need a data science platform that can rapidly reconfigure to assemble, analyze and visualize data of every conceivable type as new questions and hypotheses inevitably arise. Ideally, the cognitive hourglass would be a real-time process. Otherwise, in a world where time is funding, IT delays caused by redefining and assembling new datasets, and sometimes building new data science software, can stop both the science and the data science in their tracks, keeping them from producing the answers we seek.

It is widely acknowledged in data science that knowing the question you want to answer is paramount but often elusive. If you are confident of your question and that the answer lies in the data and analyses you have assembled, then you’re good. If the answer is another question, however, you need to be able to ‘pivot’ or redefine your assembled data and methodology, following your cognitive hourglass in real time (without having to resubmit as a new project).