It’s really entertaining to watch Twitter’s (over)reaction to the controversial editorial in NEJM about data sharing and open science. As usual, it’s pretty hysterical, but it has the potential to cause some real-world consequences. The problem is that the authors were reckless enough to use the term “research parasites” for those scientists who use data from other labs without conducting their own experiments.
In fact, it’s only one paragraph in the editorial that sent the shock wave across Twitter:
A second concern held by some is that a new class of research person will emerge − people who had nothing to do with the design and execution of the study but use another group’s data for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited. There is concern among some front-line researchers that the system will be taken over by what some researchers have characterized as “research parasites.”
Several questions popped up in my head while reading this. Who are those front-line researchers who are concerned about ‘research parasites’? The paragraph above paints a picture of Ebenezer Scrooge sitting on a pile of data and refusing to share it with those who don’t do experiments of their own. And those who conduct experiments should have enough data anyway. Seriously, is anyone really concerned that data scientists will take over the ‘system’ and ‘steal the productivity of data gatherers’? And is questioning a study’s conclusions a bad thing?
If you don’t consider the context, the authors seem to be writing about quite a large share of scientists. Especially now, in the age of, you know, Big Data. But the editorial specifically discusses clinical trial data, and its main focus is actually a successful example of data sharing. If you simply deleted this paragraph, the text would lose absolutely nothing but bad publicity! But it’s there, and too many people have already seen it.
There are well-grounded concerns about sharing data, and the authors list some of them (for a more elaborate discussion I’d suggest the recent piece in Nature). A recent example of a highly influential controversy arising from data analysis is the debate on the major cause of cancer. One study found a correlation between the number of stem cell divisions in a tissue and the lifetime risk of getting a particular cancer, and its authors concluded that bad luck is the major contributor to carcinogenesis. Another study questioned the data analysis method with a thought experiment and, after doing more data analysis, concluded the opposite.
So, going back to the editorial, yes, there are examples of improper use or analysis of data. But that’s because data science is very complex; it’s a whole science of its own. Conducting an experiment, gathering the data, and analyzing them all influence each other. Just like bad experimental design, bad data analysis will screw up the whole process. Because of that mutual influence, data analysis done out of context may often be controversial or even irrelevant. And that’s why experiment-based evidence will always be valued more highly than any conclusions from analysis or meta-analysis of public data. But again, that evidence has to be properly processed and analyzed; do that, and you’ll be safe.