Ever since ARPANET, the network that became the basis for the Internet and, subsequently, the Web, a revolution in science has seemed imminent. While the number of academic papers disseminated via the Web continues to grow, it is safe to say that Web-based advances in science have not matched the revolutions seen in other areas, such as music and mainstream publishing. That isn’t to say there haven’t been great steps forward and pockets of success. These include the ability to map and query genomes following the success of the Human Genome Project and, perhaps most notably, the role of computers in the 2013 Nobel Prize in Chemistry, awarded for the development of multiscale models for complex chemical systems. However, as the complexity of research increases, the need for larger volumes of data and more processing power to analyze that data has never been greater.
Two recent developments suggest we might be a lot closer to a much more powerful computational science world. On February 22, 2013, the White House responded to a petition on the “We the People” platform that had amassed the required number of signatures. The petition urged the US government to “require free access over the Internet to scientific journal articles arising from taxpayer-funded research.” Noting a trend in global policies and the social and economic benefits of open government data, the response went one step further: “The Obama Administration agrees that citizens deserve easy access to the results of research their tax dollars have paid for … In addition to addressing the issue of public access to scientific publications, the memorandum requires that agencies start to address the need to improve upon the management and sharing of scientific data produced with Federal funding.”
This fundamental shift has the potential to transform the efficiency of US research, particularly in drug discovery and drug effectiveness. Now the funding agencies have begun to act and, even more significantly, the biggest funding bodies will be announcing their changes in October 2015. As China continues to publish more research of increasing quality, will this move by the Obama administration prove decisively forward-thinking in keeping the US at the cutting edge of research, reaping the educational and financial benefits that go with it?
The second development relates to what to do with all of this data. Could these government pushes, particularly in the Western world, build a global, queryable scientific cloud? Supercomputers are at the heart of a huge number of important scientific and defense research projects. Obama recently signed an executive order authorizing the creation of a new supercomputing research initiative called the National Strategic Computing Initiative (NSCI). Its goal is to pave the way for the first exaflop supercomputer, something about 30 times faster than today’s fastest machines. Openly available academic data on the Web will soon become the norm, and funders and publishers are already making preparations for how this content will be best managed. With the coming open data mandates, it is now a question of ‘when,’ not ‘if,’ the majority of academic outputs will live somewhere on the Web. The big question then becomes: what’s next?
Open government directives have led to new applications, business models and ways of making use of datasets. While the data produced from governmental surveys and reporting is often much more homogeneous than the increasingly diverse structures of academic research outputs, the sheer volume of research in specific fields has led to new and improved ways to interpret the data, ranging from genetic applications such as 23andMe to sentiment analysis on huge cohorts of social science data from systems such as Twitter and Facebook.
There are a few things that need to happen, though, to make this possible. Researchers have rarely had to consider data provenance and persistence as factors in their work. The Data FAIRport group sums it up well in suggesting that all scientific data should be findable, accessible, interoperable and reusable. This applies to humans and machines alike, and it is the second part, the machines, where the struggles occur. This is where platforms like Figshare can help.
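To make the “machines” part concrete, here is a minimal Python sketch of what machine-actionable access can look like: resolving a dataset’s DOI through the standard doi.org content-negotiation service to retrieve structured metadata instead of a human-oriented landing page. The DOI shown is a placeholder rather than a real record, so treat this as an illustration of the pattern, not a finished tool.

```python
# Minimal sketch: fetching machine-readable metadata for a dataset DOI via
# the doi.org content-negotiation service. The DOI below is a placeholder
# used purely for illustration.
import requests

doi = "10.6084/m9.figshare.0000000"  # hypothetical DOI, not a real record

response = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
    timeout=30,
)
response.raise_for_status()

metadata = response.json()
# The record comes back as structured JSON (title, authors, dates) rather
# than an HTML landing page, which is what "accessible to machines" means
# in practice.
print(metadata["title"])
print([author.get("family") for author in metadata.get("author", [])])
```

When data is findable by a persistent identifier and accessible in a structured format like this, the interoperable and reusable parts of FAIR become far easier to achieve.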
By providing a sanctioned and secure place to store data, institutions encourage researchers to move away from generic cloud-based storage, USB sticks and other poor data management practices that jeopardize the security and longevity of those data sets. Figshare’s private project spaces give researchers a tailor-made solution for sharing and collaborating on their active data. The metrics and reporting dashboard offers institutions unprecedented insight into the outputs being produced and the collaborations being undertaken by their researchers. In addition, the data curation workflows give the institution control over what data is made public and ensure that objects are labeled correctly with appropriate metadata.

The final pieces of the puzzle involve one very technical solution and one cultural one. First, APIs are key to the advancement of data reuse because they allow siloed platforms to be linked and applications to be built on top of them. Second, Open Access to that data is essential for reuse: what’s the point of finding the data if you can’t use it?
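As an illustration of that first, technical piece, the short Python sketch below pulls openly available records out of one platform so they can be reused in another application. It assumes the publicly documented Figshare v2 REST API and its public article-search endpoint; the same pattern applies to any repository that exposes an open API.

```python
# Minimal sketch: pulling public records out of one repository via its API.
# This assumes the publicly documented Figshare v2 REST API and its public
# article-search endpoint; swap in whichever platform and endpoint you use.
import requests

API = "https://api.figshare.com/v2"

resp = requests.post(
    f"{API}/articles/search",
    json={"search_for": "multiscale models"},
    timeout=30,
)
resp.raise_for_status()

for record in resp.json()[:5]:
    # Each public record carries a DOI and a URL, so another application can
    # cite, link to and retrieve the underlying data programmatically.
    print(record.get("doi"), "-", record.get("title"))
```

Because the records are openly licensed as well as openly listed, a loop like this can feed a dashboard, a mash-up or a meta-analysis, which is exactly where the technical piece and the Open Access piece meet.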
Through the power of linked open data, the Web should evolve to return more accurate data in response to any question it is posed. As the world’s largest driver of knowledge, the academic system should provide data to better answer queries at all stages of the learning and educational process. It seems that all of the pieces are falling into place to make this happen. It won’t happen overnight but, for the first time, it seems like we are on the brink of a revolution in academic research.
Mark Hahnel is CEO at Figshare.