International Conference Confronts Data Deluge
The sciences are increasingly confronted with a flood of data, as the amount of data produced by scientific experiments and observations grows year after year. Consequently, the maintenance and structuring of scientific data play an ever more important role. Against this backdrop, the International Conference on Scientific and Statistical Database Management (SSDBM) took place at Studio Villa Bosch in Heidelberg, Germany, from June 30 to July 2, 2010, as a joint event of Heidelberg University and the Heidelberg Institute for Theoretical Studies (HITS). About 80 scientists from 10 countries participated.
For more than 20 years, SSDBM has brought together scientific domain experts, database researchers, practitioners, and developers to present and exchange current research on concepts, tools, techniques, and architectures for scientific and statistical database applications. SSDBM provides a forum for original research contributions and practical system design, implementation, and evaluation. Individual themes differ from year to year, with the main focus remaining on databases and their applications in the scientific and statistical fields. Recent themes have included geospatial and sensor databases, bioinformatics (genomics and biodiversity informatics, including biological databases), geological databases, data mining and analysis, metadata management, conceptual models, data integration, information visualization, scientific workflows, and system architectures. The priority areas are supplemented with invited talks and panel sessions and illustrated with demonstrations of research prototypes and industrial systems.
Computer scientist Michael Gertz, a professor at Heidelberg University, was responsible for the program and Professor Andreas Reuter (HITS, Heidelberg) served as the conference chair. Keynote speakers were Daniel Abadi, an Assistant Professor at Yale University, and Roger Barga, senior architect of the Cloud Computing Futures (CCF) group at Microsoft Research.
Abadi spoke on “Tradeoffs Between Parallel Database Systems, Hadoop, and HadoopDB as Platforms for Petabyte-Scale Analysis.” He explained that, as market demand for analyzing data sets of ever greater variety and scale continues to explode, the software options for performing this analysis are proliferating. No fewer than a dozen companies selling parallel database products have launched in the past few years to meet this demand. At the same time, MapReduce-based options, such as the open-source Hadoop framework, are becoming increasingly popular, and a plethora of research publications in the past two years have demonstrated how MapReduce can be used to accelerate and scale various data analysis tasks.
Both parallel databases and MapReduce-based options have strengths and weaknesses that a practitioner must be aware of before selecting an analytical data management platform. Abadi described his experiences in using these systems and the advantages and disadvantages of their popular implementations. He discussed a hybrid system being built at Yale University, called HadoopDB, that attempts to combine the advantages of both types of platforms. Finally, he discussed his experience in using HadoopDB for both traditional decision-support workloads (i.e., TPC-H) and scientific data management (analyzing the UniProt protein sequence, function, and annotation data).
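To make the trade-off Abadi described more concrete, the sketch below (purely illustrative, not taken from the talk) expresses the same hypothetical aggregation in the two styles: as declarative SQL, which a parallel database plans and executes itself, and as map and reduce functions, which a framework such as Hadoop distributes across nodes. The table, column, and function names are assumptions made for this example.

```python
# Illustrative sketch only: the same aggregation in the two styles Abadi compared.
# Table and column names are hypothetical.

from collections import defaultdict

# Parallel-database style: declarative SQL; the engine plans the parallelism.
#   SELECT species, COUNT(*) FROM proteins GROUP BY species;

# MapReduce style: the programmer supplies map() and reduce(); a framework
# such as Hadoop handles partitioning, shuffling, and fault tolerance.
def map_fn(record):
    """Emit a (key, 1) pair for each input record."""
    yield (record["species"], 1)

def reduce_fn(key, values):
    """Sum the counts collected for one key."""
    return (key, sum(values))

def run_mapreduce(records):
    """Tiny single-process stand-in for the map / shuffle / reduce phases."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):   # map phase
            groups[key].append(value)       # shuffle: group values by key
    return [reduce_fn(k, vs) for k, vs in groups.items()]  # reduce phase

if __name__ == "__main__":
    sample = [{"species": "human"}, {"species": "yeast"}, {"species": "human"}]
    print(run_mapreduce(sample))  # [('human', 2), ('yeast', 1)]
```

HadoopDB's hybrid approach, as presented in the talk, aims to combine the strengths of the first style with the scalability of the second.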
Barga addressed “Emerging Trends and Converging Technologies in Data Intensive Scalable Computing.” There is wide agreement today that data-intensive scalable computing methods are essential to advancing research in many disciplines, he said, and such methods are expected to play an increasingly important role in supporting well-informed technical decisions and policies. They are therefore of great scientific and social importance.
The growing wealth of data is manifest as an increasing number of data collections, ranging from curated databases to assemblies of files. The former provide reference resources, preservation, and computational access, whilst the latter are often structured as spreadsheets or CSV files and stored on individual researchers’ computers. Many of these collections are growing in both size and complexity. As computer technology and laboratory automation increase in speed and decrease in cost, more and more primary sources of data are deployed, and the flow of data from each one increases.
At the same time, a growing number of researchers and decision makers are both contributing to the data and expecting to exploit this abundance of data for their own work. They require new combinations of data, new and ever more sophisticated data analysis methods, and substantial improvements in the ways results are presented. And it is not just the volume of information that matters, but also its scope: it is becoming more important for different fields of science to work collaboratively to drive new discoveries. While cross-disciplinary collaboration is helping to drive new understanding, it also imposes even greater levels of complexity, Barga explained.
This pervasive change is part of a research revolution that introduces a wave of data-driven approaches so transformative that Jim Gray termed it “The Fourth Paradigm.” Current strategies for supporting it demonstrate the power and potential of these new methods. However, they are not sustainable, as they demand far too much expertise and assistance for each new task. To carry out such tasks effectively and repeatedly, we must tease out the principles that underpinned their success and, through clarification, articulation, and tools, make it possible to replicate that success widely with fewer demands for exceptional talent, he said. This will allow researchers to spend more of their time on research.
His talk took an opinionated look at the past, present, and future of data-intensive scalable computing, outlined trends that have recently emerged in the computer industry to cope with it, showed why existing software systems are ill-equipped to handle this new reality, and pointed toward some bright spots on the horizon. He also shared his predictions of technology convergence.
For further information: www.ssdbm2010.org