Scientific data analysis left the pen-and-ink stage with the advent of hand calculators and computers some years back. Driven by perceived needs, many software routines developed in academia were converted to commercial packages to feed an ever-hungry industrial and governmental base. The data analysis routines were (and still are) both purely mathematical and statistical, and as the packages became specialized, the statistical ones gained the larger following. In the past, this analysis has been driven solely by customer needs and cost. With the arrival of high-speed, high-memory personal computers and servers, the picture has changed rapidly and will change structurally in the near future; hence the motivation for this column's missive. My crystal ball clouds over when I move further out than a few years, so let's concentrate on the near future.
In research in the modern laboratory, data analysis is becoming an ever-greater part of the equation. To this end, we examine the present state of analytics. To my way of thinking, it is still important to do some back-of-the-envelope, simple pen-and-paper calculations before blindly dumping huge data sets into the computer and even more blindly pushing buttons or applying code. In this respect, one of the larger tragedies of modern data analysis, as well as of projections on future analysis, is the tendency to equate data analysis with computers only. This tendency greatly undervalues the usefulness of the biological computer we all possess, which will ultimately review the output and decide whether or not it is logical and reasonable. As the purpose of this review is not to indulge in highly speculative human neural engineering of the future, suffice it to say that
• the statistician needs a better grounding in several areas of the basic sciences
• the computer professional could profit from further statistical education
• the researcher could profitably be introduced to more statistics and computer science
• most importantly, these groups need closer collaboration
Recognizing that the ideas here are far more important than the tools, but the tools are easier to develop than earth-shaking new ideas, what follows is essentially a projection on the tools.
Over 30 years ago, a talented statistician named John Tukey published a book on exploratory data analysis that was extremely valuable in opening eyes as to simple ways of examining (the then) small data sets to highlight patterns, trends and perhaps discover previously unsuspected occurrences;1 in effect “let the data speak for itself.” In an even wiser vein, he later warned researchers and analysts that, “The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”2 Unfortunately, our tools now allow us to extract many answers from huge data sets, and the validity of those answers is not assured by the sophistication of the tools.
In speaking with representatives and developers of several mathematical and statistical software firms, some common threads appear and converge. Interestingly, they concentrated on the hardware aspects as often as the software. These ideas were not developed or, in many cases, even envisioned in the early reviews of this topic.3 Ideas such as the capacity to handle huge data sets and to manipulate and analyze them quickly were considered but not really developed. Let us now look at the future of data analysis through the lens of these trends.
Multithreading: Given the above requirement for handling large data sets and increasingly complex modes of analysis, several developers mentioned multithreading as a primary hardware requirement. We routinely have 2- to 4-core processors on our modern machines that can multithread. However, given a future with higher core counts, they envision both hardware and software interactions to allow full multithreading employing all of the cores. This parallel computing will allow faster and more accurate simulations using ever greater numbers of iterations. By moving computation from the analytical CPU to the graphics processor (GPU), other temporal economies may be realized.
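To make the idea concrete, here is a minimal sketch (not from any particular vendor's package) of the kind of multicore simulation the developers describe: a Monte Carlo estimate of pi split evenly across all available processor cores using Python's standard-library process pool. The function names and iteration counts are illustrative assumptions, not an actual product feature.

```python
# Hypothetical sketch: spreading a Monte Carlo simulation across all CPU
# cores, so that more iterations (and thus more accuracy) fit in the same
# wall-clock time.
import random
from multiprocessing import Pool, cpu_count

def count_hits(n: int) -> int:
    """Count random points in the unit square that land inside the
    quarter-circle of radius 1."""
    rng = random.Random()  # independent generator per worker process
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def parallel_pi(total_iterations: int = 1_000_000) -> float:
    """Estimate pi by dividing the iterations among one worker per core."""
    workers = cpu_count()
    chunk = total_iterations // workers
    with Pool(workers) as pool:
        hits = sum(pool.map(count_hits, [chunk] * workers))
    return 4.0 * hits / (chunk * workers)

if __name__ == "__main__":
    print(parallel_pi())
```

Each worker runs independently, so doubling the core count roughly halves the run time for a fixed iteration budget; this is the "full multithreading employing all of the cores" the developers anticipate, applied to an embarrassingly parallel problem.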
Collaboration/connections: With the world shrinking daily and scientific teams becoming ever larger (note how author lists on molecular genetics papers are rivaling those in particle physics!), quick and seamless integration among geographically disparate teams is a necessity. Developers envision methods of data and analysis sharing that go far beyond a simple viewer. Specialized agents that do the smart work of connecting results from different software will be seen more often.
Abstraction: This may cover everything from making complex mathematics accessible through more natural symbolism and character definitions, to layering between the end user and the core engine, and even to automated proofs (still not there yet).
Functional integration: We would love to automatically apply analytic applications to large data sets. JMP scripting gives a preliminary taste, as scripts can be saved from menu-driven analyses, tied together and summoned with a single button. Soon, this will include mining large data sets with integration from various sources. For example, in genomics, we will finally integrate gene-centric data with protein and pathway data and, ultimately, tie it to molecular biochemistry, pathway analysis and pathology. Self-contained computational knowledge engines are the ultimate results. An early attempt is Wolfram Alpha.
In-memory analytics: Placing more and more data in memory rather than on disk will allow greater use of dynamic linking and computational speed with 64-bit+ computing.
In-database computing: Again, an increase in computational speed may be gained by moving work from the analytic client into the database itself, which runs on larger machines.
Ease of use: With all of the above complexity, it is almost counterintuitive that running the computer and doing the analysis via software will become simpler. However, many software manufacturers are striving for an ever-more user-friendly approach. Where previously we had wizards and a help section, we now have context-sensitive help, assistants and increasingly detailed tutorials, and this is just the beginning. Smart 'bots' that can search other databases and retrieve information, smart visualization for faster, easier, more informative graphics, and knowledge engineering and expert systems will be integral to most software. For a specific example, consider the sentence tree,4 where simple words serve as directives, have their own pop-up menus and offer choices as needed. The goal is a completely intuitive system that does not force the user to learn a specialized computer language.
Readers interested in pursuing further information concerning the future of statistical analytics are referred to Leland Wilkinson’s recent review.4 He not only details many specifics, but also briefly discusses directions that may be taken by many of the most popular software packages. So, there you have it: faster, more integrated, smarter and easier-to-use.
1. Tukey, J.W. Exploratory Data Analysis. Addison-Wesley, Reading, MA (1977)
2. Tukey, J.W. ‘Sunset salvo’. The American Statistician 40(1) (1986)
3. Tukey, J.W. ‘The future of data analysis’. Annals of Mathematical Statistics 33(1) pp. 1–67 (1962)
4. Wilkinson, L. ‘The Future of Statistical Computing’. Technometrics 50(4) pp. 418–435 (2008)
John Wass is a statistician based in Chicago, IL. He may be reached at editor@ScientificComputing.com.