After doing a few comparative reviews and too many individual reviews to count, this month’s editorial is something of a departure from the usual. It is not a comparative study. (I built a table of comparative features covering the 10 or so most popular packages several years ago.) Rather, it is a running commentary on
• what users seem to feel is important in a platform
• how well these items are addressed by present packages
• where the state of the art seems to be headed.
In considering what users look for in a package, the following are usually of prime importance:
• ease-of-use
• sufficiency of the testing/graphics menus
• price
People will rarely buy a package for a single test. However, it was recently brought to my attention by one developer that their (fairly) complete program was mainly used for Six Sigma studies, with clients just about ignoring the other 90 percent or so of the capabilities.
Proceeding on the assumption of users who need more than a single group of tests, but nothing too fancy, and at a reasonable price, we can examine ease-of-use issues, including the addition of useful tools to aid in some of the more onerous tasks. Most modern developers realize that, if their software gets a reputation for being difficult to use, requiring programming, being quirky, or having slow, cumbersome implementations of routines, sales will plummet. Therefore, we find each version of statistical software including a few more helps, more efficient Wizards and help/search features, and algorithm optimization.
Most offer a specialized programming language to allow further capability and flexibility in routines, a feature that many, if not most, users ignore. Assuming that the end users are scientists, engineers, and technicians with no formal training in statistics since their undergraduate degrees, a gentle learning curve for both the menu-driven sequence and the statistical theory is imperative. All of the menu-based programs with which I am familiar (e.g., Prism, Origin, SPSS, SigmaStat/Plot, Statistica, SYSTAT, Minitab, Genstat), as well as the task-based JMP, hew to this rule.
The way in which the learning curve is flattened differs from package to package but, for the most part, all have improved their search engines under the Help/Search main menu option. In the case of Origin and SPSS, output is given in colored, easy-to-read boxes. Prism and JMP assist greatly by intimately bonding most analytic output to graphics (other platforms do this to some extent) and by offering explanations for many terms through mouse-overs or the “?” tool. SigmaStat pioneered the use of Wizards that take the novice step-by-step through an analysis with a series of questions. Statistica is an exceedingly complete package that relies on menu-driven steps to complete even the most complex analyses, and SYSTAT is known for its excellence in linear models, now all menu-driven (although programming is always an option). Genstat includes menu-driven modules to facilitate microarray analysis, as well as meta-analysis, and uses simple-to-understand dialog boxes. Minitab, now used extensively for QA studies, pioneered the analysis six-pack, whereby a single check in a box produces six control charts from a single data set. The idea behind most of these is to give the analyst maximal output for minimal work. Software packages oriented to professional statisticians, such as SAS and Stata, use code but are the most powerful and offer the most flexibility.
Since we are focusing on maximal output for minimal work, the very important data management tools need mention. All too often in the past, it was necessary for scientists and technicians to wrestle with large data sets in Excel to manipulate them into usable form. For those users not acquainted with the shortcuts, this was doubly difficult. Now, most packages offer some type of sort/stack/transpose/subset tools to make life simple. The trend is to offer more capability, as the tools are phenomenal time savers. JMP and Minitab have made access especially simple by placing the tools under a single menu heading with no drilldowns or submenus.
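For readers who script rather than click, a minimal sketch of the same sort/stack/transpose/subset operations, written in Python with the pandas library rather than any of the packages discussed here, may make the point concrete. The column names and data are invented for illustration.

```python
import pandas as pd

# Hypothetical wide-format assay data: one row per sample, one column per replicate
df = pd.DataFrame({
    "sample": ["A", "B", "C"],
    "rep1":   [4.1, 5.0, 3.8],
    "rep2":   [4.3, 4.9, 4.0],
})

# Sort: order the rows by the first replicate
df_sorted = df.sort_values("rep1")

# Stack (wide-to-long): one row per sample/replicate pair
df_long = df.melt(id_vars="sample", var_name="replicate", value_name="value")

# Transpose: swap rows and columns
df_t = df.set_index("sample").T

# Subset: keep only measurements above a threshold
df_subset = df_long[df_long["value"] > 4.0]

print(df_long)
```

The menu-driven packages wrap each of these steps in a single dialog; the sketch simply shows how little data manipulation is actually involved once the tools exist.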
Naturally, the whole point of the software is to do statistics, so the prospective purchaser would expect an extensive menu of tests with good diagnostics. This is especially important at large, multi-user facilities where scientists and engineers have differing needs depending on discipline and department. Here we see programs that started as graphics platforms now bulking up on the statistical side and offering a greater variety of tests. Once a test is performed, it is desirable to have diagnostics presented so that the analyst can get an idea of how sound the results actually are and whether they can be trusted. What is offered varies by test (statistical limitations) and by package (developer choice, hopefully based upon user feedback). As these diagnostics are doubly important in multivariate analysis and in complex areas such as genomics, simulations, and meta-analysis, more are being offered.
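As a rough illustration of what “a test plus diagnostics” means in code terms, the following Python sketch (using the statsmodels and SciPy libraries, with made-up calibration data) fits a simple regression and then checks the residuals, the kind of post-fit diagnostic the better menus now surface automatically.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Hypothetical calibration data: response vs. concentration
rng = np.random.default_rng(1)
conc = np.linspace(1, 10, 20)
resp = 2.0 * conc + rng.normal(scale=0.5, size=conc.size)

# Fit an ordinary least-squares line and inspect the usual diagnostics
X = sm.add_constant(conc)          # add the intercept term
fit = sm.OLS(resp, X).fit()
print(fit.summary())               # R-squared, coefficient t-tests, etc.

# A post-fit check the menus often bundle in: are the residuals plausibly normal?
w_stat, p_value = stats.shapiro(fit.resid)
print(f"Shapiro-Wilk on residuals: W={w_stat:.3f}, p={p_value:.3f}")
```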
Now, as to the Help/Search features… many programs are going to paperless electronic manuals that come with the package and reside on the PC. It is thus imperative that these search engines provide optimal benefit to any analyst seeking core information, i.e., ‘Is the test available?’ ‘How do I perform the test?’ and ‘How do I interpret the output?’ Although some make improvements, such as the use of multiple descriptors, others merely home in on the first word and offer little explanation of what the test actually does. There are still many terms poorly cross-referenced. What we need, in many cases, is not smarter engines, but more complete indexing!
Perhaps one of the more pressing needs being addressed these days is optimization (of both code and algorithms). In the good old days of small data sets, results came nearly instantaneously and consumed few computational resources. With the advent of huge databases and datasets in the last 10 or so years, the need for fast, efficient computation became exceedingly important. Announcements of new levels of optimization now appear with most new versions of our statistical software. It seems that, no matter the improvement in computer capabilities, the amount of data more than keeps up. This is as true in business as it is in the sciences and engineering.
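The flavor of this kind of optimization is easy to show outside any particular package. The hypothetical Python comparison below computes the same column means twice, once with an explicit interpreted loop and once with a single vectorized NumPy call, which is the sort of algorithmic reworking the vendors advertise for large datasets.

```python
import time
import numpy as np

# Hypothetical "large" dataset: 100,000 rows, ten columns
data = np.random.default_rng(0).normal(size=(100_000, 10))

# Naive approach: accumulate the column means in a Python loop
t0 = time.perf_counter()
means_loop = [sum(data[i, j] for i in range(data.shape[0])) / data.shape[0]
              for j in range(data.shape[1])]
t_loop = time.perf_counter() - t0

# Optimized approach: let the library do the work in compiled code
t0 = time.perf_counter()
means_vec = data.mean(axis=0)
t_vec = time.perf_counter() - t0

print(f"loop: {t_loop:.2f} s, vectorized: {t_vec:.4f} s")
print(np.allclose(means_loop, means_vec))  # same answer, very different cost
```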
One of the nicer features for even the experienced analyst is the addition of special modules for specialized and complex analysis. As we often lack sufficient data for statistical power or to calculate a parameter estimate, and as we are often not clever enough to construct a mathematical proof to underpin our brilliant new methodologies, we often run simulations to demonstrate that it ‘looks as if we are on the right track.’ Monte Carlo and resampling methods are therefore a more than welcome addition to more than just one or two packages. As was mentioned, one package even addresses meta-analysis. These are, of course, high-end additions. Among the once exciting and now mundane inclusions are experimental design, QA/Six Sigma and control charting, power/sample size calculations, and curve-fitting routines.
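For readers who have not met these tools, here is a minimal resampling sketch in Python/NumPy with invented data: a bootstrap confidence interval for a mean, the simplest example of the kind of simulation these built-in modules automate.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical small sample: too few points for comfortable asymptotics
sample = np.array([12.1, 9.8, 11.4, 10.9, 13.2, 10.1, 11.7, 12.5])

# Bootstrap: resample with replacement many times and record the statistic
n_boot = 10_000
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(n_boot)
])

# A 95% percentile confidence interval for the mean
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"observed mean = {sample.mean():.2f}, 95% bootstrap CI = ({lo:.2f}, {hi:.2f})")
```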
The older programs had started to include neural networks, which were slowly bulked up, but the state of the art has already jumped to data mining, variants of which are now included in several packages. These areas are becoming vital to those disciplines where huge datasets need to be patrolled and meaningful relations culled. This is an explosive field in itself and will increasingly be considered by developers. The limiting factor here is price: inclusion of a highly sophisticated algorithm may rocket pricing into the stratosphere.
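To give a flavor of what patrolling a huge dataset can mean in practice, here is a toy unsupervised-clustering sketch in Python with scikit-learn, on fabricated data; it represents only one narrow slice of what the commercial data-mining modules offer.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical high-throughput measurements: 10,000 samples, 5 features,
# generated as two overlapping groups
rng = np.random.default_rng(7)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(5_000, 5)),
    rng.normal(loc=3.0, scale=1.0, size=(5_000, 5)),
])

# Unsupervised clustering: let the algorithm propose the groupings
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Each sample receives a cluster label; these become candidate relations to investigate
labels, counts = np.unique(km.labels_, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```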
While on the subject of specialized modules, many packages have included discipline-specific ones for some time. Modules for the calculations specific to epidemiology, pharmacokinetics, and pharmacodynamics have been in place for a while. Now genomics is also being included (in just one package, JMP Genomics, so far). Others will soon follow, given interest and cost factors.
Lastly, a few words on a topic (already alluded to) near and dear to our hearts… PRICE! Many years ago, when I first started using statistical software, it was possible to get a fairly complete package for under $400. These were DOS-based and had far less on them than what we see today, but were at least affordable. Today, the packages run from about $600 to over $20,000 for those offering specialized modules and optimized for large data sets. These are all affordable for large corporate and governmental clients, but are no longer within the reach of individuals and most academic users. The use of freeware is now widespread in most academic and some corporate facilities, with the R language among the most popular for scientific computing. As this does not come in a shrink-wrapped, menu-driven platform, it was not considered here.
In summary, where have we been? More statistics, graphics and some helps. Where are we going? Greater association of graphics with analysis, greater ease-of-use features, more optimization, more interconnectivity, more specialized modules, and more expensive packages.
John Wass is a statistician based in Chicago, IL. He may be reached at editor@ScientificComputing.com.