Statistics and genetics package features very nifty tools for life scientists
Figure 1: Initial GenStat window
The Getting Started manual describes this software as “…a complete and comprehensive statistics package…” but, actually, there is a lot more to it. Most packages these days have a variety of statistical tests with diagnostics and options, graphics, data handling capacity, a quality (QA) module and an experimental design (DOE) module. This package seems to have been designed with the life scientist in mind (not surprising, as it comes from a spin-off of Rothamsted Research and the company, VSNi, is subtitled ‘software for bioscientists’). As such, it offers several significant extras, for example, meta-analysis, data mining algorithms, time series analysis and microarray analysis. I stressed this last feature in the title, as microarray analysis is a very sophisticated and specialized area of bioinformatics that is usually available only in highly specialized and expensive commercial software, or in freeware with extensive programming requirements.
The package, therefore, presents the user with a formidable arsenal with which to analyze data, whether as a novice or an experienced practitioner. As with most modern statistical software, there are menu-driven elements for the novice and a programming language for the expert. The setup takes a little getting used to, as with all new software. However, due to developer preferences for menu arrangement and for certain options and output, there are a few other issues that also should be noted.
GenStat is only available for Windows (98, ME, 2000, XP or Vista). This review was done on the Vista operating system. The system recommendations are: 64 MB of RAM and 230 MB of hard drive space for the full installation.
Figure 2: Main window (tiled)
There no longer seem to be paper manuals provided with the software, but a full electronic set is available under the Help menu on the main tool bar. The Statistics Guide seems to consist entirely of command line-driven instructions, while the specialized guides (e.g., Regression, ANOVA, Microarray) employ the menu-driven instructions.
Upon first call-up, the GenStat ‘Getting Started’ window appears superimposed upon the Output window. The Navigator pane and main menu bar complete the initial appearance (Figure 1). The Getting Started box allows the user to find out more about the software or open a file. For experienced users, the box may be excluded from start up. Data also may be accessed from the File menu, and GenStat accepts a variety of formats.
I easily imported data from Excel, SAS and JMP. However, Minitab files necessitated a (straightforward) format change. The full list of files that can be imported would take up a full page. The dialog boxes for these imports ask far too many questions but are smart enough to do a good job when the analyst repeatedly clicks ‘Next.’ This may prove dangerous, however, since, as with other software such as JMP, different analyses and graphics expect data in specific formats.
Figure 3: ANOVA dialog box
Remember when I said that certain features of this software take getting used to? There were several items that ranged from slightly different to bizarre. The slightly different included the main menu bar and some submenus. For example, although the File, Edit, View, Graphics, Stats, Window and Help menus will all be quite familiar to Windows users (and the Run menu to programmers), the others are a bit different. The Data menu contains a mixture of calculators, transforms and data manipulation tools. The Spread (presumably spreadsheet) menu has mostly tools best described as data manipulation aids; yet, it has a submenu called Manipulate with even more. The Tools selection also has things thrown in that seem very broad, but reasonable upon reflection.
The Stats menu actually improves on the arrangement found in many programs, as it brings all of the tests and routines under a single heading and, thus, makes them much easier to access. By contrast, some programs break out specialized analyses such as DOE (experimental design), QA (six sigma), spatial analysis and sample size under their own major headings.
To summarize the above, the basic working environment
• presents a spreadsheet, output, input and event logs as well as the toolbars and navigator pane, all of which may be simultaneously displayed (Figure 2)
• easily imports data
• has a variety of tools to wrestle this data into shape (read: manipulate)
Figure 4: ANOVA graphic options
The menu is fairly complete but probably takes the most patience, as the dialog boxes ask for input in verbiage that may be considered a little different by the novice user. Let’s examine ANOVA, for example. Using their manufacture.gsh data file from the Examples section, we ask for a one-way ANOVA simply by drilling to Stats/Analysis of Variance/One- and Two-way, to get the box in Figure 3.
Here, the new user puzzles over what it wants. However, with a little reflection and a look at the data structure, one quickly sees that Y-variate is the data column and treatments are the groups or levels. The output is also a little new, as the ANOVA table comes up in a simple format without the lines, boxes and colors that we have become used to. It’s reminiscent of plain SAS or STATA output. Also, the usual column we know as the F-ratio is something called v.r. (for variance ratio, which it is). It took an e-mail to the tech help people to figure out how to get the output to come up automatically when I hit run, and how to quickly switch between graphics and text. Anyway, it does give the standard outputs and some nifty integrated graphics (Figure 4).
Just a small sample of the capabilities appears in Table 1. Features such as analysis of microarray data, data mining and meta-analysis are virtually unheard of in most statistical programs. Routines such as time series also are rare.
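To make the v.r. column concrete, here is a minimal Python sketch of a one-way ANOVA laid out the way the dialog box expects it: one Y-variate column and one treatments column. This is not GenStat code, the function name one_way_anova is my own, and the numbers are invented for illustration (they are not the manufacture.gsh data). The variance ratio is simply the treatment mean square divided by the residual mean square, which is what other packages label the F-ratio.

```python
from collections import defaultdict


def one_way_anova(y_variate, treatments):
    """One-way ANOVA for data in 'long' form: one response column
    (y_variate) and one grouping column (treatments).

    Illustrative helper only; the name and return layout are assumptions.
    """
    # Collect the responses belonging to each treatment level.
    groups = defaultdict(list)
    for y, t in zip(y_variate, treatments):
        groups[t].append(y)

    n = len(y_variate)
    k = len(groups)
    grand_mean = sum(y_variate) / n
    group_means = {t: sum(v) / len(v) for t, v in groups.items()}

    # Between-treatment sum of squares.
    ss_treat = sum(
        len(v) * (group_means[t] - grand_mean) ** 2 for t, v in groups.items()
    )
    # Residual (within-treatment) sum of squares.
    ss_resid = sum(
        (y - group_means[t]) ** 2 for t, v in groups.items() for y in v
    )

    df_treat, df_resid = k - 1, n - k
    ms_treat = ss_treat / df_treat
    ms_resid = ss_resid / df_resid

    return {
        "d.f.": (df_treat, df_resid),
        "s.s.": (ss_treat, ss_resid),
        "m.s.": (ms_treat, ms_resid),
        "v.r.": ms_treat / ms_resid,  # the variance ratio, i.e. the F-ratio
    }


# Hypothetical data: three treatments, three observations each.
y = [1, 2, 3, 2, 3, 4, 5, 6, 7]
trt = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]

table = one_way_anova(y, trt)
print("d.f. (treatment, residual):", table["d.f."])
print("v.r.: %.2f" % table["v.r."])
```

A large v.r. relative to the F distribution on the treatment and residual degrees of freedom is what signals a treatment effect; the sketch stops short of the probability, which the package reports for you.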
Table 1: A Selection of Statistical Features
• Analysis of designed experiments (ANOVA)
• Analysis of Microarray data
• Audit Trail/Event Log
• Classification and Regression Trees
• Customizable menus
• Design of experiments
• Display and analysis of circular data (for climatology for example)
• Exploratory data analysis and basic statistics
• Formatted output
• Multivariate analysis techniques
• Point & line identification in 2d & 3d scatterplots
• Regression (Linear, Generalized linear, Nonlinear, Hierarchical Generalized linear)
• REML (unbalanced models, linear mixed models, spatial models, meta-analysis)
• Repeated measurements
• Resampling methods
• Sample size calculations
• Six sigma Statistical Process Control (SPC)
• Spatial Analysis – including co-kriging
• Statistical distribution fitting
• Survival analysis
• Time series analysis
The graphics available in this software are more than adequate for most tasks. Figure 5 lists the major graphic types available.
Figure 5: Graph menu
Creation of graphics is not as simple as the novice would like but, after discovering the fine points, the production of detailed, annotated graphics is straightforward. Here, it is important to note that data columns must be formatted to certain properties, and that some 3-D types need grid and matrix formation. As is now standard in most software, double-clicking on any graphic brings up a wide variety of options for customization. Especially impressive were the zoom and scroll thumbwheels on the graphics, which made moving around and magnifying or reducing the size fun and easy.
Keep in mind that this software was designed for life scientists. As such, in addition to the expected statistical and graphical routines, a number of specialized modules are available. For example, the species abundance models on the statistical menu and the ecology choices under graphics add something for the zoologists and botanists, as does microarray analysis for the molecular biologists and meta- and survival analysis for the clinical crowd. For those in industry, DOE and Six Sigma routines are particularly useful.
It is an omission that specialized manuals are not available for all of these choices, as the one that is provided (Analysis of Microarray Data) is exemplary for its clarity and concise narration. Even before the analysis, there is a beautifully transparent explanation of the molecular biology underlying the studied phenomena. It then walks the reader through designs, simple initial evaluation of data with graphics and, finally, detailed mathematical analysis with diagnostics.
It’s a pity that there is no office for this software on this side of the pond and that the price will fluctuate with the exchange rate. Also problematic is that only yearly, rather than perpetual, licenses are available. Keep in mind, however, that modules such as Microarray Analysis, Meta Analysis, Time Series and several data mining algorithms are usually found only in the very pricey, high-end programs. Despite the learning curve in getting used to the terminology and mechanics, this software has so much to offer the biologist (as well as other scientists and data analysts) as to be a treasure trove. Download of a trial version is encouraged.
• commercial, annual rental stand-alone license: £775 GBP per annum
• present rate in US Dollars (1/09): $1,157.
John Wass is a statistician based in Chicago, IL. He may be reached at editor@ScientificComputing.com.