Partek Genomics Suite could see much wider future distribution
This month’s column is both a chance to not-so-subtly move readers’ attention to the confluence of biology and mathematics via genomics software, and an opportunity to sneak in a short review, as I have had so many requests of late that I have fallen behind even my laggard schedule.
Partek Genomics Suite is advanced statistics and interactive data visualization software.
Before springing this type of software onto my audience, a few words of background are needed for my chemist and engineer friends who may not do this type of analysis. Genomics is presently an extremely active field, as it is highly funded and thus attractive to a number of researchers. Molecular biologists and statisticians are currently its most ardent practitioners. Basically, we are interested in determining whether different disease states or chemical treatments will amplify or depress the actions of genes, specifically their activity and products. This is done on microarrays: small plastic chips containing gene surrogates tagged with fluorescent molecules. The patterns and intensities of the fluorescence for the various treatments are computationally analyzed to yield information on gene activity. The trick is to pick the true signals out of a sea of noise, and to try to connect the statistical to the biological. Partek Genomics Suite is one of a number of software products that do the former. The latter is really up to the scientist.
The Partek Suite offers a fairly comprehensive set of statistical and interactive visualization tools that serve as an integrated approach to genomic data analysis. This involves not just data importation, but annotation of gene sets and interactive analysis, whereby the statistics and graphics can ebb and flow about any change in the analytic approach and yield fast and simple, yet powerful, methods to understand drivers of variation in the data. As genomic data sets tend to be large (10,000 to 50,000 columns and 10 to 500 rows), the software usually runs from a server. Partek is a welcome exception to this restriction and runs from an individual PC.
Let us begin with a trial run. Partek provides several case studies as tutorials to guide the new user through the analysis of a “typical” genomic data set. I chose the Breast Cancer Study using data available from an NIH Web site. This was my first very pleasant surprise from the software, i.e., ease of data importation. With many data sets, the user can cut/paste from a standard Excel file using .xls, .csv, or .txt file types. In genomics, we are forever encountering specialized file types, and most of the software is either bundled with a manufacturer’s hardware or produces files that are challenging to import into other analytic packages. Sometimes, the data is tied to special annotations or treatment types that need to be specified in a precise manner (read: the import Wizards ask a lot of questions). In any case, importation is rarely as simple as we would want. Partek Genomics Suite seems to have found a way to do this in a fast and very simple manner, and data may be imported directly from Excel, cut and pasted in, or opened in a special Partek .fmt format. It also accepts the standard Agilent and Affymetrix chip formats, as most genomics software does. The program runs efficiently, as a 12,600-column spreadsheet loaded in about two seconds. An annoying “Tip of the Day” pop-up may be disabled on the first run.
Once the data is in the Analytic Spreadsheet, the fun begins as analytic graphics are immediately produced. A 3-D scatter plot is typical of the colorful and very useful series of graphics that are readily generated and customized. The axes of the graphic represent the first three principal components from a dimensional reduction of the data set. These components represent vectors in 3-space that point in the directions of greatest variation in the data. This is used as a mathematical device to separate different factors that affect the spread of the data. Each point in the graphic represents a microarray chip, or a row in the spreadsheet. The colored dots represent different treatments, and points that are near each other denote characteristics that are similar across the entire genome. Each graph can be rotated around its three axes, and size, color, shape and connections may be easily modified to highlight treatment or batch effects. I’m personally not a great fan of principal components analysis (PCA) because, if separation is not accomplished in three dimensions, software usually proceeds to n dimensions where a separation is possible. At that point, relating the mathematics to something that is readily assignable to biology in 3-space becomes problematic. However, for those cases where PCA does proceed to higher dimensions, the user can implement discriminant analysis with Partek.
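For readers who have not met principal components before, the idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not Partek's implementation: the function name, the power-iteration approach and the tiny made-up data set (six chips, four genes) are all my own assumptions. It finds the single direction of greatest variation and the score of each chip along it.

```python
import math
import random


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def first_principal_component(rows, iterations=100, seed=0):
    """Return (direction, scores) for the first principal component.

    rows: one list of expression values per chip (sample).
    Power iteration is used here for brevity; real packages
    typically use an SVD or eigen-decomposition instead.
    """
    n, p = len(rows), len(rows[0])
    # Center every gene (column) on its mean across chips.
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]

    # Random start vector, seeded so the sketch is repeatable.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    for _ in range(iterations):
        # Multiply by X^T X (proportional to the covariance matrix).
        scores = [dot(row, v) for row in centered]
        w = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            raise ValueError("no variation found along the start vector")
        v = [x / norm for x in w]
    return v, [dot(row, v) for row in centered]


# Hypothetical toy data: chips 0-2 are "treated", chips 3-5 are "control".
chips = [
    [5.0, 5.1, 1.0, 1.1],
    [5.2, 4.9, 0.9, 1.0],
    [4.9, 5.0, 1.1, 0.9],
    [1.0, 1.2, 1.0, 1.0],
    [0.9, 1.0, 1.1, 1.1],
    [1.1, 0.9, 0.9, 1.0],
]
direction, scores = first_principal_component(chips)
```

In this toy case the two groups of chips land on opposite sides of zero along the first component, which is the kind of separation the 3-D scatter plot makes visible. The sign of a component is arbitrary, so only the relative positions matter.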
Molecular biologists tend to love hierarchical clustering as a rough separation tool to examine treatment effects and, again, this software is very strong on meaningful graphics. The clustering routine is not only easy to produce but, with a little practice, powerful modifications are possible. In the breast cancer example, the novice can quickly visualize a clustering according to batch, time and treatment in the same colorful graphic. This allows a fast scan and determination of data trends and patterns that may provide valuable insight as to drivers of the measured outcomes.
The actual numerical analysis for variance sources is done with the very flexible and powerful mixed-models ANOVA. Contrasts are specified, false discovery rate calculated, and a complete ANOVA HTML report generated. An informative ‘Sources of Variation’ graphic is produced that allows variance components to be compared to the error measurement (which is normalized to unity for reference). Although the horizontal axis in this graphic represents an average mean squared variation across all of the genes, it is a simple right-click on a row header to see the variance sources for each individual gene. Further analyses of gene expression differences may be visualized on a variety of quickly constructed dot plots. There is also a good assortment of standard statistical tests to do hypothesis testing and prediction.
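To give a feel for what a false discovery rate calculation involves, here is a minimal Python sketch of the Benjamini-Hochberg adjustment, one widely used procedure. I am assuming this method purely for illustration; I have not confirmed which procedure Partek applies, and the function name and the five p-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values, as might come from an ANOVA contrast.
p_values = [0.001, 0.008, 0.039, 0.041, 0.20]
q_values = benjamini_hochberg(p_values)

# Genes kept at a 5% false discovery rate.
significant = [i for i, q in enumerate(q_values) if q <= 0.05]
```

Note that two of the invented genes have raw p-values below 0.05 yet fail the adjusted cutoff, which is exactly the protection one wants when testing thousands of genes at once.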
As batch effects occur in many studies and may obscure other, more relevant contributors to expression changes, there is a convenient box in which to remove this effect so that variance sources may be recalculated in its absence. Other factors may be similarly removed.
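The simplest way to picture removing a batch effect is per-batch mean centering, sketched below in Python. This is an assumption on my part for illustration only; Partek presumably works through its ANOVA model, and the function name and data here are hypothetical.

```python
from collections import defaultdict


def remove_batch_effect(rows, batches):
    """Subtract each batch's per-gene mean, then add back the overall mean.

    rows: one list of expression values per chip.
    batches: one batch label per chip, in the same order as rows.
    """
    n, p = len(rows), len(rows[0])
    grand_mean = [sum(row[j] for row in rows) / n for j in range(p)]

    # Group the chips by batch label.
    groups = defaultdict(list)
    for row, batch in zip(rows, batches):
        groups[batch].append(row)

    # Per-gene mean within each batch.
    batch_mean = {
        batch: [sum(row[j] for row in members) / len(members) for j in range(p)]
        for batch, members in groups.items()
    }

    return [
        [row[j] - batch_mean[batch][j] + grand_mean[j] for j in range(p)]
        for row, batch in zip(rows, batches)
    ]


# Hypothetical data: batch "B" reads about 10 units higher than batch "A".
chips = [[1.0, 2.0], [3.0, 4.0], [11.0, 12.0], [13.0, 14.0]]
labels = ["A", "A", "B", "B"]
adjusted = remove_batch_effect(chips, labels)
```

One caution: if a treatment is confounded with a batch (all treated chips run on the same day, say), this kind of centering removes the treatment signal along with the batch effect.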
Finally, as expression levels are usually obtained under a variety of treatment conditions, it is of interest to create lists of significant genes under each condition and to compare the lists for overlap. This is again easily done with colorful Venn diagrams that graphically illustrate the degrees of overlap for each comparison. As with all of Partek’s graphics, the color, hue and saturation are adjusted to achieve maximal pedagogical (if not theatrical) effect.
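Behind the graphics, a Venn diagram is just set arithmetic. The short Python sketch below shows one way to compute the regions; the function name, condition names and gene labels are placeholders of my own, not output from Partek.

```python
def venn_regions(gene_lists):
    """Map each combination of conditions to the genes found in exactly those conditions.

    gene_lists: dict of condition name -> set of significant gene names.
    """
    regions = {}
    all_genes = set().union(*gene_lists.values())
    for gene in all_genes:
        # The set of conditions in which this gene was significant.
        members = frozenset(
            name for name, genes in gene_lists.items() if gene in genes
        )
        regions.setdefault(members, set()).add(gene)
    return regions


# Hypothetical significant-gene lists for three conditions.
lists = {
    "treatment": {"GENE_A", "GENE_B", "GENE_C"},
    "time": {"GENE_B", "GENE_C", "GENE_D"},
    "batch": {"GENE_C", "GENE_E"},
}
regions = venn_regions(lists)

# Genes significant under all three conditions (the center of the diagram).
shared_by_all = regions.get(frozenset(lists), set())
```

Each key in the result corresponds to one region of the diagram, so the counts printed inside a Venn graphic are simply the sizes of these sets.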
The only downsides that I could discover during my trial runs with several data sets have to do with robustness of the software. There were several occasions when I clicked on certain areas of the screen and the spreadsheet disappeared, to be replaced by very small squares. On several occasions, the software abruptly generated error messages and either shut down the program on its own, or shut down after I hit the button. The fact that all of the error dialog boxes had a ‘Save Error Message’ button gave me a funny feeling that calls to the help desk were expected. As I subsequently found with a call to their technical support group, the software is constantly being upgraded as bugs are discovered. I was quickly sent a newer version that seemed to have solved the above problems.
The other quibbles are all rather minor, such as the lack of an ‘Undo’ button (one of my favorite features in Excel!), the lack of a ‘Print’ button on the graphics boxes, the quirky size of the pop-up graphs, the lack of an even, full-screen tiling effect with the ‘Tile’ button, and the need for more statistical information in the online manual. Due to the ease-of-use features and the extremely helpful graphics, however, this package is something that will probably see much wider distribution in the future.
John Wass is a statistician with GPRD Pharmacogenetics, Abbott Laboratories. He may be contacted at editor@ScientificComputing.com.