This intelligent engine borders upon data mining, as it cuts through prediction and classification problems
|Multivariate analysis allows deeper insights into patterns and their underlying meaning, facilitating more accurate predictions.|
What’s so remarkable about it? Well, for one, it was originally geared to chemical engineers, but is now used by a wide spectrum of chemists. The package is remarkable for its choice of tools, namely experimental design (DOE) and multivariate statistics. It does a good job of both, and the developers have added many new features (this is NOT your father’s version 9.2!). As with most software, it has its own way of doing things and takes a little getting used to. However, with a little practice, the routines get easier and the learning curve is not steep.
Let’s start with an overview of the capabilities and then take a tour of the software in action by following one of the tutorials. As was mentioned, the capabilities have been considerably bulked up from the previous versions, and there are many changes on the main menu. The file menu has been considerably reduced, with several choices assigned to other areas.
|Figures 1 and 2|
The edit menu has been simplified, making variable changes slightly easier. This is one area where I would love to see standardization across the industry, as defining variables is always a pain: the nomenclature changes between platforms and can easily confound the novice. However, as most practitioners concentrate on a single platform, experience soon overcomes the initial confusion.
The other choices, such as View, Plot, Modify, Tasks and Results, all have been simplified and reassigned. Most of these changes were for the better, as the new menu bar offers a cleaner result that is easy to follow and almost intuitive to use. The new Tools menu offers some really nice features, such as the Matrix Calculator, design extensions and Audit Trail, this last being invaluable in today’s documentation-heavy environment. Examples of the Plot and Tasks/Transform menus are given in Figures 1 and 2.
Not to ignore the statistics, where the software has much to offer within its niche, we have the tests listed in Figure 3, as well as the ability to classify and predict.
Software in Action
Now that we have a brief background on the capabilities, let’s walk through an example. Here, I should apologize to the chemists and engineers, as I have chosen a life science example (and one with which I’m somewhat familiar): Fisher’s classical Iris problem. For the life scientists, as you undoubtedly guessed from the Transformations menu, a lot of UnscramblerX is targeted to analysis of chemical spectra.
|Figures 3 and 4|
As to background, the problem here is to classify three types of iris based upon four measurements taken on a number of plants. The first item is to import the data and define variables. Importing is usually easy from the File/Open or File/Import Data menus (and Excel files are imported with minimal hassles), but here we merely click on the tutorial link and the file is imported (Figure 4).
There are four measurement types, as well as a classification column that marks the rows as either Calibration or Training (for some reason a 50:50 split is used, rather extravagant in small samples). This may be confusing to the novice who is used to Training versus Testing nomenclature, but it may be more easily assimilated by the chemists. Note also the addition of a column to the extreme left. This is the classification column that tells the software what samples belong to the same class, a requirement for further analysis. This last task is accomplished by a simple Edit/Insert command followed by specifying the levels of the categorical data variable.
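For readers who want to see what such a 50:50 division amounts to outside the package, here is a minimal Python sketch. It is not Unscrambler code: the function name split_by_class, the fixed random seed and the "held-out" label for the second half are assumptions made for illustration, and the only facts taken from the data set are its three species of 50 plants each.

```python
import random
from collections import defaultdict


def split_by_class(labels, fraction=0.5, seed=0):
    """Return (calibration_idx, held_out_idx), sampling `fraction` of each class.

    Hypothetical helper for illustration; not part of any package.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    calibration, held_out = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)  # random order within the class
        cut = round(len(indices) * fraction)
        calibration.extend(indices[:cut])
        held_out.extend(indices[cut:])
    return sorted(calibration), sorted(held_out)


# The Iris data set has 150 plants, 50 of each species.
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
cal, held = split_by_class(labels)
print(len(cal), "calibration rows,", len(held), "held-out rows")
```

Splitting within each class, rather than over the whole table, keeps every species equally represented in both halves, which matters when the samples are as few as they are here.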
Once the data is imported and formatted, the analysis begins with a simple (hierarchical) clustering of the data, as shown in Figure 5. In this case, there are actually two varieties of iris commingled in the blue area. To try a finer classification, we use principal components analysis (PCA) to get a more complete picture. We get the usual scores and loadings (not shown) as well as the Influence and Explained Variance graphics shown in Figure 6.
It is nice to see that we have already explained approximately 95 percent of the variance with our first two principal components. All graphics can be modified by zooming, coloring and scaling the axes, among other things.
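The explained-variance figure is simply each component's eigenvalue divided by the total variance. The sketch below works that out in plain Python on a 15-row excerpt of the Iris data (the first five plants of each species), so its percentages will not match the package's output for the full tutorial file. It assumes mean-centered, unscaled data and uses power iteration with deflation as a stand-in for whatever algorithm the software actually uses; the helper names covariance and top_eigenpairs are hypothetical.

```python
# First five rows of each species from Fisher's Iris data:
# sepal length, sepal width, petal length, petal width (cm).
IRIS = [
    (5.1, 3.5, 1.4, 0.2), (4.9, 3.0, 1.4, 0.2), (4.7, 3.2, 1.3, 0.2),
    (4.6, 3.1, 1.5, 0.2), (5.0, 3.6, 1.4, 0.2),   # setosa
    (7.0, 3.2, 4.7, 1.4), (6.4, 3.2, 4.5, 1.5), (6.9, 3.1, 4.9, 1.5),
    (5.5, 2.3, 4.0, 1.3), (6.5, 2.8, 4.6, 1.5),   # versicolor
    (6.3, 3.3, 6.0, 2.5), (5.8, 2.7, 5.1, 1.9), (7.1, 3.0, 5.9, 2.1),
    (6.3, 2.9, 5.6, 1.8), (6.5, 3.0, 5.8, 2.2),   # virginica
]


def covariance(rows):
    """Sample covariance matrix of the mean-centered rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def top_eigenpairs(matrix, k, iterations=500):
    """Largest k eigenpairs of a symmetric matrix (power iteration + deflation)."""
    a = [row[:] for row in matrix]
    p = len(a)
    pairs = []
    for _component in range(k):
        v = [1.0] * p
        for _step in range(iterations):
            w = [sum(a[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        # Rayleigh quotient of the converged unit vector gives the eigenvalue.
        eigenvalue = sum(v[i] * a[i][j] * v[j]
                         for i in range(p) for j in range(p))
        pairs.append((eigenvalue, v))
        # Deflate: remove the component just found before seeking the next.
        for i in range(p):
            for j in range(p):
                a[i][j] -= eigenvalue * v[i] * v[j]
    return pairs


cov = covariance(IRIS)
total_variance = sum(cov[i][i] for i in range(len(cov)))
ratios = [value / total_variance for value, _ in top_eigenpairs(cov, 2)]
for number, ratio in enumerate(ratios, start=1):
    print(f"PC{number}: {ratio:.1%} of total variance")
print(f"First two components together: {sum(ratios):.1%}")
```

Whether the variables are scaled before the decomposition changes these percentages noticeably, so it is worth checking which pretreatment a package applies before comparing numbers.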
Finally, to get the best split or classification, the user must create class models for each of the iris types by PCA and then choose a classification scheme to do the final grouping. This gets a bit more involved than I would like, but suffice it to say that we produce a detailed separation graphic that is interpretable, as shown in Figure 7.
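To give a feel for what "class models by PCA" means, the following Python sketch fits a one-component PCA model to each species and assigns a new flower to the class whose model leaves the smallest residual. This is a deliberately simplified, SIMCA-style rule and not the package's procedure: real implementations add statistical limits on residuals and leverage, and may use more components. The calibration rows are the first five plants of each species, the test flower is the sixth setosa row, and the function names are hypothetical.

```python
# Calibration rows: first five plants of each species in Fisher's Iris data
# (sepal length, sepal width, petal length, petal width, in cm).
CALIBRATION = {
    "setosa": [(5.1, 3.5, 1.4, 0.2), (4.9, 3.0, 1.4, 0.2), (4.7, 3.2, 1.3, 0.2),
               (4.6, 3.1, 1.5, 0.2), (5.0, 3.6, 1.4, 0.2)],
    "versicolor": [(7.0, 3.2, 4.7, 1.4), (6.4, 3.2, 4.5, 1.5), (6.9, 3.1, 4.9, 1.5),
                   (5.5, 2.3, 4.0, 1.3), (6.5, 2.8, 4.6, 1.5)],
    "virginica": [(6.3, 3.3, 6.0, 2.5), (5.8, 2.7, 5.1, 1.9), (7.1, 3.0, 5.9, 2.1),
                  (6.3, 2.9, 5.6, 1.8), (6.5, 3.0, 5.8, 2.2)],
}


def fit_class_model(rows, iterations=500):
    """Return (class mean, first principal direction) for one class."""
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - mean[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0] * p
    for _step in range(iterations):  # power iteration for the leading direction
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return mean, v


def residual_distance(sample, model):
    """Distance from the sample to the line (mean + t * direction)."""
    mean, direction = model
    diff = [s - m for s, m in zip(sample, mean)]
    score = sum(d * u for d, u in zip(diff, direction))
    residual = [d - score * u for d, u in zip(diff, direction)]
    return sum(x * x for x in residual) ** 0.5


def classify(sample, models):
    """Assign the sample to the class model with the smallest residual."""
    return min(models, key=lambda name: residual_distance(sample, models[name]))


models = {name: fit_class_model(rows) for name, rows in CALIBRATION.items()}
new_flower = (5.4, 3.9, 1.7, 0.4)  # sixth setosa row of the Iris data
print(classify(new_flower, models))
```

Because the rule looks only at the residual, a sample lying far out along a class's principal direction could still be claimed by that class, which is one reason production methods also check how far the scores fall from the calibration samples.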
As we still have overlap, we must be satisfied with incomplete separation. However, this reflects a true problem in the data collection (design) rather than any shortcomings in the software. There is actually a bit more to the analysis than this, but this will give the reader a flavor for the process. Suffice it to say an even finer split can be accomplished.
What I really enjoyed with this new version was all of the help in the tutorials (the addition of tips and rules of thumb is priceless to the novice) and the readily searchable help files. In addition, there are instructional webinars and instructor-led classes to assist in mastering the techniques. Little touches like the Beginner/Expert Slider in the DOE routines (which toggles the choice menu between model names and actual descriptions of the model) are really valuable additions.
Anyone seriously interested also should obtain a copy of Kim Esbensen’s excellent text on multivariate data analysis which is available from the Camo Web site.
Other than some nomenclature problems and a few “undocumented features” in the DOE, there is little wrong with the software, and it has very much to recommend it.
The UnscramblerX is not the original multivariate statistics/DOE program that only assists chemical engineers with spectroscopy and chromatography chemometrics. It has broadened into an (almost) intelligent engine that borders upon data mining as it cuts through prediction/classification problems. (Did I mention you can drag and drop data into it directly from an Excel spreadsheet?)
• $7,000 single user, industrial
• $2,500 single user, academic
One Woodbridge Center, Suite 319
Woodbridge, NJ 07095
732-726-9200; Fax: 973-556-1229
John Wass is a statistician based in Chicago, IL. He may be reached at editor@ScientificComputing.com.