Classification and Regression Trees offer a handy aid for genomics and drug discovery
CART (Classification and Regression Trees) is a modern (c. 1984) flavor of data mining that employs decision trees and can be used for a variety of business and scientific applications. Its advantages include quick insight into database patterns and significant relationships using simple tools such as graphs, charts and reports. Both novice and advanced modes of menu-driven operation are available to customize the program to the level of user comfort. For the most demanding users, a command line is available. The program can easily handle very large (gigabyte) data sets, and will do so rapidly on today’s faster industrial servers. The theory behind the decision trees is relatively straightforward in terms of illuminating what the model is doing, and the underlying routines have been well-tested under a variety of conditions. This is important to most users, as the literature daily sports rare and exotic models that rely on obscure mathematics and unproven algorithms.

Figure 1: Main window with data set
Recently, this software has found use in genomics and drug discovery, areas of great interest to the pharmaceutical industry. Numerous presentations and papers document predictive models that can separate drug-like from non-drug-like molecules, classify mechanisms of action, and model pharmacokinetics and competitive-inhibition behavior. Basically, we would like to examine large numbers of predictor variables and differentiate groups of outcomes (i.e., class x from class y compounds, where x and y can involve anything from simple chemical binding to carcinogenesis). CART can be used to predict the behavior of a variety of compounds under many physicochemical circumstances and to rapidly display simple graphics detailing the separations.
Figure 2: The model navigator
On the genomics side, results have been documented concerning correlations between the physical characteristics of proteins and their advancement through a structural genomics pipeline, prediction of clinical efficacy by classification of genomic expression profiles (via microarrays), and selection of maximally predictive/minimally redundant gene subsets from microarrays.
Enough of this editor’s bias, for now…. Let’s take a peek at the system requirements, operating characteristics and new features in the latest version, prior to running through an actual example.
CART5, the present version, runs on Windows 95/98/Me/NT/2000/XP as well as UNIX and Linux and is extremely CPU intensive. Recommended system requirements include at least a 2 GHz Pentium 4 Processor, 512 MB RAM (dependent on the CART database size purchased), and 2 GB of additional hard drive space for virtual memory and temporary files.
The data translation engine in this software can handle conversions for 80 file formats, including popular statistics programs like SAS and SPSS, and standard spreadsheets such as EXCEL and Lotus 1-2-3. Any CART model can be deployed easily when translated into one of the supported languages. The decision logic used in these models is automatically implemented, and the resulting source code is quickly dropped into external applications.

Figure 3: Main tree split variables
Some of the new features in the latest version are listed in Table 1. Now, before we actually tackle the example, let’s take a look at what a decision tree actually does.
As will be apparent from the diagrams above, these trees are actually flow charts that represent the logic and sequence of splitting the data according to rules partially specified by the user and mainly determined by the software algorithm as it attempts to classify and predict. The tree is built from a series of simple questions that determine the path or branch that a particular data point or record takes. If the right questions are asked, each record is filtered to its ultimate, correct classification, which occurs at the end of the path.
The classes themselves are usually categorical, but need not be. The tree-like visual diagram that results from these techniques makes the discovery of patterns and correlations a bit easier, as the user is not required to scan large tables but, rather, a simple chart. CART employs binary splits exclusively to grow the tree through simple yes/no type decisions.

Figure 4: Details of the main tree
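To illustrate these binary yes/no splits, here is a minimal sketch using scikit-learn, whose DecisionTreeClassifier implements an optimized CART variant. This is an outside, open-source example on a standard data set, not the reviewed Salford product:

```python
# Minimal sketch of CART-style binary splitting, using scikit-learn's
# DecisionTreeClassifier (an optimized CART variant) on the iris data set.
# Illustrative only -- not the reviewed Salford software.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Every internal node asks one yes/no question (feature <= threshold),
# so each record follows a single path down to a terminal-node class.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))    # text flow chart of the binary splits
print(tree.predict(X[:1]))  # predicted class for the first record
```

The printed text tree is exactly the flow-chart logic described above: each split is a single inequality, and a record's answers trace one branch to a leaf.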
The fine details of the splits, the weighting and pruning schemes, as well as the validation and missing data handling, constitute both the strengths and weaknesses of these types of approaches, and each school has its adherents. It is actually up to the analyst to wring out every extra drop of accuracy. As a result, there are almost as many ways to do this as there are analysts.
Now, for an example… Data is easily imported into CART with the usual File/Open/Datafile commands, all easily accessible from its Windows menu (Figure 1). I used an EXCEL .csv file and examples from the program (also .csv). As long as it was a flat file (headers in the first row, all others data with the variables as columns and records as rows), the import worked just fine.
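The flat-file layout the importer expects (headers in the first row, variables as columns, records as rows) is easy to produce and verify in code. A hedged sketch using pandas, with a hypothetical file name and columns:

```python
# Sketch of the "flat file" shape CART's import expects: a header row,
# then data rows, with variables as columns and records as rows.
# The file name and column names are hypothetical examples.
import pandas as pd

pd.DataFrame(
    {"target": [1, 0], "x1": [0.5, 1.5], "x2": [2.0, 3.0]}
).to_csv("compounds.csv", index=False)

df = pd.read_csv("compounds.csv")  # header row becomes the variable names
print(df.shape)                    # (records, variables)
print(df.columns.tolist())         # target and predictor variable names
```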
Once the data set is selected, the model type is specified (almost to the nth degree) with the Model Setup button, which is an icon on the menu bar. This is the most important step, as it is here that the general and specific parameters are specified along with the fine tuning. Such items as penalties, prior probabilities, costs, splits and tree parameters are chosen here, amongst many others.
Figure 5: Tree summary reports
For this example (apparently an attempt at market segmentation), I went with the default parameters, only later changing settings and sometimes seeing dramatic changes in the outputs. The only real choices that were made consisted of specifying the target and predictor variables. Again, this is very easily done with the menu-driven elements. From there, it’s just a matter of hitting the Start button and the analysis will then appear.
The time of this analysis will, of course, depend upon the size of the data set and complexity of the model specified. The analysis appears in a Navigator window that displays the tree topology, relative cost versus number of nodes graph (with the optimal number of nodes per cost highlighted automatically), and a few model statistics (Figure 2).
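The relative-cost-versus-number-of-nodes curve reflects the cost-complexity trade-off at the heart of CART pruning. As an analogy (my assumption about the underlying mechanism, not Salford's exact implementation), scikit-learn exposes the same trade-off directly:

```python
# Sketch of the cost-vs-tree-size trade-off behind CART pruning, using
# scikit-learn's cost-complexity pruning path as a stand-in for the
# relative-cost curve shown in the Navigator window.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Each larger alpha prunes the tree further; total leaf impurity rises
# as nodes are removed -- the curve from which an optimum is chosen.
for alpha, impurity in zip(path.ccp_alphas[:5], path.impurities[:5]):
    print(f"alpha={alpha:.4f}  total impurity={impurity:.4f}")
```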
By clicking on the various buttons or right-clicking in bare areas of the map, a variety of useful data is immediately accessible. The most significant may be found under the ‘Displays and Reports’ section. With a single push of the button, the details of how the splits were made and how the nodes apportion the classes are immediately available (Figures 3 and 4).
The numeric details of the analysis are all readily accessible with the ‘Summary Reports’ button. Here, details of splitting improvements, node populations, variable importance, misclassification and prediction success are found (Figure 5). Lest this look a bit too simple, remember that the details of the splitting, the automatic processing of missing values and cross validation are but a few of the inputs that require close operator attention.
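A variable-importance report of the kind found in the Summary Reports can be sketched with scikit-learn's impurity-based importances, which are an analogous, not identical, measure to CART's own importance scores:

```python
# Sketch of a variable-importance report: scikit-learn's impurity-based
# feature importances stand in for the importance scores in CART's
# Summary Reports (an analogous measure, not Salford's exact formula).
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

data = load_wine()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(
    data.data, data.target
)

# Rank predictors by their contribution to the splits.
ranked = sorted(zip(data.feature_names, tree.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked[:5]:
    print(f"{name:20s} {score:.3f}")
```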
The online PDF manual that comes with the software provides excellent and straightforward explanations as to how the diagnostics may best be used to interpret the results and discover weak areas in the analysis. While performing a classification on computational chemistry data, I noticed several other very useful properties of the software, including the automatic generation of exceedingly valuable warnings when the data structure was invalid for the type of calculation desired. For example, in that data set, three classes of enzymes were to be used, and I had initially failed to notice that one class was extremely rare. As a result, representatives of that class were never chosen for any of the cross-validation data sets, which would obviously bias any subsequent calculation. When I requested a scoring procedure, a dialog box appeared to inform me of the improper construction of this set.

Table 1: New Features in CART5
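The rare-class pitfall described here is easy to reproduce and guard against in any toolkit. A hedged sketch using scikit-learn on synthetic data (not the reviewer's enzyme set): plain K-fold splitting can leave a very rare class out of a training fold entirely, while stratified splitting preserves class proportions.

```python
# Sketch of the rare-class cross-validation pitfall: with plain K-fold,
# a very rare class can vanish from a training fold; stratified K-fold
# preserves class proportions. The data here are synthetic.
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 185 records; class 2 is rare (5 of 185) and sits at the end of the
# file, as rare categories often do after sorting.
y = np.array([0] * 90 + [1] * 90 + [2] * 5)
X = y.reshape(-1, 1).astype(float)

def train_folds_missing_a_class(splitter):
    return sum(len(set(y[train])) < 3 for train, _ in splitter.split(X, y))

print(train_folds_missing_a_class(KFold(n_splits=5)))            # 1: one training fold never sees class 2
print(train_folds_missing_a_class(StratifiedKFold(n_splits=5)))  # 0: proportions preserved
```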
The very few omissions were not that critical, and would be more important to the statistician than to most end users. Minor annoyances included the omission of several important statistics in the ‘Brief’ tab of the summary reports, the omission of the median and quantile calculations from the ‘Full’ tab in many instances where they could have appeared, and the blank screen that appears when calculations start, with no immediate indication that anything is actually being processed. It was also almost amusing to learn that, if Norton Utilities Speed Disk is running in the background during setup of this software, the license files become misplaced.
Although I feel that the company makes it far too challenging to download the demos, I would encourage interested readers to try, as the results of even one afternoon with the software are rather impressive. Although it might take the average user one to two months to become really proficient with this program, a working knowledge comes a lot quicker. My modest data set of 300 variables by 1500 records was easily analyzed in less than a minute on a 2.8 GHz PC. Couple this with the intuitive, easy-to-use interface, and it becomes a powerful and exceedingly useful tool. Though the program is rather pricey by my standards, some time with the demos is highly recommended.
4740 Murphy Canyon Road #200
San Diego, California 92123
John Wass is a statistician based in Chicago, IL. He may be reached at editor@ScientificComputing.com.