This analysis add-on is simplicity itself
Figure 1: NeuralTools allows users to train, test and predict.
NeuralTools Professional is a neural network add-on for Excel. As such, we would expect users to be familiar with the Excel environment and, therefore, skilled in its various functions. The developers are betting that a more comfortable and friendly environment will induce more potential users to try the program. From there, they can count on the many user-friendly features to ensure a loyal following. At least with this reviewer, they have partially succeeded. Neural networks have been around for quite a while, and I can remember reviewing software of this type 10 years ago. This makes them bedrock by data mining timelines, and they were the first and one of the most frequently used types of algorithms in this area. Basically, the user has many samples and prediction variables, and wishes to either predict new sample characteristics based upon existing data, or wants to classify a sample into a particular group, based on group characteristics evidenced in previous data sets.
The neural network itself is nothing more than an in silico attempt to mimic the actions of the neurons in the human brain. Without adding in the complexity of numerous interconnections and feedback loops, the simplest networks consist of a few nodes with several interconnections to a second layer that will mathematically weight and re-weight numeric coefficients in an attempt to approach an acceptable estimate of a known quantity. This dependent variable can be numeric or alphabetic and, if numeric, either continuous or categorical. Numerous variations have multiplied since the earliest networks appeared in commercial software, but the present product appears to be the original multi-layer perceptron.
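For readers who want to see the "weight and re-weight" idea in concrete terms, here is a minimal Python sketch of a one-hidden-layer perceptron and a single correction of its output weights. It is not how NeuralTools is implemented; the weights, the sample and the function names are made up for illustration, and only the output layer is adjusted to keep the example short.

```python
import math

def forward(x, w_hidden, w_out):
    """Forward pass through a one-hidden-layer perceptron with a linear output."""
    # Each hidden node passes a weighted sum of the inputs (plus a bias) through tanh.
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(ws[:-1], x)) + ws[-1])
        for ws in w_hidden
    ]
    hidden.append(1.0)  # constant bias input for the output node
    prediction = sum(w * h for w, h in zip(w_out, hidden))
    return prediction, hidden

def reweight_output(x, target, w_hidden, w_out, rate=0.1):
    """One gradient-descent step on the output weights for squared error."""
    prediction, hidden = forward(x, w_hidden, w_out)
    error = prediction - target
    return [w - rate * error * h for w, h in zip(w_out, hidden)]

# Hypothetical weights and a single made-up training sample.
w_hidden = [[0.5, -0.4, 0.1], [-0.3, 0.8, 0.0]]  # two hidden nodes; last entry is the bias
w_out = [0.2, -0.1, 0.05]
x, target = [1.0, 2.0], 1.0

before, _ = forward(x, w_hidden, w_out)
w_out_new = reweight_output(x, target, w_hidden, w_out)
after, _ = forward(x, w_hidden, w_out_new)
```

After the single update, `after` lies closer to `target` than `before` did. A real training run repeats this over many samples and also adjusts the hidden-layer weights.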
The prediction/classification problems mentioned above have classically been approached by breaking the known data into two sets: the net is trained with the larger set and then tested on the remaining, previously unused, data. Finally, the net is validated (or cross-validated depending upon your preferred terminology) with new data, where the values of the dependent variable are known, prior to being turned loose on the mountains of new data where the values are not known.
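The train/test partition described above can be sketched in a few lines of Python. This is a generic illustration, not the product's own procedure; the function name `holdout_split`, the 20 percent test fraction and the stand-in data are assumptions.

```python
import random

def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and set aside a fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(22))  # stand-in for 22 samples with known outcomes
train, test = holdout_split(rows)
```

The net is fitted on `train` only; `test` stays untouched until the fit is finished, so that its error gives an honest estimate of performance on unseen data.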
Statisticians have historically been somewhat wary of these tools, as they are essentially ‘black boxes’ where the exact methodology and sets of solutions are unknown, yet an ‘answer’ is produced. To the banker or bureaucrat, this may be sufficient, but to the scientist, these answers often do little to further understanding of the underlying mechanisms being studied. For this reason, neural networks find much greater usage in the applied fields than they do in the basic sciences.
Now let’s examine this offering in a bit more detail so the readers can make up their own minds as to the utility of this specific example.
It might be supposed that any software purporting to have ‘artificial intelligence’ would be a quick and easy load. Here NeuralTools measures up. The software is a 32-bit program for Windows 2000 and higher, requiring 32 MB of memory and at least Excel 2000 on a Pentium or faster machine.
In 32 seconds, the program loaded from a CD and flawlessly integrated with my Excel spreadsheet. My only gripe is that the program itself must be specifically booted to get the neural tool addition. When Excel is booted from my main screen, either from the taskbar or shortcut, it will not include the NeuralTools capabilities.
The software presently comes in two flavors, Standard and Professional. The greatest difference is in file size, as the former will take only 1,000 cases, while the latter will work with up to 16,777,216 cases. This review was done with the Professional version.
Presumably, a print version of the 100-page manual is available, but I found it simple enough to print any sections that were really needed. The PDF version of the online manual is fairly simple to search and to read and, although the index is quite sparse, the table of contents contains enough detail to rapidly direct the reader to the proper sections.
Figure 2: Neural Network Manager organizes trained neural nets.
Help for the novice consists of the above User’s Guide, online help, a multimedia tutorial, an FAQ section at the Website and a technical help desk. From a quick perusal of the manual and several hours spent with the software, it appears that, unlike with most software, the bulk of the novice’s questions will deal with interpreting answers and understanding models rather than the menu-driven sequences necessary to complete an analysis.
Now to the details….
The program is simplicity itself and the learning curve extremely gentle. It appears that a lot was done to simplify actions, guess at user needs and integrate intuitive actions whenever possible. This is a fairly easy program to master in about an hour, and time spent after that will enable the user to integrate Excel functions with NeuralTools to extend an analysis or customize a report.
There are four steps to complete an analysis: define and format the data set with the Data Set Manager, select and train an appropriate net, validate the performance on a hold-out set and, finally, predict values using a trained network. To test the nets, I used several genomics data sets that were previously analyzed in other software and biologically characterized by staff scientists. Therefore, the benchmarks to test the accuracy of both prediction and classification, on a limited scale, were readily available. Although most of the examples given with the software are from the economic realm (loans/credit/stocks) or zoology (abalone age), data sets from a variety of disciplines are easily handled. This is expected, as we have a tool that is more concerned with approximating an answer than the type of data used.
To format the data takes mere seconds, as the Data Set Manager will recognize the most common aspects of the Excel format and allow specification of the variable type and header row. Basically, it asks for a range that contains the data and whether headers are present. A simple drop-down allows defining the type of variable (dependent or independent, numeric or categorical). You can use many of the Excel functions to format the data and this software allows tagging for status (train, test or predict) to allow customization of the data partitions. Although it is possible to use 16,777,216 data points per variable (in the Professional edition), the actual data capacity may be less depending upon your system configuration and the inherent memory limitations of Excel itself.
In all instances, the software correctly classified a sample or predicted a biochemical activity (usually in a yes/no fashion). There were also several tools and helps that made this a very valuable program in terms of ease-of-use and flexible functionality. We will briefly take a look at several of the most useful.
Figure 3: Customized NeuralTools charts can be taken to other applications.
The training dialog box allows for automatic testing on a random percentage of the data specified by the user, automatic prediction of any missing dependent variable, live real-time updating of predicted values based upon changes in the data, and a variable impact Pareto chart that indicates relative importance of the independent variables to the model (expressed as sensitivity of the predictions to changes in the independent variable).
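One generic way to express variable impact as sensitivity is to nudge each independent variable in turn and record how far the prediction moves. The Python sketch below does this. It is an assumption about the general technique, not a description of how NeuralTools computes its chart, and `toy_model` is a hypothetical stand-in for a trained net.

```python
def variable_impact(predict, rows, delta=0.01):
    """Share of total prediction sensitivity attributable to each input."""
    n_vars = len(rows[0])
    totals = [0.0] * n_vars
    for row in rows:
        base = predict(row)
        for j in range(n_vars):
            nudged = list(row)
            nudged[j] += delta  # perturb one variable, hold the rest fixed
            totals[j] += abs(predict(nudged) - base)
    grand = sum(totals)
    return [t / grand if grand else 0.0 for t in totals]

# Hypothetical stand-in for a trained net; the third input is ignored.
def toy_model(row):
    return 3.0 * row[0] + 1.0 * row[1]

rows = [[0.0, 1.0, 2.0], [1.0, 0.5, -1.0], [2.0, 2.0, 0.0]]
impact = variable_impact(toy_model, rows)
```

Sorting `impact` in descending order gives the bar heights for a Pareto-style chart. Here the first variable accounts for roughly three quarters of the sensitivity and the unused third variable for none.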
Although specific net types and number of nodes may be specified, a Best Net Search option allows several net types to be run simultaneously on the data and the results of each displayed with the resultant error, therefore saving time by not having to run each net manually.
The only item that caused some consternation was the Run Time tab. When times such as two hours and 60 minutes appear under ‘Train no longer than’ and ‘Stop when changes in error less than 1% in …minutes’, respectively (these are the defaults), I begin to think that this will be a long process. However, with my 22 x 32 data sets, the process was completed in seconds. With sets in the 20 x 5,000 range, it took only a few seconds longer. The train/test/predict may also be done in a single operation if desired. A training Preview Dialog offers a helpful summary of the network training and any errors detected in the data. My analyses did not display the training progress graphics, presumably due to the speed of the result.
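The two Run Time settings amount to a pair of stopping rules: a hard time limit and a "no meaningful progress" check. A rough Python sketch of that logic follows. The function name `should_stop` and the reading of "changes in error less than 1%" as a relative change over the window are my assumptions, not documented behavior of the program.

```python
def should_stop(history, max_seconds=2 * 3600, window_seconds=60 * 60, min_change=0.01):
    """Decide whether training should halt.

    history is a list of (elapsed_seconds, error) pairs, oldest first.
    """
    if not history:
        return False
    now, current = history[-1]
    if now >= max_seconds:
        return True  # hard time limit reached
    # Most recent error recorded at least one full window ago.
    earlier = [err for t, err in history if t <= now - window_seconds]
    if not earlier:
        return False  # not enough history to judge progress yet
    reference = earlier[-1]
    if reference == 0:
        return True  # error was already zero; nothing left to improve
    return abs(reference - current) / abs(reference) < min_change
```

With small data sets the net converges long before either limit matters, which fits the run times of a few seconds reported above.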
The training output includes a table of net information and data set statistics, as well as a histogram of the training residuals, an accuracy figure (percent of bad predictions), root mean square error, mean absolute error, and scatter plots of actual versus predicted errors. The type of diagnostic presented is dependent upon the data type (numeric or categorical).
The testing output includes a report that shows the data and how well the trained net predicted individual output values. Predictions are flagged as good or bad depending upon the tolerance levels preset by the analyst. Several utilities are included to address managing trained networks and missing data.
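The numeric diagnostics named above are simple to compute by hand, which is a useful check on any report. The Python sketch below uses the standard definitions of root mean square error and mean absolute error. The good/bad rule (absolute residual within a fixed tolerance) and the sample numbers are assumptions for illustration, not the program's documented rule.

```python
import math

def prediction_report(actual, predicted, tolerance):
    """Summarize prediction quality and flag each prediction as good or bad."""
    residuals = [p - a for a, p in zip(actual, predicted)]
    n = len(residuals)
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    mae = sum(abs(r) for r in residuals) / n
    flags = ["good" if abs(r) <= tolerance else "bad" for r in residuals]
    pct_bad = 100.0 * flags.count("bad") / n
    return {"rmse": rmse, "mae": mae, "flags": flags, "pct_bad": pct_bad}

# Made-up values for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.5, 3.0, 2.0]
report = prediction_report(actual, predicted, tolerance=1.0)
```

In this example only the last prediction misses by more than the tolerance, so one of four predictions is flagged as bad.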
The manual includes a reference section on neural networks, giving a brief history and architecture and including advantages and disadvantages of each of the nets included in the package.
Palisade, the software’s developer, is known for providing tools to assess risk and assist decision making. This boils down to statistics and data mining, of which neural networks are one of the more established techniques. Although limited in the type and complexity of the nets supplied, the functionality is such as to extend its use into a number of fields and disciplines. Long avoided by many analysts due to the ‘black box’ nature of their workings, neural nets have found a home in areas where accurate predictions are of the greatest importance, while determining the underlying mechanisms of how the biological or physical process actually operates is of lesser consequence.
Readers wanting to test whether the software is useful to their particular needs are encouraged to download a trial version.
• Professional 1.0: $795
• Standard: $495
798 Cascadilla Street
Ithaca, NY 14850
1-607-277-8000; Fax: 1-607-277-8001
John Wass is a statistician with GPRD Pharmacogenetics, Abbott Laboratories. He can be reached at editor@ScientificComputing.com.