This freeware package is far more than a glorified neural network system
I had originally intended to subtitle this review “Data Mining on the Cheap,” since Weka is freeware. However, I did not want to bias attitudes against free software that is very powerful in its own right, and I plan to review more freeware, so a more dignified title was in order. Future columns will include a short review of the WinBUGS Bayesian analysis program and a longer review of the ‘R’ language.
Weka is a product of the Computer Science Department at the University of Waikato in New Zealand. The little bird on the GUI Chooser (a weka, which looks something like a kiwi) should be a tip-off, I guess. Anyway, this freeware is useful for mining underlying patterns in large data sets, but it is far more than a glorified neural network system. It handles a variety of regression, clustering and classification problems, and offers many statistical analyses to help discover and understand patterns in large (or smaller) data sets. It contains several familiar neural networks as well as others developed specifically for the program. To make it even more useful, there are extensive graphics capabilities, so that one may visualize patterns and results and let them suggest further lines of analysis.
As this program requires Java to run, the Web site tells us: Java 1.4 (or later) is required to run Weka 3.4.x and older versions. Depending on your computing platform you may have to download and install it separately. It is available for free from Sun. As of February 22, 2006, the developer branch (“HEAD”) from the CVS needs Java 5.0 to compile and run. The developer versions, starting with 3.5.3, also require Java 5.0.
Weka may be run on Windows, Linux or a variety of other platforms; I used version 3.4.6 on Windows. There are stable releases, called “book versions” (the 3.4 series), and “developer versions” (the 3.5 series) for the more adventurous. Either may be quickly downloaded and installed. Once Weka is set up, there are numerous online help files and documentation, but the best introduction is the data mining text by Witten and Frank.¹ There is also a very active users group that can be accessed online at list.scms.waikato.ac.nz/mailman/listinfo/wekalist.
When Weka is first launched, the GUI Chooser appears. This allows immediate access to the four main platform areas. The first is the Simple CLI, Weka’s command line interface. From that area, those proficient in Java may invoke all of the major functions. For the squeamish, there is the Explorer, where data may be clustered, classified, associated and visualized with a number of algorithms by menu-driven commands. The Experiment Environment allows for regression and classification using still other sets of algorithms, and the Knowledge Flow area is perhaps the most comprehensive in its selection of tools and analytics. In this area, the user can import and save data sets, apply supervised and unsupervised filters, apply a wide variety of Bayes and frequentist algorithms, cluster the data, train and test subsets of the data, and invoke a variety of plot types to visualize data and results. The Knowledge Flow area also has a brief help section to orient the new user as to program needs and capabilities. Parts of this area are ‘works in progress,’ so there may not be full functionality in all sections.
The interested reader is encouraged to step through the classic Fisher’s Iris problem (one of the data sets included with the software) for a simple introduction to its use.
I found Weka a bit on the finicky side with regard to data formats, but once the data are correctly imported, the analytic steps are far easier and the work goes smoothly. I would wish for more online explanations and help for statistical newcomers, but the developers usually refer users to the Witten and Frank text for these questions.¹ The users group is mostly occupied with programming questions, so for those who wish to push the limits of this software, there is always someone available to answer questions.
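For readers wondering what a correctly formatted input looks like, here is a minimal sketch, assuming the data are headed for ARFF, the plain-text format that Weka reads natively (a header of @relation and @attribute declarations followed by an @data section). The class name ArffSketch, the method toArff and the three sample rows are hypothetical and chosen for illustration only. The code uses only the Java standard library, not Weka's own classes.

```java
// Minimal sketch: build ARFF text from in-memory rows with plain Java.
// ArffSketch and toArff are hypothetical names, not part of the Weka API.
class ArffSketch {

    // numericAttrs: names of the numeric columns, in row order.
    // classLabels:  allowed values of the final, nominal class column.
    // rows:         each row holds the numeric values followed by one class label.
    static String toArff(String relation, String[] numericAttrs,
                         String[] classLabels, String[][] rows) {
        StringBuilder sb = new StringBuilder();

        // Header: relation name, then one declaration per column.
        sb.append("@relation ").append(relation).append("\n\n");
        for (String name : numericAttrs) {
            sb.append("@attribute ").append(name).append(" numeric\n");
        }
        sb.append("@attribute class {")
          .append(String.join(",", classLabels))
          .append("}\n\n");

        // Body: one comma-separated line per instance.
        sb.append("@data\n");
        for (String[] row : rows) {
            // A row with the wrong number of values is a common import failure,
            // so reject it here rather than write a malformed file.
            if (row.length != numericAttrs.length + 1) {
                throw new IllegalArgumentException(
                    "expected " + (numericAttrs.length + 1)
                    + " values, got " + row.length);
            }
            sb.append(String.join(",", row)).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] attrs = {"sepallength", "sepalwidth", "petallength", "petalwidth"};
        String[] labels = {"Iris-setosa", "Iris-versicolor", "Iris-virginica"};
        String[][] rows = {
            {"5.1", "3.5", "1.4", "0.2", "Iris-setosa"},
            {"7.0", "3.2", "4.7", "1.4", "Iris-versicolor"},
            {"6.3", "3.3", "6.0", "2.5", "Iris-virginica"}
        };
        System.out.print(toArff("iris", attrs, labels, rows));
    }
}
```

Redirecting the output of main to a file with an .arff extension should give something the Explorer can open. The sketch does not quote values, so attribute names or labels containing spaces or commas would need extra handling.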
Recently, it was announced that Pentaho, an open source business intelligence organization, acquired the Weka project. What this means for future access and development was not clear from the information provided at its Web site, but it seems that, for the immediate future, nothing much has changed. Weka is a freeware (read: much prized by academe) platform that commands respect in several areas and may become a very powerful alternative for a variety of scientific and business applications. Judging from the number of students using the site, it is being used in many learned theses, and thus may one day be a standard tool in the data-mining arena.
Availability
• Free
University of Waikato
Department of Computer Science
Private Bag 3105
Hamilton
NEW ZEALAND
+64 7 838 4021 Fax: +64 7 858 5095
[email protected]; www.cs.waikato.ac.nz/ml/weka
Reference
1. Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition, Morgan Kaufmann, San Francisco, 2005, ISBN 0120884070.
John Wass is a statistician with GPRD Pharmacogenetics, Abbott Laboratories. He may be reached at [email protected].