Image and 3-D plot of Maunga-whau (Mount Eden), a 196-meter volcano in the Auckland, New Zealand, volcanic field. Image courtesy of R Foundation (2)
As a statistician, your contributing editor can personally vouch for the joys of data analysis, but it takes a lot to make him say that programming can ever be construed as anything other than sheer agony. Here is a small exception. For many years, R has been gradually taking its place alongside the venerable pillars of academic statistics such as SAS, S-Plus and Minitab. The primary reason may be price (it’s free) but, given the ease of learning the code and of obtaining many well-written (and sometimes even well-documented) routines, this language is well worth a try. For the faint of heart, there is even a Windows GUI to ease the pain.
The R language was developed specifically as a programming environment for statistics and graphics. It is a free software project, similar to the S language developed at Bell Laboratories. Although there are differences, unaltered S code will run, in many instances, under R. According to the Web site (www.r-project.org), a variety of highly extensible statistical and graphical techniques are available, and among the strongest features are the presentation-quality graphics, which are under full user control. The software runs on a wide variety of UNIX platforms, Linux, Windows and MacOS. R is a fully integrated suite of routines for data manipulation, calculation and graphics. Users may add routines and define these as new functions. For jobs requiring intensive computation, R can link to C, C++ and Fortran code that is then called at run time. Eight packages are available with the download, and many more are available at various Internet sites to extend the capabilities. New packages are constantly appearing, and user sites spring up to let novices communicate with more experienced users. Both the base system and the add-ons are distributed through the Comprehensive R Archive Network (CRAN), which may be accessed through the site referenced above.
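As a minimal sketch of what "adding a routine as a new function" can look like at the R prompt, the lines below define a small helper and call it. The function name coef_var and the sample values are hypothetical and chosen only for illustration; they do not come from the R distribution or from the handbook.

```r
# Hypothetical helper (not part of base R): coefficient of variation,
# i.e. the standard deviation divided by the mean.
coef_var <- function(x, na.rm = TRUE) {
  stats::sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

# Made-up sample values, used only to show the call.
heights <- c(150, 160, 170, 180, 190)
cv <- coef_var(heights)
print(cv)
```

Once defined, the function can be used in the rest of the session like any built-in routine.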
Although the new user can access the necessary help files and manuals through the site, I used the Everitt and Hothorn handbook (1) as a guide. Downloading the software is quite easy, and the Web site contains full instructions. The resulting GUI is perfectly ‘Windows-like’ and contains the standard menu bar and a specialized toolbar. This allows not only the expected copy/paste/save/print functions, but also very useful transfer tools for copying and pasting commands from sub-windows to the main window (called the Console), plus another for quickly switching between consoles. Commands are typed at the prompt, and output appears on the following lines.
Unfortunately, some of the more important and frequently performed operations that are easy to do in a menu-driven environment take a lot more care (and typing) in the programming environment. Although R can import data from a variety of sources, including SQL database engines, Excel spreadsheets and standard statistical programs such as SPSS, actually querying these sources is non-trivial and requires consultation with the ‘R Data Import/Export’ manual. Most spreadsheets may be accessed as .csv files, with care taken to specify which line holds the column headers and which column holds the row names. Once data is entered, however (and assuming that all necessary packages have been successfully loaded), the pain is greatly reduced.
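For the .csv route, a short sketch along the following lines shows the two settings that need care. The file contents and the names csv_path and dat are invented for illustration; read.csv and its header and row.names arguments are part of base R.

```r
# Write a tiny made-up CSV to a temporary file so the sketch is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,height,weight",
  "a,170,65",
  "b,182,80",
  "c,158,52"
), csv_path)

# header = TRUE : the first line of the file holds the column names.
# row.names = 1 : the first column holds the row labels, not data.
dat <- read.csv(csv_path, header = TRUE, row.names = 1)

# Inspect the structure of the resulting data frame.
str(dat)
```

With a real spreadsheet, the temporary file would be replaced by the path of the exported .csv file.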
As with the menu-driven programs, a great deal of the effort comes with the data manipulation steps prior to the actual analysis. Depending upon what needs to be done, this can be relatively straightforward (or not!). For the simple spreadsheet examples that I used, extracting subsets of data was relatively easy with just a few, almost intuitive, commands. Summary statistics and detailed analyses were performed with short code sequences, as in SAS. For example, a multiple linear regression with appropriate scatter plots was constructed with 15 lines of code, including model specification, design matrix, model fitting and plot instructions.
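The sequence below is not the reviewer's code; it is a rough sketch of the same kind of workflow, assuming the mtcars data set that ships with R stands in for the spreadsheet data. The choice of variables (mpg, wt, hp) and the object names are illustrative only.

```r
# mtcars is bundled with R; it stands in here for imported spreadsheet data.
data(mtcars)

# Extract a subset: cars with manual transmission (am == 1), three columns.
manual <- subset(mtcars, am == 1, select = c(mpg, wt, hp))

# Summary statistics for the subset.
summary(manual)

# Scatter plot matrix of the response and the two predictors.
pairs(mtcars[, c("mpg", "wt", "hp")])

# Model specification and fitting: multiple linear regression of
# fuel economy on weight and horsepower.
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)

# Design matrix used by the fit (intercept column plus the two predictors).
X <- model.matrix(fit)
head(X)

# Residuals against fitted values as a basic diagnostic plot.
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

When run as a script rather than at the console, the plots go to R's default graphics device instead of a window.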
As with SAS, I found that once the instructions for importing data were mastered and the relevant data structure formed, things went a lot faster. I still cannot quickly produce intricate and colorful graphics to match those that effortlessly spring from JMP and many graphics packages, but with practice and communication with the R community, things should quickly improve. As the price is unbeatable and the user community rapidly growing, I would highly recommend this package to the cash-challenged and adventurous.
Availability
R Foundation for Statistical Computing
c/o Department for Statistics and Mathematics
Wirtschaftsuniversität Wien, Augasse 2-6, 1090 Wien, Austria
+43 1 31336 4754, Fax: +43 1 31336 774
www.R-project.org
Resources
1. Everitt, Brian S. and Torsten Hothorn. A Handbook of Statistical Analysis Using R. Chapman & Hall/CRC, Boca Raton, FL (2006). ISBN: 1584885394.
2. R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria (2005). ISBN 3900051070, www.r-project.org.
John Wass is a statistician based in Chicago, IL. He may be reached at [email protected].